API Reference

Evaluation

Scorio.bayes - Function
bayes(R::AbstractMatrix{<:Integer}, w::AbstractVector{<:Real}, R0::Union{AbstractMatrix{<:Integer}, Nothing}=nothing) -> Tuple{Float64, Float64}

Performance evaluation using the Bayes@N framework.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2025). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Arguments

  • R::AbstractMatrix{<:Integer}: $M \times N$ integer matrix with entries in $\{0,\ldots,C\}$. Row $\alpha$ contains the $N$ outcomes for question $\alpha$.
  • w::AbstractVector{<:Real}: length-$(C+1)$ weight vector $(w_0,\ldots,w_C)$ that maps category $k$ to score $w_k$.
  • R0::Union{AbstractMatrix{<:Integer}, Nothing}: optional $M \times D$ integer matrix supplying $D$ prior outcomes per question. If omitted, $D=0$.

Returns

  • Tuple{Float64, Float64}: $(\mu, \sigma)$ performance metric estimate and its uncertainty.

Notation

$\delta_{a,b}$ is the Kronecker delta. For each row $\alpha$ and class $k \in \{0,\ldots,C\}$:

\[n_{\alpha k} = \sum_{i=1}^N \delta_{k, R_{\alpha i}} \quad \text{(counts in R)}\]

\[n^0_{\alpha k} = 1 + \sum_{i=1}^D \delta_{k, R^0_{\alpha i}} \quad \text{(Dirichlet(+1) prior)}\]

\[\nu_{\alpha k} = n_{\alpha k} + n^0_{\alpha k}\]

Effective sample size: $T = 1 + C + D + N$ (scalar)

Formula

\[\mu = w_0 + \frac{1}{M \cdot T} \sum_{\alpha=1}^M \sum_{j=0}^C \nu_{\alpha j} (w_j - w_0)\]

\[\sigma = \sqrt{ \frac{1}{M^2(T+1)} \sum_{\alpha=1}^M \left[ \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0)^2 - \left( \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0) \right)^2 \right] }\]
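
A minimal Julia sketch of how these formulas can be evaluated from the counts (the bayes_sketch name and loop structure are illustrative, not the package implementation; it assumes the classes are 0..C with C = length(w) - 1):

# Illustrative sketch of the Bayes@N formulas above; not the package implementation.
function bayes_sketch(R, w, R0=nothing)
    M, N = size(R)
    C = length(w) - 1                         # classes are 0..C
    D = R0 === nothing ? 0 : size(R0, 2)
    T = 1 + C + D + N                         # effective sample size
    mu_sum = 0.0
    var_sum = 0.0
    for a in 1:M
        nu = ones(C + 1)                      # Dirichlet(+1) prior: one pseudo-count per class
        for i in 1:N
            nu[R[a, i] + 1] += 1              # counts from R
        end
        if R0 !== nothing
            for i in 1:D
                nu[R0[a, i] + 1] += 1         # optional prior outcomes from R0
            end
        end
        p = nu ./ T                           # nu_{alpha,k} / T
        m1 = sum(p .* (w .- w[1]))            # row mean of (w_j - w_0)
        m2 = sum(p .* (w .- w[1]) .^ 2)       # row mean of (w_j - w_0)^2
        mu_sum += m1
        var_sum += m2 - m1^2
    end
    return w[1] + mu_sum / M, sqrt(var_sum / (M^2 * (T + 1)))
end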

Examples

R = [0 1 2 2 1;
     1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2;
      1 2]

# With prior (D=2 → T=10)
mu, sigma = bayes(R, w, R0)
# Expected: mu ≈ 0.575, sigma ≈ 0.084275

# Without prior (D=0 → T=8)
mu2, sigma2 = bayes(R, w)
# Expected: mu2 ≈ 0.5625, sigma2 ≈ 0.091998

Scorio.avg - Function
avg(R::AbstractArray{<:Real}) -> Float64

Simple average of all entries in R.

Computes the arithmetic mean of all entries in the result matrix.

Arguments

  • R::AbstractArray{<:Real}: $M \times N$ result matrix, typically with entries in $\{0, 1\}$. Row $\alpha$ contains the $N$ outcomes for question $\alpha$.

Returns

  • Float64: The arithmetic mean of all entries in R.

Notation

$R_{\alpha i}$ is the outcome for question $\alpha$ on trial $i$.

Formula

\[\text{avg} = \frac{1}{M \cdot N} \sum_{\alpha=1}^{M} \sum_{i=1}^{N} R_{\alpha i}\]
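
Because every entry carries equal weight, the computation reduces to a single Base Julia expression (shown here only to make the formula concrete):

sum(R) / length(R)   # equivalently: using Statistics; mean(R)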

Examples

R = [0 1 1 0 1;
     1 1 0 1 1]
avg(R)  # Returns 0.7

Scorio.pass_at_k - Function
pass_at_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64

Unbiased Pass@k estimator.

Computes the probability that at least one of k randomly selected samples is correct, averaged over all M questions.

References

Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374

Arguments

  • R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
  • k::Integer: Number of samples to select ($1 \le k \le N$).

Returns

  • Float64: The average Pass@k score across all M questions.

Notation

For each row $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

$C(a, b)$ denotes the binomial coefficient $\binom{a}{b}$.

Formula

\[\text{Pass@k}_\alpha = 1 - \frac{C(N - \nu_\alpha, k)}{C(N, k)}\]

\[\text{Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{Pass@k}_\alpha\]
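
Per row, the ratio of binomial coefficients can be evaluated with the numerically convenient product form from Chen et al. (2021). A minimal sketch (the pass_at_k_sketch name is illustrative, not the package implementation):

# Illustrative sketch of the Pass@k formula above.
function pass_at_k_sketch(R, k)
    M, N = size(R)
    total = 0.0
    for a in 1:M
        nu = sum(R[a, :])                     # number of correct trials in row a
        if N - nu < k
            total += 1.0                      # every size-k subset contains a success
        else
            # C(N - nu, k) / C(N, k) == prod over j = N-nu+1 .. N of (j - k) / j
            total += 1.0 - prod(1.0 .- k ./ ((N - nu + 1):N))
        end
    end
    return total / M
end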

Examples

R = [0 1 1 0 1;
     1 1 0 1 1]
pass_at_k(R, 1)  # Returns 0.7
pass_at_k(R, 2)  # Returns 0.95

Scorio.pass_hat_k - Function
pass_hat_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64

Pass^k (Pass-hat@k): probability that all k selected trials are correct.

Computes the probability that k randomly selected samples are ALL correct, averaged over all M questions. Also known as G-Pass@k.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045

Arguments

  • R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
  • k::Integer: Number of samples to select ($1 \le k \le N$).

Returns

  • Float64: The average Pass^k score across all M questions.

Notation

For each row $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

$C(a, b)$ denotes the binomial coefficient $\binom{a}{b}$.

Formula

\[\text{Pass}\hat{\text{@}}\text{k}_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]

\[\text{Pass}\hat{\text{@}}\text{k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{Pass}\hat{\text{@}}\text{k}_\alpha\]
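
A direct per-row evaluation with Base.binomial, as a sketch (the pass_hat_k_sketch name is illustrative, not the package implementation):

# Illustrative sketch of the Pass^k formula above.
function pass_hat_k_sketch(R, k)
    M, N = size(R)
    total = 0.0
    for a in 1:M
        nu = sum(R[a, :])                           # number of correct trials in row a
        total += binomial(nu, k) / binomial(N, k)   # binomial(nu, k) is 0 when nu < k
    end
    return total / M
end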

Examples

R = [0 1 1 0 1;
     1 1 0 1 1]
pass_hat_k(R, 1)  # Returns 0.7
pass_hat_k(R, 2)  # Returns 0.45

Scorio.g_pass_at_k - Function
g_pass_at_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64

Alias for pass_hat_k. See pass_hat_k for documentation.

This function is provided for compatibility with literature that uses the G-Pass@k naming convention.

Scorio.g_pass_at_k_tao - Function
g_pass_at_k_tao(R::AbstractMatrix{<:Integer}, k::Integer, tao::Real) -> Float64

G-Pass@k_τ: Generalized Pass@k with threshold τ.

Computes the probability that at least $\lceil \tau \cdot k \rceil$ of k randomly selected samples are correct, averaged over all M questions.

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Arguments

  • R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
  • k::Integer: Number of samples to select ($1 \le k \le N$).
  • tao::Real: Threshold parameter $\tau \in [0, 1]$. Requires at least $\lceil \tau \cdot k \rceil$ successes among the $k$ selected samples. For small $\tau > 0$ (at least one success required) this reduces to Pass@k; at $\tau = 1$ it equals Pass^k.

Returns

  • Float64: The average G-Pass@k_τ score across all M questions.

Notation

For each row $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

$C(a, b)$ denotes the binomial coefficient $\binom{a}{b}$.

$j_0 = \lceil \tau \cdot k \rceil$ is the minimum number of successes required.

Formula

\[\text{G-Pass@k}_{\tau, \alpha} = \sum_{j=j_0}^{k} \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]

\[\text{G-Pass@k}_\tau = \frac{1}{M} \sum_{\alpha=1}^{M} \text{G-Pass@k}_{\tau, \alpha}\]
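
A direct transcription of the hypergeometric tail sum above (the g_pass_at_k_tao_sketch name is illustrative, not the package implementation; it assumes tao > 0 so that at least one success is required):

# Illustrative sketch of the G-Pass@k_tau formula above.
function g_pass_at_k_tao_sketch(R, k, tao)
    M, N = size(R)
    j0 = ceil(Int, tao * k)                   # minimum number of successes required
    total = 0.0
    for a in 1:M
        nu = sum(R[a, :])                     # number of correct trials in row a
        for j in j0:k
            # hypergeometric pmf term; contributes 0 when j > nu or k - j > N - nu
            total += binomial(nu, j) * binomial(N - nu, k - j) / binomial(N, k)
        end
    end
    return total / M
end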

Examples

R = [0 1 1 0 1;
     1 1 0 1 1]
g_pass_at_k_tao(R, 2, 0.5)  # Returns ≈ 0.95
g_pass_at_k_tao(R, 2, 1.0)  # Returns ≈ 0.45

Scorio.mg_pass_at_k - Function
mg_pass_at_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64

mG-Pass@k: mean Generalized Pass@k.

Computes the mean of G-Pass@k_τ over the range τ ∈ [0.5, 1.0], inspired by the mean Average Precision (mAP) metric. This provides a comprehensive metric that integrates performance potential and stability across multiple thresholds.

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Arguments

  • R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
  • k::Integer: Number of samples to select ($1 \le k \le N$).

Returns

  • Float64: The average mG-Pass@k score across all M questions.

Notation

For each row $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

$m = \lceil k/2 \rceil$ is the majority threshold (the integration starts at $\tau = 0.5$).

The metric is defined as the integral of G-Pass@k_τ over τ ∈ [0.5, 1.0]:

\[\text{mG-Pass@k} = 2 \int_{0.5}^{1.0} \text{G-Pass@k}_\tau \, d\tau\]

Formula

The discrete approximation used in computation:

\[\text{mG-Pass@k}_\alpha = \frac{2}{k} \sum_{j=m+1}^{k} (j - m) \cdot P(X = j)\]

where $X \sim \text{Hypergeometric}(N, \nu_\alpha, k)$ and the probability mass function is:

\[P(X = j) = \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]

The final metric is averaged over all questions:

\[\text{mG-Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{mG-Pass@k}_\alpha\]
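
A direct transcription of the discrete form above (the mg_pass_at_k_sketch name is illustrative, not the package implementation):

# Illustrative sketch of the mG-Pass@k formula above.
function mg_pass_at_k_sketch(R, k)
    M, N = size(R)
    m = cld(k, 2)                             # ceil(k / 2), the tau = 0.5 threshold
    total = 0.0
    for a in 1:M
        nu = sum(R[a, :])                     # number of correct trials in row a
        s = 0.0
        for j in (m + 1):k
            pmf = binomial(nu, j) * binomial(N - nu, k - j) / binomial(N, k)
            s += (j - m) * pmf
        end
        total += 2s / k
    end
    return total / M
end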

Examples

R = [0 1 1 0 1;
     1 1 0 1 1]
mg_pass_at_k(R, 2)  # Returns ≈ 0.45
mg_pass_at_k(R, 3)  # Returns ≈ 0.166667

Ranking

Scorio.competition_ranks_from_scores - Function
competition_ranks_from_scores(scores_in_id_order::AbstractVector{<:Real}; tol::Real=1e-12) -> Vector{Int}

Compute competition ranks from scores.

Given L models with ids 1..L and their scores, returns standard competition ranks (e.g. 1, 2, 2, 4, 5). Models with tied scores (within tolerance) receive the same rank, and subsequent ranks are skipped accordingly.
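
One way such ranks can be produced with sortperm, as a sketch (the competition_ranks_sketch name is illustrative, not the package implementation):

# Illustrative sketch of competition ("1224") ranking with a tie tolerance.
function competition_ranks_sketch(scores; tol=1e-12)
    order = sortperm(scores; rev=true)        # model ids from best to worst score
    ranks = similar(order)
    for (pos, idx) in enumerate(order)
        if pos > 1 && abs(scores[idx] - scores[order[pos - 1]]) <= tol
            ranks[idx] = ranks[order[pos - 1]]    # tied with the previous model: same rank
        else
            ranks[idx] = pos                      # otherwise rank equals 1-based position
        end
    end
    return ranks
end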

Arguments

  • scores_in_id_order::AbstractVector{<:Real}: vector of scores aligned to ids 1..L
  • tol::Real=1e-12: tolerance for considering scores as tied

Returns

  • Vector{Int}: competition ranks for each model

Examples

scores = [0.95, 0.87, 0.87, 0.72, 0.65]
ranks = competition_ranks_from_scores(scores)
# Returns: [1, 2, 2, 4, 5]