API Reference
Evaluation
Scorio.bayes — Function
bayes(R::AbstractMatrix{<:Integer}, w::AbstractVector{<:Real}, R0::Union{AbstractMatrix{<:Integer}, Nothing}=nothing) -> Tuple{Float64, Float64}
Performance evaluation using the Bayes@N framework.
References
Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2025). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265
Arguments
R::AbstractMatrix{<:Integer}: $M \times N$ integer matrix with entries in $\{0,\ldots,C\}$. Row $\alpha$ holds the N outcomes for question $\alpha$.
w::AbstractVector{<:Real}: length $(C+1)$ weight vector $(w_0,\ldots,w_C)$ that maps category k to score $w_k$.
R0::Union{AbstractMatrix{<:Integer}, Nothing}: optional $M \times D$ integer matrix supplying D prior outcomes per row. If omitted, $D=0$.
Returns
Tuple{Float64, Float64}: $(\mu, \sigma)$, the performance metric estimate $\mu$ and its uncertainty $\sigma$.
Notation
$\delta_{a,b}$ is the Kronecker delta. For each row $\alpha$ and class $k \in \{0,\ldots,C\}$:
\[n_{\alpha k} = \sum_{i=1}^N \delta_{k, R_{\alpha i}} \quad \text{(counts in R)}\]
\[n^0_{\alpha k} = 1 + \sum_{i=1}^D \delta_{k, R^0_{\alpha i}} \quad \text{(Dirichlet(+1) prior)}\]
\[\nu_{\alpha k} = n_{\alpha k} + n^0_{\alpha k}\]
Effective sample size: $T = 1 + C + D + N$ (scalar). Each row's counts sum to it: $\sum_{k=0}^{C} \nu_{\alpha k} = T$.
Formula
\[\mu = w_0 + \frac{1}{M \cdot T} \sum_{\alpha=1}^M \sum_{j=0}^C \nu_{\alpha j} (w_j - w_0)\]
\[\sigma = \sqrt{ \frac{1}{M^2(T+1)} \sum_{\alpha=1}^M \left[ \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0)^2 - \left( \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0) \right)^2 \right] }\]
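The two formulas above can be traced step by step with a short sketch. It is an independent re-implementation for illustration, not Scorio's source: the name `bayes_sketch` is hypothetical, and it assumes every entry of `R` and `R0` lies in $\{0,\ldots,C\}$ without checking.

```julia
# Hypothetical re-implementation of the mu and sigma formulas; not the library code.
function bayes_sketch(R::AbstractMatrix{<:Integer}, w::AbstractVector{<:Real},
                      R0::Union{AbstractMatrix{<:Integer}, Nothing}=nothing)
    M, N = size(R)
    C = length(w) - 1
    D = R0 === nothing ? 0 : size(R0, 2)
    T = 1 + C + D + N                  # effective sample size
    dw = w .- w[1]                     # (w_j - w_0); w_0 is stored at index 1

    mean_sum = 0.0
    var_sum = 0.0
    for a in 1:M
        nu = ones(Float64, C + 1)      # the "+1" of the Dirichlet(+1) prior
        for r in view(R, a, :)
            nu[r + 1] += 1             # category k lives at index k + 1
        end
        if R0 !== nothing
            for r in view(R0, a, :)
                nu[r + 1] += 1         # prior outcomes add to the same counts
            end
        end
        p = nu ./ T                    # nu_{alpha j} / T
        m1 = sum(p .* dw)              # first moment of (w_j - w_0)
        m2 = sum(p .* dw .^ 2)         # second moment of (w_j - w_0)
        mean_sum += m1
        var_sum += m2 - m1^2
    end
    mu = w[1] + mean_sum / M
    sigma = sqrt(var_sum / (M^2 * (T + 1)))
    return mu, sigma
end

R = [0 1 2 2 1; 1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2; 1 2]
mu, sigma = bayes_sketch(R, w, R0)     # mu ≈ 0.575, sigma ≈ 0.084275
```

With the prior, the first row has $\nu = (3, 3, 4)$ and the second $\nu = (2, 4, 4)$, each summing to $T = 10$.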
Examples
R = [0 1 2 2 1;
1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2;
1 2]
# With prior (D=2 → T=10)
mu, sigma = bayes(R, w, R0)
# Expected: mu ≈ 0.575, sigma ≈ 0.084275
# Without prior (D=0 → T=8)
mu2, sigma2 = bayes(R, w)
# Expected: mu2 ≈ 0.5625, sigma2 ≈ 0.091998
Scorio.avg — Function
avg(R::AbstractArray{<:Real}) -> Float64
Simple average: computes the arithmetic mean of all entries in the result matrix R.
Arguments
R::AbstractArray{<:Real}: $M \times N$ result matrix with entries in $\{0, 1\}$. Row $\alpha$ holds the N outcomes for question $\alpha$.
Returns
Float64: The arithmetic mean of all entries in R.
Notation
$R_{\alpha i}$ is the outcome for question $\alpha$ on trial $i$.
Formula
\[\text{avg} = \frac{1}{M \cdot N} \sum_{\alpha=1}^{M} \sum_{i=1}^{N} R_{\alpha i}\]
Examples
R = [0 1 1 0 1;
1 1 0 1 1]
avg(R) # Returns 0.7
Scorio.pass_at_k — Function
pass_at_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64
Unbiased Pass@k estimator.
Computes the probability that at least one of k randomly selected samples is correct, averaged over all M questions.
References
Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374
Arguments
R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
k::Integer: Number of samples to select ($1 \le k \le N$).
Returns
Float64: The average Pass@k score across all M questions.
Notation
For each row $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]
$C(a, b)$ denotes the binomial coefficient $\binom{a}{b}$.
Formula
\[\text{Pass@k}_\alpha = 1 - \frac{C(N - \nu_\alpha, k)}{C(N, k)}\]
\[\text{Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{Pass@k}_\alpha\]
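A minimal sketch of the estimator above, written directly from the formula. The name `pass_at_k_sketch` is hypothetical and this is not the library's implementation. It relies on Base's `binomial`, which returns 0 when the lower argument exceeds the upper one, so rows with fewer than k failures score 1 automatically. With `Int` inputs, `binomial` throws an overflow error for large N; converting to `BigInt` is one way around that.

```julia
# Hypothetical sketch of the Pass@k formula; not the library code.
function pass_at_k_sketch(R::AbstractMatrix{<:Integer}, k::Integer)
    M, N = size(R)
    1 <= k <= N || throw(ArgumentError("k must satisfy 1 <= k <= N"))
    total = 0.0
    for a in 1:M
        nu = sum(view(R, a, :))        # number of correct samples in row a
        # 1 - (ways to draw k failures) / (ways to draw any k samples)
        total += 1 - binomial(N - nu, k) / binomial(N, k)
    end
    return total / M
end

R = [0 1 1 0 1; 1 1 0 1 1]
pass_at_k_sketch(R, 2)                 # rows give 0.9 and 1.0, mean ≈ 0.95
```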
Examples
R = [0 1 1 0 1;
1 1 0 1 1]
pass_at_k(R, 1) # Returns 0.7
pass_at_k(R, 2) # Returns 0.95
Scorio.pass_hat_k — Function
pass_hat_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64
Pass^k (Pass-hat@k): probability that all k selected trials are correct.
Computes the probability that k randomly selected samples are ALL correct, averaged over all M questions. Also known as G-Pass@k.
References
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045
Arguments
R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
k::Integer: Number of samples to select ($1 \le k \le N$).
Returns
Float64: The average Pass^k score across all M questions.
Notation
For each row $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]
$C(a, b)$ denotes the binomial coefficient $\binom{a}{b}$.
Formula
\[\text{Pass}\hat{\text{@}}\text{k}_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]
\[\text{Pass}\hat{\text{@}}\text{k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{Pass}\hat{\text{@}}\text{k}_\alpha\]
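The formula above counts the draws in which all k samples come from the $\nu_\alpha$ correct ones. A hedged sketch follows; `pass_hat_k_sketch` is a hypothetical name, not the library's function. Rows with fewer than k correct samples contribute 0, because `binomial(nu, k)` is 0 when `k > nu`.

```julia
# Hypothetical sketch of the Pass^k formula; not the library code.
function pass_hat_k_sketch(R::AbstractMatrix{<:Integer}, k::Integer)
    M, N = size(R)
    1 <= k <= N || throw(ArgumentError("k must satisfy 1 <= k <= N"))
    total = 0.0
    for a in 1:M
        nu = sum(view(R, a, :))        # number of correct samples in row a
        # (ways to draw k correct samples) / (ways to draw any k samples)
        total += binomial(nu, k) / binomial(N, k)
    end
    return total / M
end

R = [0 1 1 0 1; 1 1 0 1 1]
pass_hat_k_sketch(R, 2)                # rows give 0.3 and 0.6, mean ≈ 0.45
```

For k = 1 this coincides with Pass@k, since both reduce to the fraction of correct samples per row.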
Examples
R = [0 1 1 0 1;
1 1 0 1 1]
pass_hat_k(R, 1) # Returns 0.7
pass_hat_k(R, 2) # Returns 0.45
Scorio.g_pass_at_k — Function
g_pass_at_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64
Alias for pass_hat_k. See pass_hat_k for documentation.
This function is provided for compatibility with literature that uses the G-Pass@k naming convention.
Scorio.g_pass_at_k_tao — Function
g_pass_at_k_tao(R::AbstractMatrix{<:Integer}, k::Integer, tao::Real) -> Float64
G-Pass@k_τ: Generalized Pass@k with threshold τ.
Computes the probability that at least $\lceil \tau \cdot k \rceil$ of k randomly selected samples are correct, averaged over all M questions.
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
Arguments
R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
k::Integer: Number of samples to select ($1 \le k \le N$).
tao::Real: Threshold parameter $\tau \in [0, 1]$. Requires at least $\lceil \tau \cdot k \rceil$ successes. When $\tau = 0$, equivalent to Pass@k. When $\tau = 1$, equivalent to Pass^k.
Returns
Float64: The average G-Pass@k_τ score across all M questions.
Notation
For each row $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]
$C(a, b)$ denotes the binomial coefficient $\binom{a}{b}$.
$j_0 = \lceil \tau \cdot k \rceil$ is the minimum number of successes required.
Formula
\[\text{G-Pass@k}_{\tau, \alpha} = \sum_{j=j_0}^{k} \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]
\[\text{G-Pass@k}_\tau = \frac{1}{M} \sum_{\alpha=1}^{M} \text{G-Pass@k}_{\tau, \alpha}\]
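The hypergeometric tail sum above can be sketched as follows. `g_pass_at_k_tao_sketch` is a hypothetical name and the code follows the formula literally; it is not the library's implementation. One consequence of the literal reading: at $\tau = 0$ the sum starts at $j_0 = 0$ and equals 1 for every row, whereas the Arguments text describes $\tau = 0$ as equivalent to Pass@k (which corresponds to $j_0 = 1$). How the library handles that boundary is not stated here, so treat the $\tau = 0$ behavior of this sketch as an assumption.

```julia
# Hypothetical sketch of the G-Pass@k_tau formula; not the library code.
function g_pass_at_k_tao_sketch(R::AbstractMatrix{<:Integer}, k::Integer, tao::Real)
    M, N = size(R)
    1 <= k <= N || throw(ArgumentError("k must satisfy 1 <= k <= N"))
    0 <= tao <= 1 || throw(ArgumentError("tao must lie in [0, 1]"))
    j0 = ceil(Int, tao * k)            # minimum number of successes required
    total = 0.0
    for a in 1:M
        nu = sum(view(R, a, :))        # number of correct samples in row a
        # number of k-subsets containing at least j0 correct samples
        hits = sum(binomial(nu, j) * binomial(N - nu, k - j) for j in j0:k)
        total += hits / binomial(N, k)
    end
    return total / M
end

R = [0 1 1 0 1; 1 1 0 1 1]
g_pass_at_k_tao_sketch(R, 2, 0.5)      # j0 = 1, rows give 0.9 and 1.0, mean ≈ 0.95
g_pass_at_k_tao_sketch(R, 2, 1.0)      # j0 = 2, rows give 0.3 and 0.6, mean ≈ 0.45
```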
Examples
R = [0 1 1 0 1;
1 1 0 1 1]
g_pass_at_k_tao(R, 2, 0.5) # Returns ≈ 0.95
g_pass_at_k_tao(R, 2, 1.0) # Returns ≈ 0.45
Scorio.mg_pass_at_k — Function
mg_pass_at_k(R::AbstractMatrix{<:Integer}, k::Integer) -> Float64
mG-Pass@k: mean Generalized Pass@k.
Computes the mean of G-Pass@k_τ over the range τ ∈ [0.5, 1.0], inspired by the mean Average Precision (mAP) metric. This provides a comprehensive metric that integrates performance potential and stability across multiple thresholds.
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
Arguments
R::AbstractMatrix{<:Integer}: $M \times N$ binary matrix with entries in $\{0, 1\}$. $R_{\alpha i} = 1$ if trial $i$ for question $\alpha$ passed, 0 otherwise.
k::Integer: Number of samples to select ($1 \le k \le N$).
Returns
Float64: The average mG-Pass@k score across all M questions.
Notation
For each row $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]
$m = \lceil k/2 \rceil$ is the majority threshold (the integration starts at $\tau = 0.5$).
The metric is defined as the integral of G-Pass@k_τ over τ ∈ [0.5, 1.0]:
\[\text{mG-Pass@k} = 2 \int_{0.5}^{1.0} \text{G-Pass@k}_\tau \, d\tau\]
Formula
The discrete approximation used in computation:
\[\text{mG-Pass@k}_\alpha = \frac{2}{k} \sum_{j=m+1}^{k} (j - m) \cdot P(X = j)\]
where $X \sim \text{Hypergeometric}(N, \nu_\alpha, k)$ and the probability mass function is:
\[P(X = j) = \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]
The final metric is averaged over all questions:
\[\text{mG-Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{mG-Pass@k}_\alpha\]
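The weight $(j - m)$ in the discrete formula can be read as counting thresholds: summing the tail probabilities $P(X \ge i)$ over $i = m+1, \ldots, k$ counts each outcome $X = j$ exactly $(j - m)$ times. The sketch below implements the weighted sum as written. `mg_pass_at_k_sketch` is a hypothetical name, not the library's function.

```julia
# Hypothetical sketch of the discrete mG-Pass@k formula; not the library code.
function mg_pass_at_k_sketch(R::AbstractMatrix{<:Integer}, k::Integer)
    M, N = size(R)
    1 <= k <= N || throw(ArgumentError("k must satisfy 1 <= k <= N"))
    m = cld(k, 2)                      # majority threshold, ceil(k / 2)
    total = 0.0
    for a in 1:M
        nu = sum(view(R, a, :))        # number of correct samples in row a
        acc = 0.0
        for j in (m + 1):k
            # hypergeometric pmf P(X = j), weighted by (j - m)
            pmf = binomial(nu, j) * binomial(N - nu, k - j) / binomial(N, k)
            acc += (j - m) * pmf
        end
        total += 2 * acc / k
    end
    return total / M
end

R = [0 1 1 0 1; 1 1 0 1 1]
mg_pass_at_k_sketch(R, 3)              # m = 2, only j = 3 contributes, result ≈ 0.166667
```

For k = 3 the rows give $P(X = 3) = 0.1$ and $0.4$, so the per-row values are $\tfrac{2}{3} \cdot 0.1$ and $\tfrac{2}{3} \cdot 0.4$, whose mean is $1/6$.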
Examples
R = [0 1 1 0 1;
1 1 0 1 1]
mg_pass_at_k(R, 2) # Returns ≈ 0.45
mg_pass_at_k(R, 3) # Returns ≈ 0.166667
Ranking
Scorio.competition_ranks_from_scores — Function
competition_ranks_from_scores(scores_in_id_order::AbstractVector{<:Real}; tol::Real=1e-12) -> Vector{Int}
Compute competition ranks from scores.
Given L models with ids 1..L and their scores, returns competition ranks (1,2,3,3,5,...). Models with tied scores (within tolerance) receive the same rank.
Arguments
scores_in_id_order::AbstractVector{<:Real}: list/array of scores aligned to ids 1..L
tol::Real=1e-12: tolerance for considering scores as tied
Returns
Vector{Int}: competition ranks for each model
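One way to produce competition ranks is to sort ids by descending score and give each model either its 1-based position or, on a tie, the rank of the model just before it. The sketch below takes that approach. `competition_ranks_sketch` is a hypothetical name, and two points are assumptions rather than documented behavior: higher scores rank first, and a tie is detected by comparing each score with its immediate predecessor in sorted order.

```julia
# Hypothetical sketch of competition ranking; not the library code.
function competition_ranks_sketch(scores::AbstractVector{<:Real}; tol::Real=1e-12)
    order = sortperm(scores; rev=true)         # ids from best to worst score
    ranks = zeros(Int, length(scores))
    for (pos, id) in enumerate(order)
        if pos > 1 && abs(scores[id] - scores[order[pos - 1]]) <= tol
            ranks[id] = ranks[order[pos - 1]]  # tie: share the previous model's rank
        else
            ranks[id] = pos                    # no tie: rank is the sorted position
        end
    end
    return ranks
end

competition_ranks_sketch([0.95, 0.87, 0.87, 0.72, 0.65])   # [1, 2, 2, 4, 5]
```

Because the rank after a tie group is the sorted position, the rank following two models tied at 2 is 4, which is what distinguishes competition ranking from dense ranking.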
Examples
scores = [0.95, 0.87, 0.87, 0.72, 0.65]
ranks = competition_ranks_from_scores(scores)
# Returns: [1, 2, 2, 4, 5]
Scorio.elo — Function
elo(args...; kwargs...)
Not yet implemented.