Evaluation API

Evaluation methods operate on outcome matrices R with shape (M, N) (or vectors coerced to $1 \times N$).

Bayes Family

Scorio.bayesFunction
bayes(R, w=nothing, R0=nothing) -> (mu, sigma)

Performance evaluation using the Bayes@N framework.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2025). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: integer outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R$ is $M \times N$ with entries in $\{0,\ldots,C\}$.
  • w: optional length-$(C+1)$ weight vector $(w_0,\ldots,w_C)$ mapping class $k$ to score $w_k$. If omitted and R is binary, defaults to [0.0, 1.0]. For non-binary R, w is required.
  • R0: optional prior outcomes. Accepts 1D/2D integer input; after coercion it must be an $M \times D$ matrix with entries in $\{0,\ldots,C\}$. If omitted, $D=0$.

Returns

  • Tuple{Float64, Float64}: $(\mu, \sigma)$ posterior mean and posterior standard deviation.

Notation

$\delta_{a,b}$ is the Kronecker delta. For each question $\alpha$ and class $k \in \{0,\ldots,C\}$:

\[n_{\alpha k} = \sum_{i=1}^{N} \delta_{k, R_{\alpha i}}\]

\[n^0_{\alpha k} = 1 + \sum_{i=1}^{D} \delta_{k, R^0_{\alpha i}}\]

\[\nu_{\alpha k} = n_{\alpha k} + n^0_{\alpha k}\]

The effective sample size is:

\[T = 1 + C + D + N\]

Formula

\[\mu = w_0 + \frac{1}{M \cdot T} \sum_{\alpha=1}^{M}\sum_{j=0}^{C}\nu_{\alpha j}(w_j - w_0)\]

\[\sigma = \sqrt{ \frac{1}{M^2 (T+1)} \sum_{\alpha=1}^{M} \left[ \sum_j \frac{\nu_{\alpha j}}{T}(w_j-w_0)^2 - \left(\sum_j \frac{\nu_{\alpha j}}{T}(w_j-w_0)\right)^2 \right] }\]

Examples

R = [0 1 2 2 1;
     1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2;
      1 2]

mu, sigma = bayes(R, w, R0)
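The counting and the two formulas above can be cross-checked independently. The following stdlib-only Python sketch is a direct transcription of the formulas, not the Scorio API; the function name is illustrative:

```python
from math import sqrt

def bayes_mu_sigma(R, w, R0=None):
    # Transcription of the Bayes@N formulas above; rows of R (and R0) are lists.
    M, N = len(R), len(R[0])
    C = len(w) - 1
    D = len(R0[0]) if R0 is not None else 0
    T = 1 + C + D + N                         # effective sample size
    mu_acc, var_acc = 0.0, 0.0
    for alpha in range(M):
        # nu_{alpha k} = n_{alpha k} + n0_{alpha k}, with n0 = 1 + prior counts
        nu = [R[alpha].count(k) + 1 + (R0[alpha].count(k) if R0 is not None else 0)
              for k in range(C + 1)]
        assert sum(nu) == T                   # each row of nu sums to T
        mean_dw = sum(nu[k] * (w[k] - w[0]) for k in range(C + 1)) / T
        mean_dw2 = sum(nu[k] * (w[k] - w[0]) ** 2 for k in range(C + 1)) / T
        mu_acc += mean_dw
        var_acc += mean_dw2 - mean_dw ** 2
    mu = w[0] + mu_acc / M
    sigma = sqrt(var_acc / (M ** 2 * (T + 1)))
    return mu, sigma
```

On the example above this gives $T = 10$, $\nu = [[3,3,4],[2,4,4]]$, and $\mu = 0.575$.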
bayes(
    R::AbstractArray{<:Integer, 3},
    w=nothing;
    R0=nothing,
    quantile=nothing,
    method="competition",
    return_scores=false,
)

Rank models by Bayes@N scores computed independently per model.

If quantile is provided, models are ranked by mu + z_q * sigma; otherwise by posterior mean mu.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2025). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Formula

For each model l, let (mu_l, sigma_l) = Scorio.bayes(R_l, w, R0_l).

\[s_l = \begin{cases} \mu_l, & \text{if quantile is not set} \\ \mu_l + \Phi^{-1}(q)\,\sigma_l, & \text{if quantile}=q \in [0,1] \end{cases}\]

Arguments

  • R: integer tensor (L, M, N) with values in {0, ..., C}.
  • w: class weights of length C+1. If not provided and R is binary (contains only 0 and 1), defaults to [0.0, 1.0]. For non-binary R, w is required.
  • R0: optional shared prior (M, D) or model-specific prior (L, M, D).
  • quantile: optional value in [0, 1] for quantile-adjusted ranking.
  • method: tie-handling rule passed to rank_scores.
  • return_scores: if true, return (ranking, scores) instead of the ranking alone.
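The quantile adjustment in $s_l$ is a one-liner. A minimal sketch (the helper name is illustrative, not part of the API):

```python
from statistics import NormalDist

def quantile_score(mu, sigma, quantile=None):
    # s_l = mu when no quantile is given, else mu + Phi^{-1}(q) * sigma
    if quantile is None:
        return mu
    return mu + NormalDist().inv_cdf(quantile) * sigma
```

quantile = 0.5 leaves mu unchanged ($\Phi^{-1}(0.5) = 0$); values below 0.5 penalize models with larger posterior uncertainty.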
Scorio.bayes_ciFunction
bayes_ci(R, w=nothing, R0=nothing, confidence=0.95, bounds=nothing)
    -> (mu, sigma, lo, hi)

Bayes@N posterior summary with a normal-approximation credible interval.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2025). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: same contract as bayes: coerced to an $M \times N$ integer matrix.
  • w: same contract as bayes: optional class weights $(w_0,\ldots,w_C)$.
  • R0: same contract as bayes: optional prior outcomes as $M \times D$.
  • confidence::Real: credibility level $\gamma \in (0,1)$ (for example, 0.95).
  • bounds::Union{Nothing, Tuple{<:Real, <:Real}}: optional clipping interval $(\ell, u)$ applied to the returned bounds.

Returns

  • Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.

Notation

Let $(\mu, \sigma)$ be the Bayes@N posterior summary returned by bayes on the same inputs. Let $\gamma = \texttt{confidence}$ and $z_{(1+\gamma)/2}$ be the standard normal quantile.

Formula

The interval is:

\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\,\sigma\]

and then clipped to bounds when provided.
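The interval construction shared by the *_ci functions can be sketched directly (illustrative helper, not the Scorio API):

```python
from statistics import NormalDist

def normal_ci(mu, sigma, confidence=0.95, bounds=None):
    # (lo, hi) = mu -/+ z_{(1+gamma)/2} * sigma, optionally clipped to bounds
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    lo, hi = mu - z * sigma, mu + z * sigma
    if bounds is not None:
        lo = min(max(lo, bounds[0]), bounds[1])
        hi = min(max(hi, bounds[0]), bounds[1])
    return lo, hi
```

For confidence = 0.95 the multiplier is $z_{0.975} \approx 1.96$.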

Examples

R = [0 1 2 2 1;
     1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2;
      1 2]

mu, sigma, lo, hi = bayes_ci(R, w, R0, 0.95, (0.0, 1.0))

Avg Family

Scorio.avgFunction
avg(R, w=nothing) -> (a, sigma_a)

Average score with Bayes-scaled uncertainty (Avg@N).

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2025). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: integer outcomes. A 1D input with length N is reshaped to $1 \times N$. If w is omitted, entries must be binary in $\{0,1\}$.
  • w: optional length-$(C+1)$ weight vector $(w_0,\ldots,w_C)$. When omitted, binary mode is used with w = [0.0, 1.0].

Returns

  • Tuple{Float64, Float64}: $(a, \sigma_a)$ where $a$ is the weighted average and $\sigma_a$ is the uncertainty on the same scale.

Notation

After coercion, let outcomes be $R \in \{0,\ldots,C\}^{M \times N}$. For question $\alpha$ and trial $i$, the score contribution is $w_{R_{\alpha i}}$.

Formula

Point estimate:

\[a = \frac{1}{M \cdot N}\sum_{\alpha=1}^{M}\sum_{i=1}^{N} w_{R_{\alpha i}}\]

Uncertainty rescales Bayes@N uncertainty with $D=0$ and $T = 1 + C + N$:

\[\sigma_a = \frac{T}{N} \cdot \sigma_{\text{Bayes}}\]
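In the binary default (w = [0, 1], so C = 1, D = 0, and T = N + 2) the rescaled uncertainty reduces to a short computation. A sketch of the formulas above (illustrative, not the Scorio API):

```python
from math import sqrt

def avg_binary(R):
    # Avg@N for binary outcomes with w = [0, 1]: point estimate plus
    # Bayes-scaled uncertainty with D = 0, C = 1, hence T = N + 2.
    M, N = len(R), len(R[0])
    T = N + 2
    a = sum(map(sum, R)) / (M * N)
    # sigma_Bayes^2 with nu_{alpha 1} = 1 + (number of successes in row alpha)
    var_b = sum((1 + sum(r)) / T * (1 - (1 + sum(r)) / T) for r in R) / (M ** 2 * (T + 1))
    return a, (T / N) * sqrt(var_b)
```

On the example below, a = 7/10 while the uncertainty uses smoothed per-question rates 4/7 and 5/7.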

Examples

R = [0 1 1 0 1;
     1 1 0 1 1]
a, sigma = avg(R)
avg(R; method="competition", return_scores=false)

Rank models by per-model mean accuracy across all questions and trials.

For each model l, compute the scalar score:

\[s_l^{\mathrm{avg}} = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N} R_{lmn}\]

Higher scores are better; ranking is produced by rank_scores.

Arguments

  • R: binary response tensor (L, M, N) or matrix (L, M) promoted to (L, M, 1).
  • method: tie-handling rule for rank_scores.
  • return_scores: if true, return (ranking, scores).
Scorio.avg_ciFunction
avg_ci(R, w=nothing, confidence=0.95, bounds=nothing)
    -> (a, sigma_a, lo, hi)

Avg@N plus Bayesian uncertainty and normal-approximation credible interval.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2025). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: same contract as avg: coerced to an $M \times N$ outcome matrix.
  • w: same contract as avg: optional weights $(w_0,\ldots,w_C)$.
  • confidence::Real: credibility level $\gamma \in (0,1)$.
  • bounds::Union{Nothing, Tuple{<:Real, <:Real}}: optional clipping interval $(\ell, u)$ applied to the returned bounds.

Returns

  • Tuple{Float64, Float64, Float64, Float64}: $(a, \sigma_a, \mathrm{lo}, \mathrm{hi})$.

Notation

Let $(a, \sigma_a)$ be the Avg@N summary from avg on the same inputs. Let $\gamma = \texttt{confidence}$ and $z_{(1+\gamma)/2}$ be the standard normal quantile.

Formula

\[(\mathrm{lo}, \mathrm{hi}) = a \pm z_{(1+\gamma)/2}\,\sigma_a\]

then clipped to bounds if provided.

Examples

R = [0 1 1 0 1;
     1 1 0 1 1]

a, sigma, lo, hi = avg_ci(R, nothing, 0.95, (0.0, 1.0))

Pass Family (Point Metrics)

Scorio.pass_at_kMethod
pass_at_k(R, k) -> Float64

Unbiased Pass@k estimator for binary outcomes.

Computes the probability that at least one of k selected samples is correct, averaged over all questions.

References

Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: binary outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$ with $R_{\alpha i}=1$ indicating success.
  • k::Integer: number of selected samples, constrained by $1 \le k \le N$.

Returns

  • Float64: average Pass@k across all $M$ questions.

Notation

For each question $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]

where $\nu_\alpha$ is the number of successful trials. Let $C(a,b)=\binom{a}{b}$.

Formula

\[\mathrm{Pass@k}_\alpha = 1 - \frac{C(N - \nu_\alpha, k)}{C(N, k)}\]

\[\mathrm{Pass@k} = \frac{1}{M}\sum_{\alpha=1}^{M}\mathrm{Pass@k}_\alpha\]

Examples

R = [1 0 1 0;
     0 0 1 1]

s = pass_at_k(R, 2)
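The estimator is a direct transcription of the formula. A stdlib-only sketch (illustrative, independent of the Scorio implementation):

```python
from math import comb

def pass_at_k(R, k):
    # 1 - C(N - nu, k) / C(N, k) per question, averaged over questions.
    # math.comb returns 0 when k > N - nu, so fully solved questions score 1.
    vals = []
    for row in R:
        nu, N = sum(row), len(row)
        vals.append(1 - comb(N - nu, k) / comb(N, k))
    return sum(vals) / len(vals)
```

On the example above, each row has $\nu = 2$ of $N = 4$ successes, so Pass@2 = $1 - C(2,2)/C(4,2) = 5/6$.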
Scorio.pass_hat_kMethod
pass_hat_k(R, k) -> Float64

Pass-hat@k (Pass^k): probability that all k selected samples are correct.

Also known as G-Pass@k.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: binary outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
  • k::Integer: number of selected samples, constrained by $1 \le k \le N$.

Returns

  • Float64: average Pass-hat@k (Pass^k) across all $M$ questions.

Notation

For each question $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]

with $C(a,b)=\binom{a}{b}$.

Formula

\[\widehat{\mathrm{Pass@k}}_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]

\[\widehat{\mathrm{Pass@k}} = \frac{1}{M}\sum_{\alpha=1}^{M}\widehat{\mathrm{Pass@k}}_\alpha\]

Examples

R = [1 0 1 0;
     0 0 1 1]

s = pass_hat_k(R, 2)
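Analogously, Pass-hat@k is a one-line hypergeometric ratio (illustrative sketch, not the Scorio API):

```python
from math import comb

def pass_hat_k(R, k):
    # C(nu, k) / C(N, k) per question: probability that all k draws
    # (without replacement) land on successful trials.
    return sum(comb(sum(r), k) / comb(len(r), k) for r in R) / len(R)
```

On the example above both rows give $C(2,2)/C(4,2) = 1/6$.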
Scorio.g_pass_at_kMethod
g_pass_at_k(R, k) -> Float64

Alias for pass_hat_k.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: same contract as pass_hat_k: $R \in \{0,1\}^{M \times N}$ after coercion.
  • k::Integer: same contract as pass_hat_k: $1 \le k \le N$.

Returns

  • Float64: G-Pass@k score.

Notation

Let $\widehat{\mathrm{Pass@k}}$ be the value computed by pass_hat_k on the same R and k.

Formula

\[\mathrm{G\text{-}Pass@k} = \widehat{\mathrm{Pass@k}}\]

Examples

R = [1 0 1 0;
     0 0 1 1]

s = g_pass_at_k(R, 2)
Scorio.g_pass_at_k_tauMethod
g_pass_at_k_tau(R, k, tau) -> Float64

Generalized Pass@k with threshold tau.

Computes the probability of at least ceil(tau * k) successes in k draws, averaged across questions. Interpolates between Pass@k (any tau in (0, 1/k], for which ceil(tau * k) = 1) and Pass-hat@k (tau = 1).

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: binary outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
  • k::Integer: number of selected samples, constrained by $1 \le k \le N$.
  • tau::Real: threshold $\tau \in [0,1]$. Any $\tau \in (0, 1/k]$ recovers Pass@k, and $\tau = 1$ recovers Pass-hat@k.

Returns

  • Float64: average generalized pass score across questions.

Notation

For each question $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]

Let $j_0 = \lceil \tau k \rceil$ and $C(a,b)=\binom{a}{b}$.

Formula

\[\mathrm{G\text{-}Pass@k}_{\tau,\alpha} = \sum_{j=j_0}^{k} \frac{C(\nu_\alpha,j)\,C(N-\nu_\alpha,k-j)}{C(N,k)}\]

\[\mathrm{G\text{-}Pass@k}_{\tau} = \frac{1}{M}\sum_{\alpha=1}^{M} \mathrm{G\text{-}Pass@k}_{\tau,\alpha}\]

Examples

R = [1 0 1 0;
     0 0 1 1]

s = g_pass_at_k_tau(R, 3, 0.67)
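The formula is a hypergeometric tail probability; a stdlib-only sketch (illustrative, not the Scorio API):

```python
from math import ceil, comb

def g_pass_at_k_tau(R, k, tau):
    # P(X >= ceil(tau * k)) per question, X hypergeometric, averaged over questions
    j0 = ceil(tau * k)
    total = 0.0
    for row in R:
        nu, N = sum(row), len(row)
        total += sum(comb(nu, j) * comb(N - nu, k - j)
                     for j in range(j0, k + 1)) / comb(N, k)
    return total / len(R)
```

Setting $j_0 = 1$ (small tau) reproduces Pass@k, and tau = 1 reproduces Pass-hat@k.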
Scorio.mg_pass_at_kMethod
mg_pass_at_k(R, k) -> Float64

Mean generalized pass metric over tau in [0.5, 1.0].

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: binary outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
  • k::Integer: number of selected samples, constrained by $1 \le k \le N$.

Returns

  • Float64: average mG-Pass@k score.

Notation

For each question $\alpha$:

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]

Let $m = \lceil k/2 \rceil$ and $X_\alpha \sim \mathrm{Hypergeom}(N,\nu_\alpha,k)$.

Formula

\[\mathrm{mG\text{-}Pass@k}_\alpha = \frac{2}{k} \sum_{j=m+1}^{k} (j-m) \cdot P(X_\alpha = j)\]

\[\mathrm{mG\text{-}Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{mG\text{-}Pass@k}_\alpha\]

Examples

R = [1 0 1 0;
     0 0 1 1]

s = mg_pass_at_k(R, 3)
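The weighted tail sum above can be transcribed directly (illustrative sketch, not the Scorio API):

```python
from math import ceil, comb

def mg_pass_at_k(R, k):
    # (2/k) * sum_{j=m+1}^{k} (j - m) * P(X = j), m = ceil(k/2), X hypergeometric
    m = ceil(k / 2)
    total = 0.0
    for row in R:
        nu, N = sum(row), len(row)
        total += (2 / k) * sum(
            (j - m) * comb(nu, j) * comb(N - nu, k - j) / comb(N, k)
            for j in range(m + 1, k + 1)
        )
    return total / len(R)
```

For the example above (nu = 2, N = 4, k = 3, m = 2) the only term is j = 3, and $P(X = 3) = 0$, so the score is 0.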

Pass Family (Posterior + CI)

Scorio.pass_at_k_ciFunction
pass_at_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian Pass@k posterior summary and normal-approximation credible interval.

References

Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374

Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: binary outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
  • k::Integer: number of selected samples, constrained by $1 \le k \le N$.
  • confidence::Real: credibility level $\gamma \in (0,1)$.
  • bounds::Tuple{<:Real, <:Real}: clipping interval $(\ell, u)$ applied to returned bounds.
  • alpha0::Real, beta0::Real: Beta prior parameters with $\alpha_0 > 0$ and $\beta_0 > 0$.

Returns

  • Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.

Notation

After coercion, $R \in \{0,1\}^{M \times N}$. For question $\alpha$, let:

\[c_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]

and latent success probability:

\[p_\alpha \mid R \sim \mathrm{Beta}(\alpha_0 + c_\alpha,\; \beta_0 + N - c_\alpha)\]

Define $g(p) = 1 - (1-p)^k$ and $\gamma = \texttt{confidence}$.

Formula

Dataset-level moments are:

\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]

Credible interval:

\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]

then clipped to bounds.

Examples

R = [1 0 1 0;
     0 0 1 1]

mu, sigma, lo, hi = pass_at_k_ci(R, 2, 0.95, (0.0, 1.0), 1.0, 1.0)
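Under the Beta posterior, $\mathbb{E}[(1-p)^k] = B(\alpha, \beta + k)/B(\alpha, \beta)$ has a closed form as a ratio of rising factorials, which gives both moments of $g(p)$. The sketch below assumes these Beta-Binomial moments; names are illustrative and the Scorio implementation may differ in detail:

```python
from math import prod, sqrt
from statistics import NormalDist

def beta_E_one_minus_p_pow(a, b, k):
    # E[(1 - p)^k] for p ~ Beta(a, b): prod_{i<k} (b + i) / (a + b + i)
    return prod((b + i) / (a + b + i) for i in range(k))

def pass_at_k_ci_sketch(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0):
    M = len(R)
    mus, variances = [], []
    for row in R:
        c, N = sum(row), len(row)
        a, b = alpha0 + c, beta0 + N - c
        m1 = beta_E_one_minus_p_pow(a, b, k)        # E[(1-p)^k]
        m2 = beta_E_one_minus_p_pow(a, b, 2 * k)    # E[(1-p)^{2k}]
        mus.append(1 - m1)                          # E[g(p)], g(p) = 1 - (1-p)^k
        variances.append(m2 - m1 ** 2)              # Var[g(p)] = Var[(1-p)^k]
    mu = sum(mus) / M
    sigma = sqrt(sum(variances)) / M
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    clip = lambda x: min(max(x, bounds[0]), bounds[1])
    return mu, sigma, clip(mu - z * sigma), clip(mu + z * sigma)
```

On the example above, each row has posterior Beta(3, 3), so $\mathbb{E}[(1-p)^2] = 2/7$ and $\mu = 5/7$.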
Scorio.pass_hat_k_ciFunction
pass_hat_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian Pass-hat@k (Pass^k) posterior summary and credible interval.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045

Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: same contract as pass_at_k_ci: $R \in \{0,1\}^{M \times N}$ after coercion.
  • k::Integer: same contract as pass_at_k_ci: $1 \le k \le N$.
  • confidence::Real: credibility level $\gamma \in (0,1)$.
  • bounds::Tuple{<:Real, <:Real}: clipping interval $(\ell, u)$.
  • alpha0::Real, beta0::Real: Beta prior parameters.

Returns

  • Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.

Notation

Use the same posterior model and symbols as pass_at_k_ci, and define:

\[g(p) = p^k\]

with $\gamma = \texttt{confidence}$.

Formula

Dataset-level moments are:

\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]

Credible interval:

\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]

then clipped to bounds.

Examples

R = [1 0 1 0;
     0 0 1 1]

mu, sigma, lo, hi = pass_hat_k_ci(R, 2, 0.95, (0.0, 1.0), 1.0, 1.0)
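Only the moment function changes relative to pass_at_k_ci: with $g(p) = p^k$, the needed quantity is $\mathbb{E}[p^k] = B(\alpha + k, \beta)/B(\alpha, \beta)$, again a rising-factorial ratio (illustrative helper name; a sketch of the assumed Beta-Binomial moments, not the Scorio API):

```python
from math import prod

def beta_E_p_pow(a, b, k):
    # E[p^k] for p ~ Beta(a, b): prod_{i<k} (a + i) / (a + b + i)
    return prod((a + i) / (a + b + i) for i in range(k))
```

Per-question moments are m1 = beta_E_p_pow(a, b, k) and m2 = beta_E_p_pow(a, b, 2k); the dataset-level $\mu$, $\sigma$, and interval then follow exactly as in pass_at_k_ci.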
Scorio.g_pass_at_k_ciFunction
g_pass_at_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Alias for pass_hat_k_ci.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: same contract as pass_hat_k_ci: $R \in \{0,1\}^{M \times N}$ after coercion.
  • k::Integer: same contract as pass_hat_k_ci: $1 \le k \le N$.
  • confidence::Real: credibility level $\gamma \in (0,1)$.
  • bounds::Tuple{<:Real, <:Real}: clipping interval $(\ell, u)$.
  • alpha0::Real, beta0::Real: Beta prior parameters.

Returns

  • Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.

Notation

Let $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$ be the result from pass_hat_k_ci on the same inputs.

Formula

\[\mathrm{G\text{-}Pass@k}_{\mathrm{ci}} = \widehat{\mathrm{Pass@k}}_{\mathrm{ci}}\]

Examples

R = [1 0 1 0;
     0 0 1 1]

mu, sigma, lo, hi = g_pass_at_k_ci(R, 2)
Scorio.g_pass_at_k_tau_ciFunction
g_pass_at_k_tau_ci(R, k, tau, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian generalized Pass@k with threshold $\tau$ and credible interval.

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: binary outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
  • k::Integer: number of selected samples, constrained by $1 \le k \le N$.
  • tau::Real: threshold $\tau \in [0,1]$.
  • confidence::Real: credibility level $\gamma \in (0,1)$.
  • bounds::Tuple{<:Real, <:Real}: clipping interval $(\ell, u)$.
  • alpha0::Real, beta0::Real: Beta prior parameters.

Returns

  • Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.

Notation

Use the same posterior model and symbols as pass_at_k_ci. Let:

\[j_0 = \lceil \tau k \rceil\]

and:

\[g(p) = \sum_{j=j_0}^{k} \binom{k}{j} p^j (1-p)^{k-j}\]

with $\gamma = \texttt{confidence}$.

Formula

Dataset-level moments are:

\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]

Credible interval:

\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]

then clipped to bounds.

Examples

R = [1 0 1 0;
     0 0 1 1]

mu, sigma, lo, hi = g_pass_at_k_tau_ci(R, 3, 0.67)
Scorio.mg_pass_at_k_ciFunction
mg_pass_at_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian mG-Pass@k posterior summary and credible interval.

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.

Arguments

  • R::Union{AbstractVector, AbstractMatrix}: binary outcomes. A 1D input with length N is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
  • k::Integer: number of selected samples, constrained by $1 \le k \le N$.
  • confidence::Real: credibility level $\gamma \in (0,1)$.
  • bounds::Tuple{<:Real, <:Real}: clipping interval $(\ell, u)$.
  • alpha0::Real, beta0::Real: Beta prior parameters.

Returns

  • Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.

Notation

Use the same posterior model and symbols as pass_at_k_ci. Let $m = \lceil k/2 \rceil$ and:

\[g(p)=\frac{2}{k}\sum_{j=m+1}^{k}(j-m)\binom{k}{j}p^j(1-p)^{k-j}\]

with $\gamma = \texttt{confidence}$.

Formula

Dataset-level moments are:

\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]

Credible interval:

\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]

then clipped to bounds.

Examples

R = [1 0 1 0;
     0 0 1 1]

mu, sigma, lo, hi = mg_pass_at_k_ci(R, 3)