Evaluation API
Evaluation methods operate on outcome matrices $R$ with shape $(M, N)$ (vectors are coerced to $1 \times N$).
Bayes Family
Scorio.bayes — Function
bayes(R, w=nothing, R0=nothing) -> (mu, sigma)

Performance evaluation using the Bayes@N framework.
References
Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: integer outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R$ is $M \times N$ with entries in $\{0,\ldots,C\}$.
- `w`: optional length-$(C+1)$ weight vector $(w_0,\ldots,w_C)$ mapping class $k$ to score $w_k$. If omitted and `R` is binary, defaults to `[0.0, 1.0]`. For non-binary `R`, `w` is required.
- `R0`: optional prior outcomes. Accepts 1D/2D integer input; after coercion it must be an $M \times D$ matrix with entries in $\{0,\ldots,C\}$. If omitted, $D=0$.
Returns
Tuple{Float64, Float64}: $(\mu, \sigma)$ posterior mean and posterior standard deviation.
Notation
$\delta_{a,b}$ is the Kronecker delta. For each question $\alpha$ and class $k \in \{0,\ldots,C\}$:
\[n_{\alpha k} = \sum_{i=1}^{N} \delta_{k, R_{\alpha i}}\]
\[n^0_{\alpha k} = 1 + \sum_{i=1}^{D} \delta_{k, R^0_{\alpha i}}\]
\[\nu_{\alpha k} = n_{\alpha k} + n^0_{\alpha k}\]
The effective sample size is:
\[T = 1 + C + D + N\]
Formula
\[\mu = w_0 + \frac{1}{M \cdot T} \sum_{\alpha=1}^{M}\sum_{j=0}^{C}\nu_{\alpha j}(w_j - w_0)\]
\[\sigma = \sqrt{ \frac{1}{M^2 (T+1)} \sum_{\alpha=1}^{M} \left[ \sum_j \frac{\nu_{\alpha j}}{T}(w_j-w_0)^2 - \left(\sum_j \frac{\nu_{\alpha j}}{T}(w_j-w_0)\right)^2 \right] }\]
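As a numerical cross-check of the formulas above, here is a minimal Python sketch (illustrative only, not the Scorio implementation; `bayes_sketch` is a made-up name) applied to the same data as the example below:

```python
# Minimal sketch of the Bayes@N mean and standard deviation formulas.
# Entries of R and R0 lie in {0, ..., C}; rows are questions.
import math

def bayes_sketch(R, w, R0=None):
    M, N = len(R), len(R[0])
    C = len(w) - 1
    D = len(R0[0]) if R0 else 0
    T = 1 + C + D + N                        # effective sample size
    mu_acc, var_acc = 0.0, 0.0
    for a in range(M):
        # nu_{alpha k} = n_{alpha k} + n0_{alpha k}, including the +1 prior count
        nu = [1 + R[a].count(k) + (R0[a].count(k) if R0 else 0)
              for k in range(C + 1)]
        mean_a = sum(nu[k] / T * (w[k] - w[0]) for k in range(C + 1))
        sq_a = sum(nu[k] / T * (w[k] - w[0]) ** 2 for k in range(C + 1))
        mu_acc += mean_a
        var_acc += sq_a - mean_a ** 2
    mu = w[0] + mu_acc / M
    sigma = math.sqrt(var_acc / (M ** 2 * (T + 1)))
    return mu, sigma

mu, sigma = bayes_sketch([[0, 1, 2, 2, 1], [1, 1, 0, 2, 2]],
                         [0.0, 0.5, 1.0],
                         [[0, 2], [1, 2]])   # mu = 0.575
```

Here $T = 1 + 2 + 2 + 5 = 10$, and each row's $\nu_{\alpha k}$ sums to $T$ as expected.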
Examples
R = [0 1 2 2 1;
1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2;
1 2]
mu, sigma = bayes(R, w, R0)

bayes(
    R::AbstractArray{<:Integer, 3},
    w=nothing;
    R0=nothing,
    quantile=nothing,
    method="competition",
    return_scores=false,
)

Rank models by Bayes@N scores computed independently per model.
If `quantile` is provided, models are ranked by $\mu_l + z_q\,\sigma_l$ with $z_q = \Phi^{-1}(q)$; otherwise by the posterior mean $\mu_l$.
References
Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv:2510.04265. https://arxiv.org/abs/2510.04265
Formula
For each model $l$, let $(\mu_l, \sigma_l) = \texttt{bayes}(R_l, w, R^0_l)$.
\[s_l = \begin{cases} \mu_l, & \text{if quantile is not set} \\ \mu_l + \Phi^{-1}(q)\,\sigma_l, & \text{if quantile}=q \in [0,1] \end{cases}\]
Arguments
- `R`: integer tensor `(L, M, N)` with values in $\{0,\ldots,C\}$.
- `w`: class weights of length `C + 1`. If not provided and `R` is binary (contains only 0 and 1), defaults to `[0.0, 1.0]`. For non-binary `R`, `w` is required.
- `R0`: optional shared prior `(M, D)` or model-specific prior `(L, M, D)`.
- `quantile`: optional value in `[0, 1]` for quantile-adjusted ranking.
- `method`, `return_scores`: ranking output controls.
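The quantile adjustment itself is a one-liner; a Python sketch using the standard library's `NormalDist` (illustrative only; `ranking_scores` is a hypothetical helper, and the per-model `(mu, sigma)` pairs are assumed precomputed):

```python
# Quantile-adjusted ranking score: s_l = mu_l + Phi^{-1}(q) * sigma_l.
from statistics import NormalDist

def ranking_scores(summaries, quantile=None):
    if quantile is None:
        return [mu for mu, _ in summaries]   # rank by posterior mean
    z = NormalDist().inv_cdf(quantile)       # Phi^{-1}(q); negative for q < 0.5
    return [mu + z * sigma for mu, sigma in summaries]

# With q = 0.25, a higher-mean but higher-variance model can rank lower.
scores = ranking_scores([(0.70, 0.05), (0.72, 0.10)], quantile=0.25)
```

This shows why a conservative quantile ($q < 0.5$) penalizes models whose scores are uncertain.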
Scorio.bayes_ci — Function
bayes_ci(R, w=nothing, R0=nothing, confidence=0.95, bounds=nothing)
    -> (mu, sigma, lo, hi)

Bayes@N posterior summary with a normal-approximation credible interval.
References
Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: same contract as `bayes`: coerced to an $M \times N$ integer matrix.
- `w`: same contract as `bayes`: optional class weights $(w_0,\ldots,w_C)$.
- `R0`: same contract as `bayes`: optional prior outcomes as $M \times D$.
- `confidence::Real`: credibility level $\gamma \in (0,1)$ (for example, `0.95`).
- `bounds::Union{Nothing, Tuple{<:Real, <:Real}}`: optional clipping interval $(\ell, u)$ applied to the returned bounds.
Returns
Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.
Notation
Let $(\mu, \sigma)$ be the Bayes@N posterior summary returned by bayes on the same inputs. Let $\gamma = \texttt{confidence}$ and $z_{(1+\gamma)/2}$ be the standard normal quantile.
Formula
The interval is:
\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\,\sigma\]
and then clipped to bounds when provided.
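The same interval recipe is shared by all of the `_ci` functions in this API; a Python sketch of just this step (illustrative only, using the standard library's `NormalDist` for the quantile):

```python
# Normal-approximation credible interval: mu +/- z_{(1+gamma)/2} * sigma,
# optionally clipped to a (lo, hi) bounds tuple.
from statistics import NormalDist

def normal_ci(mu, sigma, confidence=0.95, bounds=None):
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # z_{(1+gamma)/2}
    lo, hi = mu - z * sigma, mu + z * sigma
    if bounds is not None:                           # clip only when requested
        lo, hi = max(lo, bounds[0]), min(hi, bounds[1])
    return lo, hi

lo, hi = normal_ci(0.575, 0.084, 0.95, (0.0, 1.0))
```

For `confidence = 0.95` the quantile is the familiar $z_{0.975} \approx 1.96$.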
Examples
R = [0 1 2 2 1;
1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2;
1 2]
mu, sigma, lo, hi = bayes_ci(R, w, R0, 0.95, (0.0, 1.0))

Avg Family
Scorio.avg — Function
avg(R, w=nothing) -> (a, sigma_a)

Average score with Bayes-scaled uncertainty (Avg@N).
References
Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: integer outcomes. A 1D input of length `N` is reshaped to $1 \times N$. If `w` is omitted, entries must be binary in $\{0,1\}$.
- `w`: optional length-$(C+1)$ weight vector $(w_0,\ldots,w_C)$. When omitted, binary mode is used with `w = [0.0, 1.0]`.
Returns
Tuple{Float64, Float64}: $(a, \sigma_a)$ where $a$ is the weighted average and $\sigma_a$ is the uncertainty on the same scale.
Notation
After coercion, let outcomes be $R \in \{0,\ldots,C\}^{M \times N}$. For question $\alpha$ and trial $i$, the score contribution is $w_{R_{\alpha i}}$.
Formula
Point estimate:
\[a = \frac{1}{M \cdot N}\sum_{\alpha=1}^{M}\sum_{i=1}^{N} w_{R_{\alpha i}}\]
Uncertainty rescales Bayes@N uncertainty with $D=0$ and $T = 1 + C + N$:
\[\sigma_a = \frac{T}{N} \cdot \sigma_{\text{Bayes}}\]
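A Python sketch of the binary case of these two formulas (illustrative only, not the Scorio implementation; `avg_sketch` is a made-up name):

```python
# Avg@N sketch, binary case with w = [0.0, 1.0]: the point estimate is the
# plain mean; the uncertainty rescales the Bayes@N sigma with D = 0.
import math

def avg_sketch(R):
    M, N = len(R), len(R[0])
    T = 2 + N                                # C = 1, D = 0, so T = 1 + C + N
    a = sum(map(sum, R)) / (M * N)           # plain average score
    var_acc = 0.0
    for row in R:
        p = (1 + sum(row)) / T               # nu_{alpha 1} / T with the +1 prior
        var_acc += p * (1 - p)               # per-question variance term
    sigma_bayes = math.sqrt(var_acc / (M ** 2 * (T + 1)))
    return a, (T / N) * sigma_bayes          # Bayes-scaled uncertainty

a, sigma = avg_sketch([[0, 1, 1, 0, 1], [1, 1, 0, 1, 1]])   # a = 0.7
```

Note that the point estimate ignores the prior entirely; only the uncertainty is Bayesian.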
Examples
R = [0 1 1 0 1;
1 1 0 1 1]
a, sigma = avg(R)

avg(R; method="competition", return_scores=false)

Rank models by per-model mean accuracy across all questions and trials.
For each model l, compute the scalar score:
\[s_l^{\mathrm{avg}} = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N} R_{lmn}\]
Higher scores are better; ranking is produced by rank_scores.
Arguments
- `R`: binary response tensor `(L, M, N)` or matrix `(L, M)` promoted to `(L, M, 1)`.
- `method`: tie-handling rule for `rank_scores`.
- `return_scores`: if `true`, return `(ranking, scores)`.
Scorio.avg_ci — Function
avg_ci(R, w=nothing, confidence=0.95, bounds=nothing)
    -> (a, sigma_a, lo, hi)

Avg@N plus Bayesian uncertainty and a normal-approximation credible interval.
References
Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation. arXiv preprint arXiv:2510.04265. https://arxiv.org/abs/2510.04265
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: same contract as `avg`: coerced to an $M \times N$ outcome matrix.
- `w`: same contract as `avg`: optional weights $(w_0,\ldots,w_C)$.
- `confidence::Real`: credibility level $\gamma \in (0,1)$.
- `bounds::Union{Nothing, Tuple{<:Real, <:Real}}`: optional clipping interval $(\ell, u)$ applied to the returned bounds.
Returns
Tuple{Float64, Float64, Float64, Float64}: $(a, \sigma_a, \mathrm{lo}, \mathrm{hi})$.
Notation
Let $(a, \sigma_a)$ be the Avg@N summary from avg on the same inputs. Let $\gamma = \texttt{confidence}$ and $z_{(1+\gamma)/2}$ be the standard normal quantile.
Formula
\[(\mathrm{lo}, \mathrm{hi}) = a \pm z_{(1+\gamma)/2}\,\sigma_a\]
with $\gamma = \texttt{confidence}$, then clipped to bounds if provided.
Examples
R = [0 1 1 0 1;
1 1 0 1 1]
a, sigma, lo, hi = avg_ci(R, nothing, 0.95, (0.0, 1.0))

Pass Family (Point Metrics)
Scorio.pass_at_k — Method
pass_at_k(R, k) -> Float64

Unbiased Pass@k estimator for binary outcomes.
Computes the probability that at least one of k selected samples is correct, averaged over all questions.
References
Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: binary outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$ with $R_{\alpha i}=1$ indicating success.
- `k::Integer`: number of selected samples, constrained by $1 \le k \le N$.
Returns
Float64: average Pass@k across all $M$ questions.
Notation
For each question $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]
where $\nu_\alpha$ is the number of successful trials. Let $C(a,b)=\binom{a}{b}$.
Formula
\[\mathrm{Pass@k}_\alpha = 1 - \frac{C(N - \nu_\alpha, k)}{C(N, k)}\]
\[\mathrm{Pass@k} = \frac{1}{M}\sum_{\alpha=1}^{M}\mathrm{Pass@k}_\alpha\]
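The two formulas above fit in a few lines of Python (illustrative only, not the Scorio implementation); `math.comb` conveniently returns $0$ when $N - \nu_\alpha < k$, matching the convention $C(a,b)=0$ for $a < b$:

```python
# Unbiased Pass@k: average of 1 - C(N - nu, k) / C(N, k) over questions.
from math import comb

def pass_at_k_sketch(R, k):
    N = len(R[0])
    return sum(1 - comb(N - sum(row), k) / comb(N, k) for row in R) / len(R)

s = pass_at_k_sketch([[1, 0, 1, 0], [0, 0, 1, 1]], 2)   # each row has nu = 2
```

With $\nu_\alpha = 2$ and $N = 4$, each question scores $1 - \binom{2}{2}/\binom{4}{2} = 5/6$.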
Examples
R = [1 0 1 0;
0 0 1 1]
s = pass_at_k(R, 2)

Scorio.pass_hat_k — Method
pass_hat_k(R, k) -> Float64

Pass-hat@k (Pass^k): the probability that all k selected samples are correct.
Also known as G-Pass@k.
References
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: binary outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
- `k::Integer`: number of selected samples, constrained by $1 \le k \le N$.
Returns
Float64: average Pass-hat@k (Pass^k) across all $M$ questions.
Notation
For each question $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]
with $C(a,b)=\binom{a}{b}$.
Formula
\[\widehat{\mathrm{Pass@k}}_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]
\[\widehat{\mathrm{Pass@k}} = \frac{1}{M}\sum_{\alpha=1}^{M}\widehat{\mathrm{Pass@k}}_\alpha\]
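A matching Python sketch of this formula (illustrative only, not the Scorio implementation):

```python
# Pass-hat@k: average of C(nu, k) / C(N, k) over questions.
# comb(nu, k) is 0 whenever nu < k, matching C(nu, k) = 0.
from math import comb

def pass_hat_k_sketch(R, k):
    N = len(R[0])
    return sum(comb(sum(row), k) / comb(N, k) for row in R) / len(R)

s = pass_hat_k_sketch([[1, 0, 1, 0], [0, 0, 1, 1]], 2)   # 1/6 per question
```

Compare with Pass@k on the same data: requiring *all* $k$ draws to succeed gives $1/6$ instead of $5/6$.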
Examples
R = [1 0 1 0;
0 0 1 1]
s = pass_hat_k(R, 2)

Scorio.g_pass_at_k — Method
g_pass_at_k(R, k) -> Float64

Alias for pass_hat_k.
References
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: same contract as `pass_hat_k`: $R \in \{0,1\}^{M \times N}$ after coercion.
- `k::Integer`: same contract as `pass_hat_k`: $1 \le k \le N$.
Returns
Float64: G-Pass@k score.
Notation
Let $\widehat{\mathrm{Pass@k}}$ be the value computed by pass_hat_k on the same R and k.
Formula
\[\mathrm{G\text{-}Pass@k} = \widehat{\mathrm{Pass@k}}\]
Examples
R = [1 0 1 0;
0 0 1 1]
s = g_pass_at_k(R, 2)

Scorio.g_pass_at_k_tau — Method
g_pass_at_k_tau(R, k, tau) -> Float64

Generalized Pass@k with threshold tau.
Computes the probability of at least $\lceil \tau k \rceil$ successes in $k$ draws, averaged across questions; it interpolates between Pass@k ($\tau = 0$) and Pass-hat@k ($\tau = 1$).
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: binary outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
- `k::Integer`: number of selected samples, constrained by $1 \le k \le N$.
- `tau::Real`: threshold $\tau \in [0,1]$. $\tau = 0$ recovers Pass@k, and $\tau = 1$ recovers Pass-hat@k.
Returns
Float64: average generalized pass score across questions.
Notation
For each question $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]
Let $j_0 = \lceil \tau k \rceil$ and $C(a,b)=\binom{a}{b}$.
Formula
\[\mathrm{G\text{-}Pass@k}_{\tau,\alpha} = \sum_{j=j_0}^{k} \frac{C(\nu_\alpha,j)\,C(N-\nu_\alpha,k-j)}{C(N,k)}\]
\[\mathrm{G\text{-}Pass@k}_{\tau} = \frac{1}{M}\sum_{\alpha=1}^{M} \mathrm{G\text{-}Pass@k}_{\tau,\alpha}\]
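The hypergeometric tail sum above can be sketched directly in Python (illustrative only, not the Scorio implementation):

```python
# Generalized Pass@k: probability of at least j0 = ceil(tau * k) successes
# among k draws without replacement, averaged over questions.
from math import ceil, comb

def g_pass_at_k_tau_sketch(R, k, tau):
    N = len(R[0])
    j0 = ceil(tau * k)                       # minimum number of successes
    total = 0.0
    for row in R:
        nu = sum(row)                        # successes among the N trials
        total += sum(comb(nu, j) * comb(N - nu, k - j)
                     for j in range(j0, k + 1)) / comb(N, k)
    return total / len(R)

R = [[1, 0, 1, 0], [0, 0, 1, 1]]
s = g_pass_at_k_tau_sketch(R, 3, 0.67)      # j0 = 3 but nu = 2, so 0.0
s_hat = g_pass_at_k_tau_sketch(R, 2, 1.0)   # tau = 1 matches Pass-hat@k: 1/6
```

With $k=3$ and $\tau=0.67$, $j_0 = \lceil 2.01 \rceil = 3$ exceeds every row's $\nu_\alpha = 2$, so the score is zero.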
Examples
R = [1 0 1 0;
0 0 1 1]
s = g_pass_at_k_tau(R, 3, 0.67)

Scorio.mg_pass_at_k — Method
mg_pass_at_k(R, k) -> Float64

Mean generalized pass metric over tau in [0.5, 1.0].
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: binary outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
- `k::Integer`: number of selected samples, constrained by $1 \le k \le N$.
Returns
Float64: average mG-Pass@k score.
Notation
For each question $\alpha$:
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]
Let $m = \lceil k/2 \rceil$ and $X_\alpha \sim \mathrm{Hypergeom}(N,\nu_\alpha,k)$.
Formula
\[\mathrm{mG\text{-}Pass@k}_\alpha = \frac{2}{k} \sum_{j=m+1}^{k} (j-m) \cdot P(X_\alpha = j)\]
\[\mathrm{mG\text{-}Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{mG\text{-}Pass@k}_\alpha\]
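A Python sketch of the weighted hypergeometric sum above (illustrative only, not the Scorio implementation):

```python
# mG-Pass@k: (2/k) * sum_{j=m+1}^{k} (j - m) * P(X = j), m = ceil(k/2),
# where X ~ Hypergeom(N, nu, k), averaged over questions.
from math import ceil, comb

def mg_pass_at_k_sketch(R, k):
    N = len(R[0])
    m = ceil(k / 2)
    total = 0.0
    for row in R:
        nu = sum(row)
        total += (2 / k) * sum((j - m) * comb(nu, j) * comb(N - nu, k - j)
                               for j in range(m + 1, k + 1)) / comb(N, k)
    return total / len(R)

R = [[1, 0, 1, 0], [0, 0, 1, 1]]
s2 = mg_pass_at_k_sketch(R, 2)   # m = 1, only j = 2 contributes: 1/6
s3 = mg_pass_at_k_sketch(R, 3)   # m = 2, needs j = 3 but nu = 2: 0.0
```

The linear weights $(j - m)$ make this a discrete average of $\mathrm{G\text{-}Pass@k}_\tau$ over $\tau \in [0.5, 1.0]$.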
Examples
R = [1 0 1 0;
0 0 1 1]
s = mg_pass_at_k(R, 3)

Pass Family (Posterior + CI)
Scorio.pass_at_k_ci — Function
pass_at_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian Pass@k posterior summary and normal-approximation credible interval.
References
Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374
Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: binary outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
- `k::Integer`: number of selected samples, constrained by $1 \le k \le N$.
- `confidence::Real`: credibility level $\gamma \in (0,1)$.
- `bounds::Tuple{<:Real, <:Real}`: clipping interval $(\ell, u)$ applied to returned bounds.
- `alpha0::Real`, `beta0::Real`: Beta prior parameters with $\alpha_0 > 0$ and $\beta_0 > 0$.
Returns
Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.
Notation
After coercion, $R \in \{0,1\}^{M \times N}$. For question $\alpha$, let:
\[c_\alpha = \sum_{i=1}^{N} R_{\alpha i}\]
and latent success probability:
\[p_\alpha \mid R \sim \mathrm{Beta}(\alpha_0 + c_\alpha,\; \beta_0 + N - c_\alpha)\]
Define $g(p) = 1 - (1-p)^k$ and $\gamma = \texttt{confidence}$.
Formula
Dataset-level moments are:
\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]
Credible interval:
\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]
then clipped to bounds.
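For $g(p) = 1-(1-p)^k$ the posterior moments have a closed form, since $\mathbb{E}[(1-p)^j] = \prod_{i=0}^{j-1}(b+i)/(a+b+i)$ under $\mathrm{Beta}(a,b)$. A Python sketch built on that identity (illustrative only, not the Scorio implementation; the helper names are made up):

```python
# Beta posterior moments for Pass@k, plus the normal-approximation interval.
import math
from statistics import NormalDist

def falling_moment(a, b, j):
    # E[(1 - p)^j] for p ~ Beta(a, b)
    out = 1.0
    for i in range(j):
        out *= (b + i) / (a + b + i)
    return out

def pass_at_k_ci_sketch(R, k, confidence=0.95, bounds=(0.0, 1.0),
                        alpha0=1.0, beta0=1.0):
    M, N = len(R), len(R[0])
    mu_acc, var_acc = 0.0, 0.0
    for row in R:
        c = sum(row)
        a, b = alpha0 + c, beta0 + N - c      # Beta posterior parameters
        e1 = falling_moment(a, b, k)          # E[(1-p)^k]
        e2 = falling_moment(a, b, 2 * k)      # E[(1-p)^{2k}]
        mu_acc += 1.0 - e1                    # E[g(p)]
        var_acc += e2 - e1 * e1               # Var[g(p)] = Var[(1-p)^k]
    mu = mu_acc / M
    sigma = math.sqrt(var_acc) / M
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    lo = max(mu - z * sigma, bounds[0])
    hi = min(mu + z * sigma, bounds[1])
    return mu, sigma, lo, hi

mu, sigma, lo, hi = pass_at_k_ci_sketch([[1, 0, 1, 0], [0, 0, 1, 1]], 2)
```

Swapping $g$ (for example $g(p)=p^k$, with the analogous rising product on $a$) gives the same recipe for the other `_ci` estimators below.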
Examples
R = [1 0 1 0;
0 0 1 1]
mu, sigma, lo, hi = pass_at_k_ci(R, 2, 0.95, (0.0, 1.0), 1.0, 1.0)

Scorio.pass_hat_k_ci — Function
pass_hat_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian Pass-hat@k (Pass^k) posterior summary and credible interval.
References
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045
Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: same contract as `pass_at_k_ci`: $R \in \{0,1\}^{M \times N}$ after coercion.
- `k::Integer`: same contract as `pass_at_k_ci`: $1 \le k \le N$.
- `confidence::Real`: credibility level $\gamma \in (0,1)$.
- `bounds::Tuple{<:Real, <:Real}`: clipping interval $(\ell, u)$.
- `alpha0::Real`, `beta0::Real`: Beta prior parameters.
Returns
Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.
Notation
Use the same posterior model and symbols as pass_at_k_ci, and define:
\[g(p) = p^k\]
with $\gamma = \texttt{confidence}$.
Formula
Dataset-level moments are:
\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]
Credible interval:
\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]
then clipped to bounds.
Examples
R = [1 0 1 0;
0 0 1 1]
mu, sigma, lo, hi = pass_hat_k_ci(R, 2, 0.95, (0.0, 1.0), 1.0, 1.0)

Scorio.g_pass_at_k_ci — Function
g_pass_at_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Alias for pass_hat_k_ci.
References
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: same contract as `pass_hat_k_ci`: $R \in \{0,1\}^{M \times N}$ after coercion.
- `k::Integer`: same contract as `pass_hat_k_ci`: $1 \le k \le N$.
- `confidence::Real`: credibility level $\gamma \in (0,1)$.
- `bounds::Tuple{<:Real, <:Real}`: clipping interval $(\ell, u)$.
- `alpha0::Real`, `beta0::Real`: Beta prior parameters.
Returns
Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.
Notation
Let $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$ be the result from pass_hat_k_ci on the same inputs.
Formula
\[\mathrm{G\text{-}Pass@k}_{\mathrm{ci}} = \widehat{\mathrm{Pass@k}}_{\mathrm{ci}}\]
Examples
R = [1 0 1 0;
0 0 1 1]
mu, sigma, lo, hi = g_pass_at_k_ci(R, 2)

Scorio.g_pass_at_k_tau_ci — Function
g_pass_at_k_tau_ci(R, k, tau, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian generalized Pass@k with threshold $\tau$ and credible interval.
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: binary outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
- `k::Integer`: number of selected samples, constrained by $1 \le k \le N$.
- `tau::Real`: threshold $\tau \in [0,1]$.
- `confidence::Real`: credibility level $\gamma \in (0,1)$.
- `bounds::Tuple{<:Real, <:Real}`: clipping interval $(\ell, u)$.
- `alpha0::Real`, `beta0::Real`: Beta prior parameters.
Returns
Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.
Notation
Use the same posterior model and symbols as pass_at_k_ci. Let:
\[j_0 = \lceil \tau k \rceil\]
and:
\[g(p) = \sum_{j=j_0}^{k} \binom{k}{j} p^j (1-p)^{k-j}\]
with $\gamma = \texttt{confidence}$.
Formula
Dataset-level moments are:
\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]
Credible interval:
\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]
then clipped to bounds.
Examples
R = [1 0 1 0;
0 0 1 1]
mu, sigma, lo, hi = g_pass_at_k_tau_ci(R, 3, 0.67)

Scorio.mg_pass_at_k_ci — Function

mg_pass_at_k_ci(R, k, confidence=0.95, bounds=(0,1), alpha0=1, beta0=1)
    -> (mu, sigma, lo, hi)

Bayesian mG-Pass@k posterior summary and credible interval.
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
Beta-Binomial posterior moments for the *_ci estimators follow the Scorio package implementation.
Arguments
- `R::Union{AbstractVector, AbstractMatrix}`: binary outcomes. A 1D input of length `N` is reshaped to $1 \times N$. After coercion, $R \in \{0,1\}^{M \times N}$.
- `k::Integer`: number of selected samples, constrained by $1 \le k \le N$.
- `confidence::Real`: credibility level $\gamma \in (0,1)$.
- `bounds::Tuple{<:Real, <:Real}`: clipping interval $(\ell, u)$.
- `alpha0::Real`, `beta0::Real`: Beta prior parameters.
Returns
Tuple{Float64, Float64, Float64, Float64}: $(\mu, \sigma, \mathrm{lo}, \mathrm{hi})$.
Notation
Use the same posterior model and symbols as pass_at_k_ci. Let $m = \lceil k/2 \rceil$ and:
\[g(p)=\frac{2}{k}\sum_{j=m+1}^{k}(j-m)\binom{k}{j}p^j(1-p)^{k-j}\]
with $\gamma = \texttt{confidence}$.
Formula
Dataset-level moments are:
\[\mu = \frac{1}{M}\sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)], \quad \sigma = \frac{1}{M}\sqrt{\sum_{\alpha=1}^{M}\mathrm{Var}[g(p_\alpha)]}\]
Credible interval:
\[(\mathrm{lo}, \mathrm{hi}) = \mu \pm z_{(1+\gamma)/2}\sigma\]
then clipped to bounds.
Examples
R = [1 0 1 0;
0 0 1 1]
mu, sigma, lo, hi = mg_pass_at_k_ci(R, 3)