Don’t Pass@𝑘: A Bayesian Framework for Large Language Model Evaluation

  • 1Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
  • 2Department of Physics, Case Western Reserve University, Cleveland, OH, USA

TL;DR: Replace Pass@k with a posterior-based protocol (Bayes@N) that (i) yields stable, faster-converging rankings, (ii) provides credible intervals and a clear decision rule, and (iii) naturally unifies binary and graded evaluation; avg@N can be reported alongside since it’s rank-equivalent under a uniform prior.

Story behind

We were benchmarking reasoning models (e.g., DeepSeek-R1, Qwen-thinking) and kept circling the same deceptively simple question: how many trials are enough? With no oracle to consult, we took 11 models from different families (DeepSeek-distilled, Qwen, Microsoft, Nvidia, etc.), ran 8 trials, and noted the ranking; then, out of curiosity, we ran one more, and the leaderboard changed. At 16 trials it flipped again. We pushed all the way to N = 80, and even there AIME’25 refused to settle into a truly stable ordering. That instability pushed us to study two questions systematically (including with simulations): (i) given non-deterministic decoding (e.g., top_p sampling), how many trials are actually required? and (ii) if an oracle told us the right N, which evaluation metric best estimates the model’s true performance? This paper is our answer to those questions.

How to use in practice?

We release scorio, which can be installed via pip using pip install scorio, or in Julia with pkg> add Scorio. In addition to Bayes@N, scorio supports many evaluation metrics, including avg@N, Pass@k, Pass^k, G-Pass@k, and mG-Pass@k. The source code and documentation are available at https://mohsenhariri.github.io/bayes-kit.

Consider a dataset with $M$ questions. An LLM generates $N$ independent trials (samples) per question, and each trial is evaluated with a rubric that assigns it to one of $C + 1$ categories $(0, 1, \ldots, C)$. The results are stored in an $M \times N$ matrix $R$, where $R_{\alpha i}$ denotes the category of the $i$-th trial for question $\alpha$. Optionally, you may include prior runs in a matrix $R^{0}$ with $D$ trials per question.

The simplest and most common rubric is binary ($C = 1$), where each trial is either correct (1) or incorrect (0). In this case, the results matrix $R$ contains only 0s and 1s. Suppose the dataset has $M$ questions and you run $N$ trials per question, resulting in an $M \times N$ matrix $R$. See the example below:

Python:

from scorio import eval

# R is the M x N results matrix of trial categories (here 0 or 1)
w = [0, 1]  # weight 0 for incorrect, 1 for correct
mu, sigma = eval.bayes(R, w)

# If you also have prior runs, say N top_p samples plus one greedy decoding run
# stored in a matrix R0, pass them as the optional prior:
mu, sigma = eval.bayes(R, w, R0)

Julia:

using Scorio

w = [0, 1]
mu, sigma = bayes(R, w)

Abstract

Pass@k is widely used to report performance for LLM reasoning, but it often yields unstable and misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@1), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME’24/’25, HMMT’25, and BrUMO’25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass@k and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://mohsenhariri.github.io/bayes-kit.

How to evaluate the evaluation metrics?

We start by fixing a gold-standard ranking using a large trial budget. We then take each evaluation metric, such as avg@$N$, Pass@$k$, and its variants, compute its ranking at a smaller budget $n \in [1, N]$ (or $n \in [k, N]$ for the Pass@$k$ family), and compare it to the gold standard using Kendall’s $\tau$ across bootstrap replicates. In our experiments on AIME’24, AIME’25, HMMT’25, and BrUMO’25, Bayes@$N$ (and thus avg@$N$) climbs to higher $\tau$ faster and reaches a higher plateau than Pass@$k$-style methods, reflecting quicker and more stable agreement with the reference.
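As a concrete sketch of this protocol for the binary case, the snippet below measures rank agreement of avg@$n$ scores (rank-equivalent to Bayes@$n$ under a uniform prior) with the gold standard; the helper and variable names here (tau_at_n, R_all) are illustrative and not part of the scorio API.

import numpy as np
from scipy.stats import kendalltau

def tau_at_n(R_all, n, n_boot=200, seed=0):
    # R_all[m]: M x N 0/1 results matrix for model m at the full trial budget N.
    # Gold standard: scores from all N trials; smaller budgets resample n columns.
    rng = np.random.default_rng(seed)
    gold = [R.mean() for R in R_all]
    taus = []
    for _ in range(n_boot):
        cols = rng.integers(0, R_all[0].shape[1], size=n)  # bootstrap n of the N trials
        scores = [R[:, cols].mean() for R in R_all]
        tau, _ = kendalltau(scores, gold)                  # rank agreement with gold
        taus.append(tau)
    return float(np.mean(taus))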

We make uncertainty explicit with credible intervals and a clear decision rule: approximating the posterior for $\bar{\pi}$ as normal, we compare two models with the $z$ score

$$z = \frac{|\mu - \mu'|}{\sqrt{\sigma^{2} + (\sigma')^{2}}},$$

and map $z$ to a ranking confidence

$$\rho = \tfrac{1}{2}\left(1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right).$$

We only call a winner when credible intervals do not overlap or when $z$ exceeds a preset threshold, for example $1.645$ for approximately $95\%$.
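A minimal helper for this decision rule, assuming you already have posterior $(\mu, \sigma)$ pairs for the two models; the function name and the numbers in the usage lines are illustrative.

import math

def ranking_confidence(mu_a, sigma_a, mu_b, sigma_b):
    # z score for the gap and the ranking confidence rho = Phi(z)
    z = abs(mu_a - mu_b) / math.sqrt(sigma_a**2 + sigma_b**2)
    rho = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, rho

z, rho = ranking_confidence(0.62, 0.02, 0.57, 0.02)  # toy numbers
print(z > 1.645, round(rho, 3))                      # declare a winner only past the threshold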

Finally, we report convergence@$N$, the smallest $N$ after which a method’s ranking matches the gold standard and stays fixed, estimating its distribution via bootstrap. Practically, we summarize this with convergence($k$) PMFs and CDFs. On the math reasoning benchmarks above, Bayes@$N$ converges reliably with fewer trials than Pass@$k$ variants, which sometimes never settle. A lower mean convergence@$N$ indicates a more compute-efficient evaluation metric.
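The per-replicate statistic is easy to compute; here is a small sketch, assuming rankings[n] holds the model ordering obtained from the first n+1 trials (an illustrative helper, not part of scorio):

def convergence_at_n(rankings, gold):
    # rankings[n]: model ordering computed from the first n+1 trials.
    # Returns the smallest trial count after which the ranking equals `gold`
    # and never changes again; None if it never settles within the budget.
    for n in range(len(rankings)):
        if all(list(r) == list(gold) for r in rankings[n:]):
            return n + 1
    return None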

Why is Pass@k Problematic?

Pass@$k$ produces high-variance and unstable rankings at small evaluation budgets, leading to noisy comparisons between models. It also struggles with quantifying uncertainty, since estimating confidence requires bootstrapping, which becomes unreliable when $N$ is small. In practice, Pass@$k$ converges more slowly to the true model ranking and often fails to stabilize within a reasonable trial budget, unlike Bayes@$N$ or avg@$N$. Moreover, Pass@$k$ reduces outcomes to a simple “any success” criterion, making it unsuitable for graded or rubric-based evaluations, which are crucial for reasoning and RL-fine-tuned models where partial or categorical correctness matters.

Simulation

Try out the interactive demo on the project page.
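As a rough, self-contained stand-in, the sketch below mimics the spirit of the paper’s ground-truth simulations: two hypothetical models with known per-question success probabilities are re-evaluated many times at a small budget, and we count how often each metric’s winner disagrees with its own large-budget winner. All names and numbers here are illustrative, not the paper’s setup.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
M, N_small, N_big, k = 30, 8, 2000, 4

# Two hypothetical models with known per-question success probabilities.
p_true = {"A": rng.uniform(0.2, 0.9, M), "B": rng.uniform(0.2, 0.9, M)}

def sample(model, n):
    return rng.binomial(1, p_true[model][:, None], size=(M, n))

def pass_at_k(R, k):
    # Standard unbiased Pass@k estimator from N samples per question.
    n = R.shape[1]
    return np.mean([1 - comb(n - c, k) / comb(n, k) for c in R.sum(axis=1)])

# "Gold" winner for each metric, taken from a very large budget.
R_big = {m: sample(m, N_big) for m in p_true}
gold = {"pass": pass_at_k(R_big["A"], k) > pass_at_k(R_big["B"], k),
        "avg":  R_big["A"].mean() > R_big["B"].mean()}

flips = {"pass": 0, "avg": 0}
for _ in range(500):  # repeated small-budget evaluation runs
    R = {m: sample(m, N_small) for m in p_true}
    flips["pass"] += (pass_at_k(R["A"], k) > pass_at_k(R["B"], k)) != gold["pass"]
    flips["avg"]  += (R["A"].mean() > R["B"].mean()) != gold["avg"]

print(flips)  # runs whose small-budget winner disagrees with the large-budget winner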

Algorithm

In principle, we could estimate $\boldsymbol{\pi}_\alpha$ by running an arbitrarily large number of trials with the LLM, yielding an accurate estimate of $\bar{\pi}$. However, we are typically constrained to small $N$ due to limited computational resources. Our goal is to develop a Bayesian approach to estimate $\bar{\pi}$ and its associated uncertainty given a finite $N$.

The first step is to construct the posterior distribution

$$\mathcal{P}(\boldsymbol{\pi}_\alpha \mid \boldsymbol{R}_\alpha),$$

the probability of $\boldsymbol{\pi}_\alpha$ given the $\alpha$-th row of the matrix $R$, denoted $\boldsymbol{R}_\alpha$. This posterior depends on the observed data $\boldsymbol{R}_\alpha$ and a chosen prior distribution $\mathcal{P}(\boldsymbol{\pi}_\alpha)$ for the unknown underlying probability vector $\boldsymbol{\pi}_\alpha$. The prior could be uniform (assuming no prior information) or incorporate previously gathered evidence about the LLM's performance.
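To make the update concrete in the notation of the algorithm below: the categorical likelihood is conjugate to the Dirichlet prior, so the posterior is again Dirichlet with the observed counts added (a standard conjugacy fact, stated here for completeness):

$$\mathcal{P}(\boldsymbol{\pi}_\alpha \mid \boldsymbol{R}_\alpha) \;\propto\; \prod_{k=0}^{C} \pi_{\alpha k}^{\,n_{\alpha k}} \;\cdot\; \prod_{k=0}^{C} \pi_{\alpha k}^{\,n^{0}_{\alpha k}-1} \;=\; \prod_{k=0}^{C} \pi_{\alpha k}^{\,\nu_{\alpha k}-1}, \qquad \mathbb{E}\!\left[\pi_{\alpha k} \mid \boldsymbol{R}_\alpha\right] = \frac{\nu_{\alpha k}}{T},$$

where $n_{\alpha k}$ counts the observed trials of category $k$, $n^{0}_{\alpha k} = 1 + (\text{prior-run counts of category } k)$, $\nu_{\alpha k} = n^{0}_{\alpha k} + n_{\alpha k}$, and $T = \sum_{k} \nu_{\alpha k} = 1 + C + D + N$.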

The Bayesian framework focuses on two key quantities:

  1. Posterior mean of $\bar{\pi}$, denoted $\mu(R)$, which is the mean of $\bar{\pi}$ over the joint posterior for all questions:

    $$\mu(R) = \mathbb{E}_{\{\boldsymbol{\pi}_\alpha\} \mid R}\left[ \bar{\pi} \right].$$

    This is the Bayesian optimal estimator, minimizing the quadratic loss function

    $$\mathcal{L}(\bar{\pi}^{\mathrm{est}}) = \mathbb{E}_{R, \boldsymbol{\pi}_\alpha} \left[ \bar{\pi}^{\mathrm{est}}(R) - \bar{\pi} \right]^2,$$

    over all possible estimators $\bar{\pi}^{\mathrm{est}}(R)$, where the expectation is taken over all possible $\boldsymbol{\pi}_\alpha$ and realizations of $R$.

  2. Posterior variance, denoted $\sigma^2(R)$, which quantifies the uncertainty of the $\mu$ estimate:

    $$\sigma^2(R) = \mathrm{Var}_{\{\boldsymbol{\pi}_\alpha\} \mid R}\left[ \bar{\pi} \right].$$

Both $\mu(R)$ and $\sigma^2(R)$ have exact closed-form expressions, derived in Appendix A of the paper, and can be efficiently computed for any $R$ using the algorithm below.

$$
\small
\begin{aligned}
&\textbf{Algorithm: } \text{LLM performance evaluation using the } \texttt{Bayes}(N) \text{ framework} \\ \\
&\textbf{Function } \texttt{EvaluatePerformance}(R, [R^0], \mathbf{w}) \\
&\quad \textbf{Input: } R \in \mathbb{R}^{M \times N}, \; R_{\alpha i} \in \{0, \ldots, C\} \\
&\quad \phantom{\textbf{Input: }} \mathbf{w} = (w_0,\ldots,w_C) \; \text{weights defining performance metric } \bar{\pi} \\
&\quad \textbf{Optional Input: } R^0 \in \mathbb{R}^{M \times D} \text{ for prior (otherwise } D = 0 \text{)} \\
&\quad \textbf{Output: } \mu \text{ (performance estimate), } \sigma \text{ (uncertainty)} \\ \\
& T \gets 1 + C + D + N \\
& \textbf{For } \alpha = 1 \text{ to } M \\
& \quad \textbf{For } k = 0 \text{ to } C \\
& \qquad n_{\alpha k} \gets \sum_{i=1}^N \delta_{k, R_{\alpha i}} \\
& \qquad n^0_{\alpha k} \gets 1 + \sum_{i=1}^D \delta_{k, R^0_{\alpha i}} \\
& \qquad \nu_{\alpha k} \gets n^0_{\alpha k} + n_{\alpha k} \\
& \quad \textbf{End For} \\
& \textbf{End For} \\ \\
& \mu \gets w_0 + \frac{1}{M T} \sum_{\alpha=1}^M \sum_{j=0}^C \nu_{\alpha j}(w_j - w_0) \\ \\
& \sigma \gets \left[ \frac{1}{M^2 (T+1)} \sum_{\alpha=1}^M \left\{ \sum_{j=0}^C \frac{\nu_{\alpha j}}{T}(w_j - w_0)^2 - \left( \sum_{j=0}^C \frac{\nu_{\alpha j}}{T}(w_j - w_0) \right)^2 \right\} \right]^{1/2} \\ \\
& \textbf{return } \mu, \sigma
\end{aligned}
$$
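For readers who want to check the arithmetic, here is a direct NumPy translation of the pseudocode above, written for clarity rather than speed (illustrative only; the supported implementation is the scorio package):

import numpy as np

def evaluate_performance(R, w, R0=None):
    # Direct translation of the Bayes(N) pseudocode above (illustrative;
    # use scorio for the supported implementation).
    R = np.asarray(R)
    w = np.asarray(w, dtype=float)
    M, N = R.shape
    C = len(w) - 1
    D = 0 if R0 is None else np.asarray(R0).shape[1]
    T = 1 + C + D + N

    # nu[alpha, k] = 1 (flat Dirichlet prior) + prior-run counts + observed counts
    nu = np.ones((M, C + 1))
    for k in range(C + 1):
        nu[:, k] += (R == k).sum(axis=1)
        if R0 is not None:
            nu[:, k] += (np.asarray(R0) == k).sum(axis=1)

    dw = w - w[0]                       # (w_j - w_0)
    p = nu / T                          # posterior category probabilities per question
    mu = w[0] + float((p @ dw).mean())
    per_q = p @ dw**2 - (p @ dw)**2     # per-question posterior variance term
    sigma = float(np.sqrt(per_q.sum() / (M**2 * (T + 1))))
    return mu, sigma

# Toy usage: 30 questions, 8 binary trials each
R = np.random.randint(0, 2, size=(30, 8))
print(evaluate_performance(R, [0, 1]))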

Citation

@article{hariri2025don,
  title={Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},
  author={Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},
  journal={arXiv preprint arXiv:2510.04265},
  year={2025}
}

Acknowledgments

This research was supported in part by NSF awards 2117439, 2112606, and 2320952.

Contact

For questions or correspondence, don't hesitate to reach out.