Don’t Pass@𝑘: A Bayesian Framework for Large Language Model Evaluation
- 1Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
- 2Department of Physics, Case Western Reserve University, Cleveland, OH, USA
TL;DR: Replace Pass@k with a posterior-based protocol (Bayes@N) that (i) yields stable, faster-converging rankings, (ii) provides credible intervals and a clear decision rule, and (iii) naturally unifies binary and graded evaluation; avg@N can be reported alongside since it’s rank-equivalent under a uniform prior. 
Story behind
We were benchmarking reasoning models (e.g., DeepSeek-R1, Qwen-thinking) and kept circling the same deceptively simple question: how many trials are enough? With no oracle to consult, we took 11 models from different families (DeepSeek-distilled, Qwen, Microsoft, Nvidia, etc.), ran 8 trials, and noted the ranking. Out of curiosity we ran one more trial, and the leaderboard changed. At 16 trials it flipped again. We pushed all the way to N = 80, and even there AIME’25 refused to settle into a truly stable ordering. That instability pushed us to study the problem systematically (including simulations), asking two questions: (i) given non-deterministic decoding (e.g., top_p sampling), how many trials are actually required? and (ii) if an oracle told us the right N, which evaluation metric best estimates the model’s true performance? This paper is the answer to those questions.
How to use in practice?
We release scorio, which can be installed via pip using pip install scorio, or in Julia with pkg> add Scorio. In addition to Bayes@N, scorio supports many evaluation metrics, including avg@N, Pass@k, Pass^k, G-Pass@k, and mG-Pass@k. The source code and documentation are available at https://mohsenhariri.github.io/bayes-kit.
Consider a dataset with $Q$ questions. An LLM is used to generate $N$ independent trials (samples) per question, and each trial is evaluated using a rubric that assigns it to one of $C$ categories: $c_1, \dots, c_C$. The results are stored in a $Q \times N$ matrix $R$, where $R_{qn}$ denotes the category of the $n$-th trial for question $q$. Optionally, you may include prior runs in a matrix $R_0$ with $N_0$ trials per question.
The simplest and most common rubric is binary ($C = 2$), where each trial is either correct (1) or incorrect (0). In this case, the results matrix $R$ contains only 0s and 1s. Suppose the dataset has $Q$ questions and you run $N$ trials per question, resulting in a $Q \times N$ matrix $R$. See the example below:
from scorio.eval import bayes

# R: Q x N results matrix of category indices (0 = incorrect, 1 = correct)
w = [0, 1]  # rubric weights: 0 for incorrect, 1 for correct
mu, sigma = bayes(R, w)

# If you also have prior runs, e.g., N top_p-sampled trials plus one greedy decoding run (R0):
mu, sigma = bayes(R, w, R0)
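For concreteness, here is a minimal end-to-end sketch with a toy 3-question, 4-trial binary results matrix. It assumes bayes accepts a NumPy array of category indices and follows the import used above; check the scorio documentation for the exact input types.

import numpy as np
from scorio.eval import bayes  # same import as the snippet above

# Toy binary results: Q = 3 questions, N = 4 trials each (1 = correct, 0 = incorrect)
R = np.array([
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
])
w = [0, 1]

mu, sigma = bayes(R, w)
print(f"posterior mean = {mu:.3f}, posterior std = {sigma:.3f}")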
The same call is available in Julia:

using Scorio

w = [0, 1]
mu, sigma = bayes(R, w)

Abstract
Pass@k is widely used to report performance for LLM reasoning, but it often yields unstable and misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@1), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME’24/’25, HMMT’25, and BrUMO’25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass@k and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://mohsenhariri.github.io/bayes-kit.
How to evaluate the evaluation metrics?
We start by fixing a gold standard ranking using a large trial budget $N_{\max}$. We then take each evaluation metric, such as avg@$N$, Pass@$k$, and its variants, compute its ranking at a smaller budget $N \le N_{\max}$ (or $k \le N$ in the case of the Pass@$k$ family), and compare it to the gold standard using Kendall’s $\tau$ across bootstrap replicates. In our experiments on AIME’24, AIME’25, HMMT’25, and BrUMO’25, Bayes@$N$ (and thus avg@$N$) climbs to higher $\tau$ faster and reaches a better plateau than Pass@$k$-style methods, which reflects quicker and more stable agreement with the reference.
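As a rough sketch of this ranking-agreement protocol (our own illustration rather than the scorio implementation; ranking_agreement and its bootstrap-over-questions scheme are assumptions for concreteness):

import numpy as np
from scipy.stats import kendalltau, rankdata

def ranking_agreement(scores_small, scores_gold, n_boot=1000, seed=0):
    """scores_small, scores_gold: (num_models, num_questions) per-question scores
    at the small budget N and at the large gold-standard budget, respectively."""
    rng = np.random.default_rng(seed)
    gold_rank = rankdata(-scores_gold.mean(axis=1))   # gold ranking over models (1 = best)
    num_q = scores_small.shape[1]
    taus = []
    for _ in range(n_boot):
        idx = rng.integers(0, num_q, num_q)           # resample questions with replacement
        small_rank = rankdata(-scores_small[:, idx].mean(axis=1))
        taus.append(kendalltau(small_rank, gold_rank)[0])
    return float(np.mean(taus)), float(np.std(taus))  # mean and spread of Kendall's tau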
We make uncertainty explicit with credible intervals and a clear decision rule: approximate each model's posterior over its score as normal, $\bar p \sim \mathcal{N}(\mu, \sigma^2)$, and compare two models $A$ and $B$ with the score
$$z = \frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}},$$
which we map to a ranking confidence $\Phi(|z|)$, the approximate posterior probability that the apparently better model is truly better.
We only call a winner when the credible intervals do not overlap or when $|z|$ exceeds a preset threshold, for example $|z| \ge 1.96$ for a ranking confidence of approximately $97.5\%$.
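A minimal sketch of this decision rule in code (the posterior summaries fed in below are made-up numbers of the kind returned by bayes):

from scipy.stats import norm

def ranking_confidence(mu_a, sigma_a, mu_b, sigma_b):
    """Normal-approximation score for comparing two models' posterior means."""
    z = (mu_a - mu_b) / (sigma_a**2 + sigma_b**2) ** 0.5
    return z, norm.cdf(abs(z))   # confidence that the apparent winner is truly better

# Made-up posterior summaries for two models, as returned by bayes(R, w):
z, conf = ranking_confidence(0.62, 0.010, 0.58, 0.012)
print(f"z = {z:.2f}, ranking confidence = {conf:.3f}")
# Declare a winner only if the credible intervals don't overlap or conf clears the threshold.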
Finally, we report convergence@$N$, the smallest $N$ after which a method’s ranking matches the gold standard and stays fixed, estimating its distribution via bootstrap. Practically, we summarize this with convergence@$N$ PMFs and CDFs. On the math reasoning benchmarks above, Bayes@$N$ converges reliably with fewer trials than Pass@$k$ variants, which sometimes never settle. A lower mean convergence@$N$ indicates a more compute-efficient evaluation metric.
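For illustration, a tiny helper (hypothetical, not part of scorio) that reads convergence@$N$ off a sequence of rankings indexed by budget:

def convergence_at_n(rankings_by_n, gold_ranking):
    """Smallest budget N after which the ranking equals gold_ranking for every
    larger observed budget; returns None if the method never settles."""
    settled = None
    for n in sorted(rankings_by_n):
        if rankings_by_n[n] == tuple(gold_ranking):
            if settled is None:
                settled = n
        else:
            settled = None   # a later mismatch resets convergence
    return settled

# Example: the ranking matches the gold standard from N = 4 onward -> convergence@N = 4
print(convergence_at_n({2: (2, 1, 3), 4: (1, 2, 3), 8: (1, 2, 3)}, (1, 2, 3)))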
Why Is Pass@k Problematic?
Pass@$k$ produces high-variance and unstable rankings at small evaluation budgets, leading to noisy comparisons between models. It also struggles with quantifying uncertainty, since estimating confidence requires bootstrapping, which becomes unreliable when $N$ is small. In practice, Pass@$k$ converges more slowly to the true model ranking and often fails to stabilize within a reasonable trial budget, unlike Bayes@$N$ or avg@$N$. Moreover, Pass@$k$ reduces outcomes to a simple “any success” criterion, making it unsuitable for graded or rubric-based evaluations, which are crucial for reasoning and RL-fine-tuned models where partial or categorical correctness matters.
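To make the variance point concrete, here is a small Monte Carlo illustration of our own (not one of the paper's experiments): two models with close true success rates are repeatedly evaluated with N = 8 trials on 30 questions, and we count how often each metric ranks them correctly.

import numpy as np

rng = np.random.default_rng(0)
p_a, p_b = 0.35, 0.30            # true per-trial success rates of two close models
n_q, n_trials, n_sims = 30, 8, 5000

def pass_at_k(outcomes):         # Pass@8: fraction of questions with >= 1 success
    return (outcomes.sum(axis=1) > 0).mean()

def avg_at_n(outcomes):          # avg@8: mean per-trial accuracy
    return outcomes.mean()

wins_pass = wins_avg = 0
for _ in range(n_sims):
    a = rng.random((n_q, n_trials)) < p_a
    b = rng.random((n_q, n_trials)) < p_b
    wins_pass += pass_at_k(a) > pass_at_k(b)   # ties count as a failed comparison
    wins_avg += avg_at_n(a) > avg_at_n(b)

print(f"correct-ranking rate  Pass@8: {wins_pass / n_sims:.2f}   avg@8: {wins_avg / n_sims:.2f}")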
Simulation
Try out the interactive demo below.
Algorithm
In principle, we could estimate the model's underlying performance $\bar p$ (the rubric-weighted success probability, averaged over questions) by running an arbitrarily large number of trials $N$ with the LLM, yielding an accurate estimate of $\bar p$. However, we are typically constrained to small $N$ due to limited computational resources. Our goal is to develop a Bayesian approach to estimate $\bar p$ and its associated uncertainty given a finite $N$.
The first step is to construct the posterior distribution
$$P(\mathbf{p}_q \mid R_q),$$
the probability of the per-question category probabilities $\mathbf{p}_q = (p_{q1}, \dots, p_{qC})$ given the $q$-th row of the matrix $R$, denoted $R_q$. This posterior depends on the observed data and a chosen prior distribution for the unknown underlying probability vector $\mathbf{p}_q$. The prior could be uniform (assuming no prior information) or incorporate previously gathered evidence about the LLM's performance.
The Bayesian framework focuses on two key quantities:
Posterior mean of $\bar p$, denoted $\mu$, which is the mean of $\bar p$ over the joint posterior for all questions:
$$\mu = \mathbb{E}\left[\bar p \mid R\right].$$
This is the Bayesian optimal estimator, minimizing the quadratic loss function
$$\mathcal{L}(\hat p) = \mathbb{E}\left[(\hat p - \bar p)^2\right]$$
over all possible estimators $\hat p(R)$, where the expectation is taken over all possible $\bar p$ and realizations of $R$.
Posterior variance, denoted $\sigma^2$, which quantifies the uncertainty of the estimate:
$$\sigma^2 = \mathbb{E}\left[(\bar p - \mu)^2 \mid R\right].$$
Both $\mu$ and $\sigma^2$ have exact closed-form expressions, derived in Appendix A of the paper, and can be efficiently computed for any weighted rubric using the Algorithm below.
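As a concrete special case, the sketch below works out the binary rubric with a uniform Beta(1, 1) prior per question: the posterior for question $q$ with $s_q$ successes out of $N$ trials is Beta$(s_q + 1,\, N - s_q + 1)$, and averaging independent questions gives $\mu$ and $\sigma$. This is our own illustration of the idea, not the general categorical/Dirichlet expressions from Appendix A, which scorio's bayes computes directly.

import numpy as np

def bayes_binary_uniform(R):
    """Posterior mean and std of p_bar for 0/1 outcomes under independent
    uniform Beta(1, 1) priors, one per question."""
    R = np.asarray(R, dtype=float)
    Q, N = R.shape
    s = R.sum(axis=1)                                  # successes per question
    a, b = s + 1.0, N - s + 1.0                        # Beta posterior parameters
    mean_q = a / (a + b)                               # per-question posterior mean
    var_q = a * b / ((a + b) ** 2 * (a + b + 1.0))     # per-question posterior variance
    mu = mean_q.mean()                                 # posterior mean of p_bar
    sigma = np.sqrt(var_q.sum()) / Q                   # posterior std of p_bar (questions independent)
    return mu, sigma

R = np.array([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]])
print(bayes_binary_uniform(R))   # posterior mean is shrunk toward 0.5 relative to raw accuracy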