Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation
A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.
Update: 👋 I’m working on a follow-up to this project. If you’d like to discuss it, my email is in the footer.
TL;DR: Replace Pass@k with a posterior-based protocol, Bayes@N, that gives more stable rankings, credible intervals, and a clear decision rule while handling both binary and graded evaluation. Avg@N can be reported alongside it, since the two are rank-equivalent under a uniform prior.
Story Behind the Paper
We were benchmarking reasoning models such as DeepSeek-R1 and Qwen-thinking, and kept coming back to a deceptively simple question: how many trials are enough? We took 11 models from several families, including DeepSeek-distilled, Qwen, Microsoft, and Nvidia, ran 8 trials, and recorded the ranking. Out of curiosity, we ran one more trial, and the leaderboard changed. At 16 trials it changed again. We pushed to N = 80, and even then AIME’25 did not settle into a stable order. That instability led us to study the problem systematically, including with simulations. With non-deterministic decoding such as top_p, how many trials do we actually need? And if we knew the right N, which evaluation metric would best estimate a model’s true performance? This paper answers those questions.
Using It in Practice
scorio is available for Python with pip install scorio and for Julia with pkg> add Scorio. Besides Bayes@N, it supports avg@N, Pass@k, Pass^k, G-Pass@k, and mG-Pass@k. The source code is at https://mohsenhariri.github.io/scorio, with documentation at https://scorio.readthedocs.io. For worked examples, see the Bayes evaluation tutorial.
Consider a dataset with M questions. For each question, the LLM generates N independent trials (samples), and each trial is evaluated with a rubric that assigns one of C + 1 categories: \{0, 1, \ldots, C\}. The results are stored in an M \times N matrix R, where R_{\alpha i} denotes the category of the i-th trial for question \alpha. We can also include prior runs in a matrix R^{0} with D trials per question.
The simplest and most common rubric is binary (C = 1), where each trial is either correct (1) or incorrect (0). In that setting, the results matrix R contains only 0s and 1s, as in the example below:
Python
import numpy as np
from scorio import eval
R_binary = np.array([
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 1],
[1, 0, 0, 1, 0],
[0, 0, 1, 0, 0],
], dtype=int)
w = [0, 1] # 0 for incorrect, 1 for correct
mu, sigma = eval.bayes(R_binary, w)
Julia
using Scorio
# same binary results matrix as above, in Julia syntax
R = [1 1 1 1 1;
     1 1 1 0 1;
     1 0 0 1 0;
     0 0 1 0 0]
w = [0, 1]  # 0 for incorrect, 1 for correct
mu, sigma = bayes(R, w)

| Function | Returns | Description |
|---|---|---|
| bayes(R, w, R0) | (\mu, \sigma) | Bayesian posterior mean and uncertainty |
| bayes_ci(R, w, R0) | (\mu, \sigma, \mathrm{lo}, \mathrm{hi}) | Posterior mean and uncertainty, plus credible interval |
| avg(R, w) | (a, \sigma_a) | Weighted average and uncertainty |
| avg_ci(R, w) | (a, \sigma_a, \mathrm{lo}, \mathrm{hi}) | Weighted average and uncertainty, plus credible interval |
| pass_at_k(R, k) | p | Pass@k estimate |
| pass_at_k_ci(R, k) | (\mu, \sigma, \mathrm{lo}, \mathrm{hi}) | Pass@k estimate, plus credible interval |
| pass_hat_k(R, k) | p | Pass^k estimate |
| pass_hat_k_ci(R, k) | (\mu, \sigma, \mathrm{lo}, \mathrm{hi}) | Pass^k estimate, plus credible interval |
| g_pass_at_k_tau(R, k, tau) | p | G-Pass@k_{\tau} estimate |
| mg_pass_at_k(R, k) | p | mG-Pass@k estimate |
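As a quick usage sketch, assuming the table’s functions are exposed under the same eval namespace as bayes above (the budget k = 2 is arbitrary):
import numpy as np
from scorio import eval
R = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
], dtype=int)
w = [0, 1]
# posterior mean, uncertainty, and credible interval
mu, sigma, lo, hi = eval.bayes_ci(R, w)
# Pass@2 on the same results matrix
p2 = eval.pass_at_k(R, 2)
print(f"Bayes: mu={mu:.4f} [{lo:.4f}, {hi:.4f}]; Pass@2={p2:.4f}")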
Abstract
Pass@k is widely used to report LLM reasoning performance, but with limited samples it can produce unstable and misleading rankings. We present a Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials with posterior estimates of a model’s underlying success probability and credible intervals. This gives more stable rankings and a transparent rule for deciding when two models differ. Evaluation outcomes are modeled as categorical rather than just 0/1, with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and allowing prior evidence to be incorporated when appropriate. Under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@1), which helps explain its empirical stability while adding uncertainty estimates. In simulations with known ground-truth success rates, and on AIME’24/’25, HMMT’25, and BrUMO’25, the Bayesian/avg procedure converges faster and gives more stable rankings than Pass@k and recent variants, making reliable comparisons possible with far fewer samples. The framework separates statistically meaningful gaps, where credible intervals do not overlap, from noise, and extends naturally to graded, rubric-based evaluations. These results support replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that handles binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://mohsenhariri.github.io/scorio.
Evaluating the Evaluation Metrics
We start by fixing a gold-standard ranking using a large trial budget. For each metric, such as avg@N, Pass@k, and its variants, we compute the ranking at smaller n, where n \in [1, N] (or n \in [k, N] for the Pass@k family), and compare it with the gold standard using Kendall’s \tau across bootstrap replicates. In our experiments on AIME’24, AIME’25, HMMT’25, and BrUMO’25, Bayes@N and avg@N climb to higher \tau faster and reach a higher plateau than Pass@k methods, indicating quicker and more stable agreement with the reference ranking.
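As a sketch of this protocol (the scores below are made-up placeholders, not our experimental data), using scipy for Kendall’s \tau:
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
M = 30  # questions
# hypothetical per-question scores for 5 models at a small budget n
scores_n = rng.random((5, M))
# hypothetical gold-standard scores fixed at the full budget N
gold_scores = np.array([0.70, 0.63, 0.55, 0.48, 0.41])
taus = []
for _ in range(1000):  # bootstrap over questions
    idx = rng.integers(0, M, size=M)
    tau, _ = kendalltau(scores_n[:, idx].mean(axis=1), gold_scores)
    taus.append(tau)
print(f"mean Kendall tau vs. gold ranking: {np.mean(taus):.3f}")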
To make uncertainty explicit, we use credible intervals and a clear decision rule. We approximate the posterior for \bar{\pi} as normal and compare two models with the z score
z = \frac{|\mu - \mu'|}{\sqrt{\sigma^{2} + (\sigma')^{2}}},
and map z to a ranking confidence
\rho = \tfrac{1}{2}\!\left(1 + \mathrm{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right).
We only call a winner when credible intervals do not overlap or when z exceeds a preset threshold, such as 1.645 for approximately 95\%.
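For concreteness, a minimal sketch of this decision rule; the mu and sigma values are hypothetical, standing in for two models’ eval.bayes outputs:
from math import erf, sqrt

# hypothetical posterior summaries (mu, sigma) for models A and B
mu_a, sigma_a = 0.62, 0.03
mu_b, sigma_b = 0.55, 0.04
z = abs(mu_a - mu_b) / sqrt(sigma_a**2 + sigma_b**2)  # z = 1.400
rho = 0.5 * (1 + erf(z / sqrt(2)))                    # rho ~ 0.919
print(f"z={z:.3f}, ranking confidence rho={rho:.3f}")
print("winner" if z > 1.645 else "statistical tie")   # ~95% threshold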
Finally, we report convergence@N, the smallest N after which a method’s ranking matches the gold standard and stays fixed, and estimate its distribution via bootstrap, summarizing it with PMFs and CDFs. On the math reasoning benchmarks above, Bayes@N converges reliably with fewer trials than Pass@k variants, which sometimes never settle. A lower mean convergence@N indicates a more compute-efficient evaluation metric.
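For intuition, a minimal sketch of computing convergence@N on a single bootstrap replicate (rankings and gold are toy values; the paper repeats this across replicates to estimate the PMF and CDF):
import numpy as np

def convergence_at_n(rankings, gold):
    """Smallest n after which the ranking equals gold and never changes again."""
    conv = None
    for n, r in enumerate(rankings, start=1):
        if np.array_equal(r, gold):
            if conv is None:
                conv = n      # candidate convergence point
        else:
            conv = None       # ranking drifted away again; reset
    return conv               # None if the method never settles

# toy usage: rankings of 3 models at budgets n = 1..5 vs. the gold order
gold = [0, 1, 2]
rankings = [[1, 0, 2], [0, 1, 2], [0, 2, 1], [0, 1, 2], [0, 1, 2]]
print(convergence_at_n(rankings, gold))  # 4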
Why Pass@k Is Problematic
Pass@k produces high variance and unstable rankings at small evaluation budgets, which makes model comparisons noisy. It also gives no direct uncertainty estimate; bootstrapping can be unreliable when N is small. In practice, Pass@k converges more slowly to the true model ranking and often fails to stabilize within a reasonable trial budget, unlike Bayes@N or avg@N. Because Pass@k reduces outcomes to a simple “any success” criterion, it is also poorly suited to graded or rubric-based evaluations, where partial or categorical correctness matters.
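To see the instability concretely, here is a small simulation using the standard combinatorial Pass@k estimator; the per-trial success rate p_true is hypothetical, chosen only to illustrate the spread across repeated small-budget evaluations:
import numpy as np

def pass_at_k(n, c, k):
    """Standard unbiased Pass@k estimate from n trials with c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

rng = np.random.default_rng(1)
p_true = 0.3  # hypothetical per-trial success rate for one question
n, k = 8, 4   # small evaluation budget
# repeat the whole n-trial evaluation many times and watch the estimate swing
estimates = [pass_at_k(n, rng.binomial(n, p_true), k) for _ in range(2000)]
print(f"Pass@{k} from n={n} trials: "
      f"mean={np.mean(estimates):.3f}, std={np.std(estimates):.3f}")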
Simulation
Try out the interactive demo below.
Algorithm
In principle, we could estimate \boldsymbol{\pi}_\alpha, the vector of category probabilities for question \alpha, by running an arbitrarily large number of trials with the LLM, yielding an accurate estimate of \bar{\pi} = \frac{1}{M}\sum_{\alpha=1}^{M}\sum_{j=0}^{C} w_j \pi_{\alpha j}, the rubric-weighted performance averaged over questions. In practice, compute limits usually force us to work with small N. Our goal is to use a Bayesian approach to estimate \bar{\pi} and its uncertainty from a finite N.
The first step is to construct the posterior distribution
\mathcal{P}(\boldsymbol{\pi}_\alpha \mid \boldsymbol{R}_\alpha),
the probability of \boldsymbol{\pi}_\alpha given the \alpha-th row of the matrix R, denoted \boldsymbol{R}_\alpha. This posterior depends on the observed data \boldsymbol{R}_\alpha and a chosen prior distribution \mathcal{P}(\boldsymbol{\pi}_\alpha) for the unknown underlying probability vector \boldsymbol{\pi}_\alpha. The prior can be uniform, when no prior information is available, or it can encode previously gathered evidence about the LLM’s performance.
The Bayesian estimate uses two key quantities:
1. The posterior mean of \bar{\pi}, denoted \mu(R), which is the mean of \bar{\pi} over the joint posterior for all questions:
\mu(R) = \mathbb{E}_{\{\boldsymbol{\pi}_\alpha\} \mid R}\left[ \bar{\pi} \right].
This is the Bayesian optimal estimator: it minimizes the quadratic loss
\mathcal{L}(\bar{\pi}^{\mathrm{est}}) = \mathbb{E}_{R, \boldsymbol{\pi}_\alpha} \left[ \bar{\pi}^{\mathrm{est}}(R) - \bar{\pi} \right]^2
over all possible estimators \bar{\pi}^{\mathrm{est}}(R), where the expectation is taken over all possible \boldsymbol{\pi}_\alpha and realizations of R.
2. The posterior variance, denoted \sigma^2(R), which quantifies the uncertainty in the estimate \mu:
\sigma^2(R) = \mathrm{Var}_{\{\boldsymbol{\pi}_\alpha\} \mid R}\left[ \bar{\pi} \right].
Both \mu(R) and \sigma^2(R) have exact closed-form expressions, derived in Appendix A of the paper, and can be computed efficiently for any R using the algorithm below.
\small \begin{aligned} &\textbf{Algorithm: } \text{LLM performance evaluation using the } \texttt{Bayes}(N) \text{ framework} \\ \\ &\textbf{Function } \texttt{EvaluatePerformance}(R, [R^0], \mathbf{w}) \\ &\quad \textbf{Input: } R \in \mathbb{R}^{M \times N}, \; R_{\alpha i} \in \{0, \ldots, C\} \\ &\quad \phantom{\textbf{Input: }} \mathbf{w} = (w_0,\ldots,w_C) \; \text{weights defining performance metric } \bar{\pi} \\ &\quad \textbf{Optional Input: } R^0 \in \mathbb{R}^{M \times D} \text{ for prior (otherwise } D = 0 \text{)} \\ &\quad \textbf{Output: } \mu \text{ (performance estimate), } \sigma \text{ (uncertainty)} \\ \\ & T \gets 1 + C + D + N \\ & \textbf{For } \alpha = 1 \text{ to } M \\ & \quad \textbf{For } k = 0 \text{ to } C \\ & \qquad n_{\alpha k} \gets \sum_{i=1}^N \delta_{k, R_{\alpha i}} \\ & \qquad n^0_{\alpha k} \gets 1 + \sum_{i=1}^D \delta_{k, R^0_{\alpha i}} \\ & \qquad \nu_{\alpha k} \gets n^0_{\alpha k} + n_{\alpha k} \\ & \quad \textbf{End For} \\ & \textbf{End For} \\ \\ & \mu \gets w_0 + \frac{1}{M T} \sum_{\alpha=1}^M \sum_{j=0}^C \nu_{\alpha j}(w_j - w_0) \\ \\ & \sigma \gets \left[ \frac{1}{M^2 (T+1)} \sum_{\alpha=1}^M \left\{ \sum_{j=0}^C \frac{\nu_{\alpha j}}{T}(w_j - w_0)^2 - \left( \sum_{j=0}^C \frac{\nu_{\alpha j}}{T}(w_j - w_0) \right)^2 \right\} \right]^{1/2} \\ \\ & \textbf{return } \mu, \sigma \end{aligned}
Code Examples
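Reference NumPy Translation of the Algorithm
To make the algorithm above concrete, here is a direct NumPy translation. It is a minimal sketch for illustration: bayes_eval is a hypothetical helper, not part of scorio; in practice, call eval.bayes.
import numpy as np

def bayes_eval(R, w, R0=None):
    """Direct translation of the Bayes(N) algorithm above (hypothetical helper).
    R: (M, N) matrix with entries in {0..C}; w: C+1 rubric weights;
    R0: optional (M, D) matrix of prior trials."""
    R = np.asarray(R)
    w = np.asarray(w, dtype=float)
    M, N = R.shape
    C = len(w) - 1
    D = 0 if R0 is None else np.asarray(R0).shape[1]
    T = 1 + C + D + N
    # nu[alpha, k] = 1 (uniform prior) + prior-run counts + observed counts
    nu = np.ones((M, C + 1))
    for k in range(C + 1):
        nu[:, k] += (R == k).sum(axis=1)
        if R0 is not None:
            nu[:, k] += (np.asarray(R0) == k).sum(axis=1)
    dw = w - w[0]
    mu = w[0] + (nu @ dw).sum() / (M * T)
    p = nu / T  # posterior category probabilities per question
    var = (p @ dw**2 - (p @ dw) ** 2).sum() / (M**2 * (T + 1))
    return mu, np.sqrt(var)

# reproduces the binary example from the beginning of this page
R_binary = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
])
print(bayes_eval(R_binary, [0, 1]))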
Categorical Evaluation with Correctness and token_ratio
Bayes@N can score more than binary correctness. Here, correctness stores the binary outcome and token_ratio is each answer’s length relative to a budget; values above length_threshold are flagged as verbose. The snippet combines the two signals into four categories, then uses w to give full credit to correct, efficient answers, partial credit to correct but verbose answers, and a small score to wrong but efficient answers.
import numpy as np
from scorio import eval
correctness = np.array([
[1, 1, 0, 1],
[1, 0, 0, 1],
[0, 1, 1, 1],
[1, 1, 0, 0],
], dtype=int)
token_ratio = np.array([
[0.80, 1.60, 0.90, 1.10],
[1.40, 1.00, 1.70, 0.70],
[1.80, 0.90, 1.40, 0.95],
[0.85, 1.30, 1.10, 1.60],
])
length_threshold = 1.20
verbose = token_ratio > length_threshold
# 0 = wrong & verbose, 1 = wrong & efficient,
# 2 = correct & verbose, 3 = correct & efficient
R_cat = np.where(
correctness == 1,
np.where(verbose, 2, 3),
np.where(verbose, 0, 1),
).astype(int)
w = np.array([0.0, 0.2, 0.75, 1.0])
mu_binary, sigma_binary = eval.bayes(correctness)
mu_cat, sigma_cat = eval.bayes(R_cat, w)
# binary evaluation
print(f"mu={mu_binary:.4f}, sigma={sigma_binary:.4f}")
# categorical evaluation
print(f"mu={mu_cat:.4f}, sigma={sigma_cat:.4f}")
Greedy Run as Prior Evidence for top_p Samples
top_p_runs contains seven stochastic samples for each question. greedy_prior adds one deterministic greedy run per question as R0, so the posterior starts from that earlier evidence rather than from a uniform prior alone. The two bayes_ci calls report the estimate with and without the prior; in this toy data, the greedy run nudges the mean upward and narrows the interval.
import numpy as np
from scorio import eval
# M = 5 questions, N = 7 sampled runs per question
top_p_runs = np.array([
[1, 1, 1, 1, 0, 1, 1],
[1, 0, 0, 1, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0],
[1, 1, 1, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 0, 0],
], dtype=int)
greedy_prior = np.array([
[1],
[1],
[0],
[1],
[0],
], dtype=int)
mu, sigma, lo, hi = eval.bayes_ci(top_p_runs)
mu_prior, sigma_prior, lo_p, hi_p = eval.bayes_ci(top_p_runs, R0=greedy_prior)
# without prior: mu=0.4667, sigma=0.0629, 95% CrI=[0.3435, 0.5899]
print(f"mu={mu:.4f}, sigma={sigma:.4f}, " f"95% CrI=[{lo:.4f}, {hi:.4f}]")
# with greedy prior: mu=0.4800, sigma=0.0585, 95% CrI=[0.3654, 0.5946]
print(f"mu={mu_prior:.4f}, sigma={sigma_prior:.4f}, " f"95% CrI=[{lo_p:.4f}, {hi_p:.4f}]")
For more detail, choose the in-depth version from the left sidebar.