Case Western Reserve University
National Science Foundation

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

Why replace Pass@k?

  • Pass@k estimates the probability of at least one success in k tries, not a model's underlying success probability.
  • With limited samples, rankings can be unstable and sensitive to decoding and sampling noise.
  • It lacks a simple closed-form uncertainty estimate and does not naturally extend to partial-credit rubrics.
  1. Run N samples per question
  2. Score each attempt (binary/rubric)
  3. Tally category counts
  4. Form the Dirichlet posterior over outcomes
  5. Report μ, σ, and credible intervals

Bayes@N turns evaluation into posterior inference over question-level outcome probabilities.
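In the binary case the Dirichlet posterior reduces to a Beta distribution. A minimal sketch of the per-question posterior, assuming a uniform prior (function and parameter names are ours, not the authors'):

```python
from math import sqrt

def bayes_at_n(successes: int, n: int, prior_a: float = 1.0, prior_b: float = 1.0):
    """Beta posterior over a question's success probability.

    Uniform prior (prior_a = prior_b = 1) by default; returns the
    posterior mean mu and posterior standard deviation sigma."""
    a = prior_a + successes           # prior pseudo-count + observed successes
    b = prior_b + (n - successes)     # prior pseudo-count + observed failures
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

# 7 correct attempts out of N = 10 on one question
mu, sigma = bayes_at_n(successes=7, n=10)
```

Averaging the per-question means recovers the avg@N ordering, while σ supplies the uncertainty that Pass@k lacks.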

Convergence and Rank Stability

Convergence@n

Probability-mass and rank-trace plots compare Bayes@N with pass@2, pass@4, and pass@8 across AIME'24, AIME'25, HMMT'25, and BrUMO'25. Ranking traces show fast convergence on BrUMO'25 and non-convergence on AIME'25.

Benchmarks: rankings with fewer trials

Four benchmark plots show Kendall τ rising faster for Bayes@N than for Pass@k variants across AIME'24, AIME'25, HMMT'25, and BrUMO'25: Bayes@N reaches high agreement with the gold-standard ranking quickly, while Pass@k variants require substantially more trials.
  • On AIME'24, AIME'25, HMMT'25, and BrUMO'25, Bayes@N reaches τ > 0.90 by N = 10 and tracks the gold-standard ranking much faster than Pass@k.
  • AIME'25 remains partially unresolved even at N = 80, showing why interval-aware reporting is necessary.
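Agreement with the gold-standard ranking can be tracked with Kendall's τ, which counts concordant versus discordant model pairs. A pure-Python sketch (names and example rankings are illustrative):

```python
from itertools import combinations

def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Kendall rank correlation between two rankings of the same models.

    rank_a, rank_b map model name -> rank position (lower is better)."""
    models = list(rank_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        s = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if s > 0:
            concordant += 1      # pair ordered the same way in both rankings
        elif s < 0:
            discordant += 1      # pair ordered oppositely
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs

gold = {"A": 1, "B": 2, "C": 3, "D": 4}
est  = {"A": 1, "B": 3, "C": 2, "D": 4}   # one adjacent pair swapped
tau = kendall_tau(gold, est)
```

Computing τ between the Bayes@N ranking at trial n and the gold-standard ranking gives the convergence curves shown in the plots.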

Bayesian estimator and the decision rule

Posterior on question α:

$$\pi_{\alpha} \mid R_{\alpha} \sim \mathrm{Dir}\left(n^{0}_{\alpha 0} + n_{\alpha 0}, \ldots, n^{0}_{\alpha C} + n_{\alpha C}\right)$$

Weighted rubric metric:

$$\bar{\pi} = \frac{1}{M} \sum_{\alpha} \sum_{k} w_k \pi_{\alpha k}$$

  • Output both a posterior mean \(\mu(R)\) and exact uncertainty \(\sigma(R)\).
  • Binary evaluation is a special case; categorical rubrics use the same machinery.
  • Under a uniform prior in the binary case, Bayes@N has the same ordering as avg@N / Pass@1, but adds principled uncertainty.
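The rubric-weighted metric above can be computed directly from the posterior mean of each question's Dirichlet. A sketch with illustrative category weights (wrong / partial / correct); the function name and example counts are ours:

```python
def rubric_bayes_at_n(counts, weights, prior: float = 1.0) -> float:
    """Posterior-mean rubric score averaged over M questions.

    counts:  per-question category tallies [n_0, ..., n_C]
    weights: credit w_k per category, e.g. 0 wrong, 0.5 partial, 1 correct
    prior:   symmetric Dirichlet pseudo-count per category."""
    total = 0.0
    for n in counts:
        post = [prior + c for c in n]        # Dirichlet posterior parameters
        s = sum(post)
        pi_mean = [a / s for a in post]      # posterior mean of pi_{alpha k}
        total += sum(w * p for w, p in zip(weights, pi_mean))
    return total / len(counts)

# two questions, N = 10 attempts each, categories: wrong / partial / correct
score = rubric_bayes_at_n([[2, 1, 7], [5, 3, 2]], weights=[0.0, 0.5, 1.0])
```

Swapping the weight vector re-scores the same posterior, so different evaluation goals reuse one set of samples.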
Illustrative Bayesian credible-interval plots showing how overlap determines whether a ranking is resolved.

Practical rule
If credible intervals overlap, do not call a winner. For pairwise orderings, a one-sided threshold of z = 1.645 corresponds to 95% confidence in the ordering.
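The practical rule amounts to a z-test on the gap between posterior means. A sketch, with the z = 1.645 threshold taken from the text and function names of our choosing:

```python
from math import sqrt

def resolved(mu_a: float, sigma_a: float,
             mu_b: float, sigma_b: float, z: float = 1.645) -> bool:
    """Declare model A ahead of model B only if the gap between posterior
    means exceeds z combined posterior standard deviations."""
    gap = mu_a - mu_b
    return gap / sqrt(sigma_a ** 2 + sigma_b ** 2) > z

# overlapping intervals: do not call a winner, extend N instead
resolved(0.62, 0.05, 0.58, 0.05)
```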

Recommended protocol: sample N attempts, score each attempt with a binary or rubric-aware labeler, report Bayes@N with credible intervals, then extend N only if needed.

As the leaderboard grows, interval-aware evaluation matters more

Heatmaps showing convergence worsening as the number of compared models increases from 5 to 15.
Scaling model comparisons makes interval-aware stopping rules essential: strict point-estimate rankings become increasingly brittle as the leaderboard widens.
  • Mean convergence@n worsens as the number of compared models increases from 5 to 15.
  • Without credible intervals, many subsets fail to stabilize within 80 trials; scaling to 20 models makes the problem severe on all four datasets.
  • Posterior intervals provide a transparent stopping rule when strict point-estimate rankings are not yet trustworthy.
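Such a stopping rule can be applied leaderboard-wide: keep sampling until every adjacent pair on the mean-sorted ranking is separated. A sketch under the same z = 1.645 threshold (data structures are ours):

```python
from math import sqrt

def ranking_resolved(models: dict, z: float = 1.645) -> bool:
    """True when every adjacent pair on the mean-sorted leaderboard is
    separated by more than z combined posterior standard deviations.

    models: name -> (posterior mean, posterior std)."""
    ordered = sorted(models.items(), key=lambda kv: -kv[1][0])
    for (_, (mu_a, s_a)), (_, (mu_b, s_b)) in zip(ordered, ordered[1:]):
        if (mu_a - mu_b) / sqrt(s_a ** 2 + s_b ** 2) <= z:
            return False    # intervals overlap: keep sampling
    return True

ranking_resolved({"A": (0.80, 0.02), "B": (0.60, 0.02), "C": (0.40, 0.02)})
```

When this check fails, the transparent remedy is to extend N rather than report a point-estimate ranking.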

Beyond binary correctness: rubric-aware Bayes@N

Categorical ranking plot showing how model positions change under different rubric schemas.
Ranking across categorical schemas remains interpretable because posterior means and uncertainty stay on the same footing.
  • Treat outcomes as categorical: correct, partial credit, format errors, refusals, verifier signals, efficiency penalties, and more.
  • Different weight vectors encode different evaluation goals while keeping posterior uncertainty explicit.
  • Across schemas, Qwen3-Thinking remains first; rubric choice mainly reshuffles the middle of the pack.

Takeaway: replace Pass@k with Bayes@N + credible intervals for stable, uncertainty-aware ranking, and use categorical scoring beyond 0/1 correctness.

Acknowledgment: This research was supported in part by NSF awards 2117439, 2112606, & 2320952.