Don’t Pass@𝑘: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri; Amirhossein Samandar; Michael Hinczewski; Vipin Chaudhary

Abstract

Pass@ $k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass@ $k$ and average accuracy over $N$ trials (avg@ $N$ ) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@ $1$ ), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass@ $k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass@ $k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at GitHub .

Introduction

Large language models (LLMs) have moved rapidly from research artifacts to everyday infrastructure [1, 2]. Students use them for homework and exam preparation; developers rely on them for code synthesis and refactoring [3]; analysts and clinicians use them for decision support; and agents built atop LLMs are increasingly embedded in workflows across industry and government. This demand has catalyzed unprecedented investment: specialized chips, datacenters, and startups dedicated to LLM training, serving, and tooling [4]. As deployment accelerates, trust, oversight, and comparability become central: how we evaluate LLMs directly shapes which models are adopted, what progress is declared, and how resources are allocated [5, 6, 7, 8, 9, 10, 11].

Evaluation, however, remains the weakest link in the LLM pipeline. Alongside advances in model efficiency and compression [12, 13, 14, 15, 16, 17, 18, 19], training and fine-tuning methods such as parameter-efficient fine-tuning (PEFT), low-rank adaptation (LoRA), and reinforcement learning from human feedback (RLHF) [20, 21, 11], and inference/decoding (sampling strategies, caching, efficient attention) [22, 23], the community still leans on simple, yet flawed, success rates and Pass@ $k$ -style metrics to summarize capabilities [24]. These practices are convenient but fragile. On small or costly benchmarks (e.g., math reasoning sets with only tens of problems such as AIME) [25, 26], Pass@ $k$ and single-run accuracy often produce unstable rankings [27], are sensitive to decoding choices and seed effects [22, 28], and provide little guidance on whether observed gaps are meaningful or mere noise [29, 30]. Averaging across multiple runs ("avg@ $N$ ") helps but is compute-hungry [31], offers no unified way to handle graded/rubric outcomes, and lacks a principled decision rule for significance [29, 32, 33].

This paper takes a different approach: we treat evaluation itself as a statistical inference problem. We introduce a posterior-based framework that replaces Pass@ $k$ and avg@ $N$ with estimates of a model’s underlying success probabilities and associated uncertainty [34]. Outcomes are modeled as categorical [35] rather than purely binary: each item can yield correct, partially correct, formatting-error, refusal, or rubric-defined levels. A Dirichlet prior over these categories yields closed-form posterior means and credible intervals for any weighted rubric, allowing the evaluator to report both a point estimate and principled uncertainty with negligible overhead. In the binary special case under a uniform prior, its posterior mean is order-equivalent to average accuracy, explaining the empirical robustness of avg@ $N$ while making uncertainty explicit.

The framework addresses four persistent pain points. (1) Convergence: as shown in Figure 1, we ideally want methods that converge to the true underlying ranking with the smallest number of trials, but different approaches can have significantly different convergence speeds. (2) Credible intervals: a simple, transparent rule (do not declare a winner when intervals overlap) reduces leaderboard churn and over-interpretation of tiny gaps by introducing a compute-efficient credible interval (CI). Updates are analytic; one can monitor interval widths online and allocate additional trials only when needed (no Monte Carlo or bootstrap simulations are required for CI estimation). (3) Categorical evaluation: our approach unifies binary and non-binary evaluation. Graded rubrics are natural in this framework, so one can evaluate step-by-step reasoning, partial credit, or judge categories without ad hoc aggregation. (4) Prior information: we can incorporate prior evidence when appropriate (e.g., reuse of stable rubric distributions across closely related tasks or versions).

We validate the approach in two settings: In controlled simulations with known ground-truth success rates, the posterior procedure converges to correct rankings with fewer samples than Pass@ $k$ and recent variants, and it flags when ties are statistically unresolved. On real math-reasoning benchmarks (AIME'24/'25 [25, 26], HMMT'25 [36], and BrUMO'25 [37]-derived sets), we observe the same pattern: the posterior method achieves greater rank stability at far smaller sample counts than Pass@ $k$ , while clarifying when differences are meaningful versus noise. Practically, this yields a computationally efficient protocol that is easy to implement and audit.

We summarize our contributions as follows:

A unified Bayesian evaluation framework. We model per-item outcomes as categorical with a Dirichlet prior, yielding closed-form posterior means and credible intervals for any weighted rubric, with binary evaluation as a special case. This unifies 0/1 and graded evaluations and supports reuse of prior evidence when justified.
A compute-efficient, interval-aware protocol. We provide a simple recipe: report posterior means with credible intervals; only declare differences when intervals do not overlap; adaptively allocate additional samples until intervals meet pre-specified widths. This protocol naturally supports sequential/online evaluation.
Empirical evidence on simulations and math benchmarks. On synthetic data with known ground truth and on AIME’24/’25, HMMT'25, and BrUMO'25 datasets, our method achieves faster convergence and greater rank stability than Pass@ $k$ and recent variants, enabling reliable comparisons with far fewer samples.

Bayesian Framework for Evaluating LLM Performance

Background: The Pass@ $k$ Metric and Its Limitations

Evaluation metrics for LLMs aim to quantify performance on tasks like reasoning or programming, but they often struggle to provide reliable relative rankings across models. Pass@ $k$ , for instance, estimates the probability of at least one correct answer within $k$ model attempts (see Appendix 15 for details). While convenient, this metric exhibits high variance [38], particularly when $k$ approaches the total number of trials, $N$ , resulting in unstable rankings [24]. Small fluctuations in correctness can distort comparisons, particularly in benchmarks with few problems or limited computational resources, raising doubts about its suitability for differentiating model capabilities. If a metric cannot consistently distinguish stronger models from weaker ones, its value as a benchmarking tool is undermined [27].
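To make the variance issue concrete, the following minimal sketch computes the widely used unbiased per-question estimator $1 - \binom{n - c}{k} / \binom{n}{k}$ for $n$ trials with $c$ correct answers. It assumes this is the estimator the appendix refers to; the function names `pass_at_k` and `mean_pass_at_k` are illustrative and not taken from the paper's code.

```python
from math import comb


def pass_at_k(num_trials: int, num_correct: int, k: int) -> float:
    """Unbiased Pass@k estimate for one question: 1 - C(n - c, k) / C(n, k)."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_correct <= num_trials:
        raise ValueError("num_correct must lie in [0, num_trials]")
    # math.comb returns 0 when fewer than k incorrect answers exist,
    # so the estimate becomes exactly 1 in that case.
    return 1.0 - comb(num_trials - num_correct, k) / comb(num_trials, k)


def mean_pass_at_k(correct_counts, num_trials: int, k: int) -> float:
    """Average the per-question estimates over a test set."""
    return sum(pass_at_k(num_trials, c, k) for c in correct_counts) / len(correct_counts)


# With k equal to the number of trials, a single lucky success saturates the score.
few = pass_at_k(num_trials=4, num_correct=1, k=2)
all_k = pass_at_k(num_trials=4, num_correct=1, k=4)
```

When $k = N$ the estimate for a question can only be 0 or 1, which illustrates why rankings built on it fluctuate strongly at small $N$.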

Estimating uncertainty in Pass@ $k$ scores is also challenging, as it lacks closed-form expressions for variance, relying instead on computationally intensive approximations like bootstrapping. A truly effective metric should yield reliable performance rankings with a minimal number of trials, prioritizing both accuracy and efficiency in resource-constrained environments. To address these limitations, we propose a Bayesian evaluation framework that provides more stable estimates of performance, incorporates uncertainty, and facilitates robust relative comparisons across models [34, 39, 40].

Results Matrix

Consider a results matrix $R$ for an LLM evaluated on a test set comprising $M$ questions. Due to the stochastic nature of LLM sampling, responses may vary across independent trials, so we run the LLM $N$ times per question. The outcomes are captured in the $M \times N$ matrix $R$ , where element $R_{α i}$ represents the score in the $i$ th trial for the $α$ th question. This score is an integer ranging from $0$ to a maximum value $C$ , reflecting a rating system with $C + 1$ categories. In the binary case ( $C = 1$ ), 0 indicates an incorrect answer and 1 a correct one, though we accommodate more nuanced rubrics generally.

Weighted Performance Metric

For the $α$ th question, $α = 1, \dots, M$ , there is an underlying probability $π_{α k}$ that the LLM's answer falls in the $k$ th category. We denote $π_{α}$ as the $(C + 1)$ -dimensional vector with elements $π_{α k}$ , $k = 0, \dots, C$ . If all $π_{α}$ were known, we could calculate a desired performance metric $\overset{π}{ˉ}$ as a weighted average over these probabilities:

$$\overset{π}{ˉ} = \frac{1}{M} \sum_{α = 1}^{M} w \cdot π_{α} = \frac{1}{M} \sum_{α = 1}^{M} \sum_{k = 0}^{C} w_{k} π_{α k}, \tag{1}$$

where $w$ is a $(C + 1)$ -dimensional vector of constant weights. For example, if $w_{k} = k$ , then $\overset{π}{ˉ}$ represents the average category label. In the case where $C = 1$ , this average corresponds to the mean probability of a correct answer over the entire test set. However, we allow for a general choice of $w$ to accommodate a wide range of possible metrics.
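As a small illustration of the weighted metric, the sketch below evaluates it for a hypothetical two-question test set with three categories. The probability values and the helper name `weighted_metric` are invented for the example.

```python
def weighted_metric(pi, w):
    """Average of w . pi_alpha over questions.

    pi: list of M rows, each holding C + 1 category probabilities.
    w:  list of C + 1 constant weights.
    """
    total = 0.0
    for row in pi:
        if len(row) != len(w):
            raise ValueError("each probability row must have one entry per weight")
        total += sum(wk * pk for wk, pk in zip(w, row))
    return total / len(pi)


# Hypothetical probabilities for M = 2 questions and C + 1 = 3 categories.
pi = [[0.5, 0.3, 0.2],
      [0.1, 0.1, 0.8]]

average_label = weighted_metric(pi, [0, 1, 2])       # w_k = k: average category label
partial_credit = weighted_metric(pi, [0, 0.5, 1.0])  # half credit for the middle category
```

Changing only `w` switches between metrics without touching the underlying probabilities, which is the flexibility the general weight vector is meant to provide.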

Bayesian Estimator and Uncertainty for the Performance Metric

In principle, we could estimate $π_{α}$ by running an arbitrarily large number of trials with the LLM, yielding an accurate estimate of $\overset{π}{ˉ}$ . However, we are typically constrained to small $N$ due to limited computational resources. Our goal is to develop a Bayesian approach to estimate $\overset{π}{ˉ}$ and its associated uncertainty given a finite $N$ . The first step is to construct $P (π_{α} ∣ R_{α})$ , the posterior probability of $π_{α}$ given the $α$ th row of the matrix $R$ , denoted $R_{α}$ . This posterior depends on the data in $R_{α}$ and a chosen prior distribution $P (π_{α})$ for the unknown underlying probability vector $π_{α}$ . The prior could be uniform (assuming no prior information) or incorporate previously gathered evidence about the LLM's performance. The Bayesian framework focuses on two quantities: the first is the mean of $\overset{π}{ˉ}$ over the joint posterior for all questions, which we denote as $μ (R)$ . This is a Bayesian optimal estimator, minimizing the quadratic loss function $L (\overset{π}{ˉ}^{est}) = E_{R, π_{α}} (\overset{π}{ˉ}^{est} (R) - \overset{π}{ˉ})^{2}$ over all possible estimators $\overset{π}{ˉ}^{est} (R)$ , where the expectation value is over all possible $π_{α}$ and realizations of $R$ [41]. The second quantity is the variance $σ^{2} (R)$ , which quantifies the uncertainty of the $μ$ estimate. Both $μ (R)$ and $σ^{2} (R)$ have exact closed-form expressions, derived in Appendix 10, and can be simply calculated for any $R$ using Algorithm 1.

Algorithm 1. LLM performance evaluation using the Bayes@ $N$ framework.

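function EvaluatePerformance( $R$ , [ $R^{0}$ ], $w$ )
  input: $M \times N$ matrix $R$ of results, with each element $R_{α i} \in \{0, \dots, C\}$
  input: $(C + 1)$ -dimensional weight vector $w = (w_{0}, \dots, w_{C})$ defining the performance metric $\overset{π}{ˉ}$
  optional input: $M \times D$ matrix $R^{0}$ of results for the prior; otherwise $D = 0$
  output: performance metric estimate $μ$ and associated uncertainty $σ$
  $T = 1 + C + D + N$
  for $α = 1$ to $M$ do  // tally results in $R$ and $R^{0}$
    for $k = 0$ to $C$ do
      $n_{α k} = \sum_{i = 1}^{N} δ_{k, R_{α i}}$
      $n^{0}_{α k} = 1 + \sum_{i = 1}^{D} δ_{k, R^{0}_{α i}}$
      $ν_{α k} = n^{0}_{α k} + n_{α k}$
    end for
  end for
  $μ = w_{0} + \frac{1}{M T} \sum_{α = 1}^{M} \sum_{j = 0}^{C} ν_{α j} (w_{j} - w_{0})$
  $σ = \left[ \frac{1}{M^{2} (T + 1)} \sum_{α = 1}^{M} \left\{ \sum_{j = 0}^{C} \frac{ν_{α j}}{T} (w_{j} - w_{0})^{2} - \left( \sum_{j = 0}^{C} \frac{ν_{α j}}{T} (w_{j} - w_{0}) \right)^{2} \right\} \right]^{1 / 2}$
  return $μ$ , $σ$
end function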
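A minimal Python sketch of Algorithm 1 follows. It is not the authors' released code: the function name `evaluate_performance` is illustrative, and the uncertainty is computed from the standard variance of a linear function of a Dirichlet-distributed vector, assuming independent posteriors across questions, which is taken here to be what the closed-form expression amounts to.

```python
from math import sqrt


def evaluate_performance(R, w, R0=None):
    """Return (mu, sigma) for an M x N results matrix R with scores in 0..C.

    w  : list of C + 1 weights defining the metric.
    R0 : optional M x D matrix of prior results; omitted means a uniform prior (D = 0).
    """
    C = len(w) - 1
    M = len(R)
    N = len(R[0])
    D = len(R0[0]) if R0 else 0
    T = 1 + C + D + N  # total Dirichlet concentration per question

    mean_sum = 0.0
    var_sum = 0.0
    for a in range(M):
        # Start from one pseudo-count per category (uniform Dirichlet prior),
        # then add counts from the prior matrix and from the observed trials.
        nu = [1.0] * (C + 1)
        if R0:
            for score in R0[a]:
                nu[score] += 1
        for score in R[a]:
            nu[score] += 1
        first = sum(nu[j] / T * (w[j] - w[0]) for j in range(C + 1))
        second = sum(nu[j] / T * (w[j] - w[0]) ** 2 for j in range(C + 1))
        mean_sum += first
        var_sum += max(second - first ** 2, 0.0)  # guard against float round-off

    mu = w[0] + mean_sum / M
    sigma = sqrt(var_sum / (M * M * (T + 1)))
    return mu, sigma


# Binary example: two questions, four trials each.
R = [[1, 1, 0, 1],
     [0, 0, 0, 1]]
mu, sigma = evaluate_performance(R, w=[0, 1])
```

In the binary case with a uniform prior, each question's posterior is a Beta distribution, so the result can be cross-checked against the Beta mean and variance by hand.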

Using Uncertainty Estimates to Decide Significance of Performance Differences

In general, the expressions for $μ (R)$ and $σ^{2} (R)$ are valid for any $M$ and $N$ , and do not rely on asymptotic arguments like the central limit theorem (CLT). However, useful simplifications occur in specific limiting cases. For example, as the size of the test set $M$ becomes large, we can derive not just the moments of the posterior distribution for $\overset{π}{ˉ}$ , but also its shape, which becomes approximately Gaussian: $P (\overset{π}{ˉ} ∣ R) \sim \mathcal{N} (μ (R), σ^{2} (R))$ . This allows us to assess whether two methods exhibit a statistically significant performance difference. Consider results matrices $R$ and $R^{'}$ from two approaches, with corresponding means $μ$ , $μ^{'}$ and standard deviations $σ$ , $σ^{'}$ . The distribution of the performance difference $Δ \overset{π}{ˉ} \equiv \overset{π}{ˉ} - \overset{π}{ˉ}^{'}$ is a convolution of the individual posteriors, yielding another normal distribution: $P (Δ \overset{π}{ˉ} ∣ R, R^{'}) \sim \mathcal{N} (\tilde{μ}, \tilde{σ}^{2})$ , where the mean of the difference is $\tilde{μ} = μ - μ^{'}$ , and the standard deviation is $\tilde{σ} = \sqrt{σ^{2} + (σ^{'})^{2}}$ . To determine our confidence in the ranking of the two methods, we need the probability that $sign (Δ \overset{π}{ˉ}) = sign (μ - μ^{'})$ . This can be obtained from the absolute $z$ -score, $z = ∣ μ - μ^{'} ∣ / \sqrt{σ^{2} + (σ^{'})^{2}}$ . The probability that the ranking based on $μ$ and $μ^{'}$ is correct (the ranking confidence $ρ$ ) is given by $ρ = (1/2) (1 + erf (z / \sqrt{2}))$ . For example, $z = 1.645$ corresponds to $ρ = 0.95$ .
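The pairwise comparison can be sketched in a few lines of Python, under the Gaussian approximation described above. The ranking confidence is the standard normal cumulative distribution evaluated at the absolute $z$ -score; the function name `ranking_confidence` is illustrative.

```python
from math import erf, sqrt


def ranking_confidence(mu_a, sigma_a, mu_b, sigma_b):
    """Return (z, rho): absolute z-score and probability that the observed order is correct."""
    spread = sqrt(sigma_a ** 2 + sigma_b ** 2)  # standard deviation of the difference
    if spread == 0.0:
        raise ValueError("at least one standard deviation must be positive")
    z = abs(mu_a - mu_b) / spread
    rho = 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z
    return z, rho


# Sanity check against the value quoted in the text: z = 1.645 gives rho of about 0.95.
z, rho = ranking_confidence(1.645, 0.6, 0.0, 0.8)
```

Because only the two means and standard deviations enter, the check costs nothing beyond what Algorithm 1 already returns for each model.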

Equivalence of Bayesian and Average Rankings for Uniform Prior

In the results below, we will denote ranking based on the Bayesian estimator $μ$ with a uniform prior as Bayes@ $N$ . Because $μ$ is related to a naive weighted average accuracy via a positive affine transformation, it turns out the ranking based on the average, denoted as avg@ $N$ , is identical to Bayes@ $N$ (for the detailed proof, see Appendix 11). In the large-trial limit $N \to \infty$ , the value of $μ$ approaches the average, as expected, but the ranking equivalence holds at all finite $N$ . This relationship also extends to uncertainty quantification, where the standard deviation of the average relates to the Bayesian standard deviation $σ$ by a scaling factor, providing a concrete method to compute uncertainty in the average without relying on the Central Limit Theorem. This is particularly advantageous in small-sample regimes common in LLM evaluations, where CLT-based methods often underestimate uncertainty and produce invalid intervals (e.g., extending beyond [0,1] or collapsing to zero) [42]. As highlighted by [42], Bayesian approaches with uniform priors (e.g., Beta(1,1) in the binary case) yield well-calibrated credible intervals even for datasets with fewer than a few hundred datapoints, outperforming CLT approximations in coverage and handling complex structures like clustered data.
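For intuition, the affine relation can be sketched directly from the expression for $μ$ in Algorithm 1 with a uniform prior ( $D = 0$ ). Here $a$ is shorthand introduced only for this sketch, denoting the weighted average avg@ $N$ , and $n_{α j}$ is the number of trials of question $α$ that fall in category $j$ , so that $\sum_{j} n_{α j} = N$ . The full proof remains the one in the appendix.

$$
\begin{aligned}
a &\equiv \frac{1}{M N} \sum_{α = 1}^{M} \sum_{j = 0}^{C} w_{j} n_{α j}, \qquad ν_{α j} = 1 + n_{α j}, \qquad T = 1 + C + N, \\
μ &= w_{0} + \frac{1}{M T} \sum_{α = 1}^{M} \sum_{j = 0}^{C} (1 + n_{α j}) (w_{j} - w_{0}) = \frac{N}{T} a + \frac{1}{T} \sum_{j = 0}^{C} w_{j}.
\end{aligned}
$$

The slope $N / T$ is positive and the intercept is the same for every model evaluated with the same $N$ and $w$ , so sorting by $μ$ or by $a$ gives the same order. In the binary case ( $C = 1$ , $w = (0, 1)$ ) this reduces to $μ = (N a + 1) / (N + 2)$ , and $μ \to a$ as $N \to \infty$ .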

Gold Standard for Ranking

Strictly speaking, the underlying true ranking of LLMs for a particular performance metric $\overset{π}{ˉ}$ is unknown, because it would require determining the infinite trial limit, $\overset{π}{ˉ} = lim_{N \to \infty} μ$ , for each LLM. In practice, we have to settle for an approximation to $\overset{π}{ˉ}$ , calculated at some large but finite value $N = N_{max}$ (for example $N_{max} = 80$ in our LLM experiments). Specifically, we use Bayes@ $N_{max}$ - which is the same as the ranking based on avg@ $N_{max}$ - as our "gold standard" or reference ranking [43]. In other words, rankings using smaller $N$ will be compared to this gold standard to assess their accuracy.

For this comparison, we employ Kendall's $τ$ , a nonparametric correlation coefficient that measures ordinal agreement between two rankings by comparing the number of concordant and discordant pairs of models. The coefficient ranges from $- 1$ (perfect inversion) to $+ 1$ (perfect agreement), with $0$ indicating no association. We specifically use the $τ_{b}$ variant, which properly accounts for ties in the rankings (e.g., the intentional tie in our simulation below), ensuring that equivalences do not artificially inflate the correlation. See Appendix 16.1.1 for further discussion and formal definitions.
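A dependency-free sketch of $τ_{b}$ is shown below to make the tie handling explicit. It is a straightforward $O(n^{2})$ pair count written for illustration, not the implementation used in the experiments; SciPy's `scipy.stats.kendalltau` computes the same variant by default.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score (or rank) lists."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to neither factor
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# A tie in one list lowers the coefficient below 1 even with no discordant pairs.
tau = kendall_tau_b([1, 2, 3, 3], [1, 2, 3, 4])
```

The denominator shrinks only for pairs tied in both lists, so a reference ranking with an intentional tie does not reward or penalize a method that reproduces that tie.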

To validate our claims about the gold standard as Bayes@ $N_{max}$ , specifically to determine which evaluation methods converge to the true ranking, we conduct a simulation using biased coins as a metaphor for LLMs. In this setup, we already know the underlying performance distribution (the success probabilities $π_{α}$ for each question), allowing us to establish a known ground truth $\overset{π}{ˉ}$ . We generate $11$ sets of these $30$ probabilities, with $\overset{π}{ˉ}$ values of $[0.2332, 0.2545, 0.3604, 0.3642, 0.3642, 0.4466, 0.5418, 0.5276, 0.608, 0.6213, 0.7327]$ , representing different LLMs (note the tie at $0.3642$ to test handling of equivalent performances). We run experiments for $M = 30$ questions, where each LLM "answers" all the questions in each trial according to its success probabilities $π_{α 1}$ . Panel (a) of Figure 2 shows results without bootstrapping: we generate 1000 independent $R$ matrices, each with $80$ trials; for each step in the number of trials (from $N = 1$ to $80$ ), we compute scores using Pass@ $k$ ( $k = 2$ , $k = 4$ , and $k = 8$ with an unbiased estimator Equation 21), Bayes@ $N$ , Pass^ $k$ (Equation 22), G-Pass@ $k_{\tilde{τ}}$ (Equation 23 with $\tilde{τ} = 0.5$ ), and mG-Pass@ $k$ (Equation 24), then derive rankings and compare them to the gold standard using Kendall's $τ$ as a measure of rank correlation (where $τ = 1$ indicates perfect alignment with the gold standard), and report the average $τ$ over the $1000$ $R$ matrices. Note that we do not explicitly show average accuracy avg@ $N$ because it is equivalent to Bayes@ $N$ , as discussed in Section 2.6. In practice, we are computationally limited to a small number of trials per question. To examine what happens with only $N = 80$ trials, we apply two methods of bootstrapping with replacement to the $R$ matrix, allowing us to estimate how results differ from the ideal case with a large number of independent $R$ matrices (panel a). 
For both methods, we generate $10{,}000$ bootstrap replicates for each of the $N = 1$ to $80$ trials, derived from a single $R$ matrix. Panels (b) and (c) of Figure 2 illustrate this using two bootstrapping schemes. In the first scheme (panel b, column-wise bootstrapping), we resample trial indices; in the second (panel c, row-wise bootstrapping), we resample answers independently for each question. In both cases, the resulting bootstrap replicates are used to recompute evaluation scores, rankings, and $τ$ values, which are then averaged to produce smoothed convergence curves. The two bootstrapping approaches yield nearly identical behavior, and both closely match the baseline in panel (a). This demonstrates that the $τ$ convergence behavior is robust and not sensitive to the ordering of answers in either the rows or columns of $R$ . Bootstrapping is unnecessary in our LLM-mimic simulations, since we can easily generate an arbitrarily large number of $R$ matrices, but in actual LLM experiments we have limited trial data, and these results show that bootstrapping provides a viable way of estimating statistical properties like convergence.
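The two resampling schemes differ only in whether the drawn trial indices are shared across questions. The sketch below shows one way to implement them with the standard library; the function names are illustrative, and the evaluation scores and rankings would be recomputed on each returned replicate.

```python
import random


def bootstrap_columns(R, n, rng):
    """Column-wise: draw n trial indices with replacement, shared by every question."""
    num_trials = len(R[0])
    idx = [rng.randrange(num_trials) for _ in range(n)]
    return [[row[i] for i in idx] for row in R]


def bootstrap_rows(R, n, rng):
    """Row-wise: draw n answers with replacement, independently for each question."""
    return [[rng.choice(row) for _ in range(n)] for row in R]


rng = random.Random(0)  # fixed seed so replicates are reproducible
R = [[0, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 1]]
replicate_cols = bootstrap_columns(R, n=6, rng=rng)
replicate_rows = bootstrap_rows(R, n=6, rng=rng)
```

Column-wise resampling keeps whole trials intact and so preserves any correlation between questions within a trial, whereas row-wise resampling treats each question's answers as an independent pool.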

As seen in Figure 2, Bayes@ $N$ begins with relatively high agreement with the gold standard and converges much faster to $τ = 1$ than Pass@ $k$ and its variants, which suffer from greater variance and bias at small $N$ . All methods eventually converge to the same ranking, but their rates of convergence differ substantially. This makes the convergence rate a crucial factor when choosing between different LLM evaluation methods.

Potential benefits of non-uniform priors

While the convergence results in Figure 2 demonstrate that Bayes@ $N$ with a uniform prior outperforms alternatives like Pass@ $k$ in ranking models, there are scenarios where non-uniform priors can achieve even faster convergence. This is the case when we have data from models that are related or closely correlated to the ones we are ultimately interested in ranking. Potential examples include: i) results from an older version of a model used as a prior for ranking a newer version; ii) a non-quantized version (where running trials is computationally expensive) used to provide prior data for a quantized version (where achieving large $N$ is cheaper); iii) a base model used to provide prior data for a fine-tuned one. Though a full exploration of these kinds of priors is left to future work, in this section we show the potential benefits through our synthetic biased-coin LLM models, introduced in Section 2.7.

We start with a set of eight "original" models with $C = 1$ , labeled by $i = 1, \dots, 8$ . Each model $i$ consists of a set of $M = 30$ success probabilities $π_{α 1}$ drawn from a distribution Beta $(i + 3, 12 - i)$ . We fix these probabilities for all the numerical experiments described below, and their averages for the eight models are: $\overset{π}{ˉ} =$ [0.3021, 0.3166, 0.4144, 0.4985, 0.5351, 0.5759, 0.6679, 0.7487]. Hence for the original models higher $i$ corresponds to higher overall accuracy. We now imagine an "update" of model $i$ that mimics some kind of revision, fine-tuning, or other modification. Because the performance of the updated model should be correlated with the original, we model the update as a stochastic perturbation to the Beta distribution from which success probabilities are drawn: for updated model $i$ the $π_{α 1}$ values are drawn from Beta $(i + 3 + σ, 12 - i + σ^{'})$ , where $σ = \pm 1$ and $σ^{'} = \pm 1$ are random integers of unit magnitude. For the updated models the value of $\overset{π}{ˉ}$ may not strictly increase with $i$ , so the ranking of models could be different than the original. Fig. 3(a) shows a histogram of the Kendall's $τ$ values comparing the original model set (described above) and 50k possible updated sets drawn using this stochastic procedure. A $τ$ value of 1 corresponds to exactly the same ranking, and we see that the mean $τ$ over the 50k realizations is 0.88. Hence there is some correlation between the original and updated rankings, but in the vast majority of cases (about 86% of the updates), the ranking has changed for the updated models.

The question we would like to ask is whether we can use the results from the original models as priors to help speed up convergence when ranking the updated models. To employ a non-uniform prior for a given model, we follow the procedure described in Appendix 10, and incorporate the prior via the $M \times D$ results matrix $R^{0}$ corresponding to $D$ trial results over $M$ questions using the original model. Combined with $N$ trial results from the updated model, we get the Bayes@ $N$ accuracy estimate $μ$ for the updated model. These estimates are then used to rank the 8 updated models. Because we know the $\overset{π}{ˉ}$ values for this set, we know the true ranking, and we can compare the estimated and true rankings via Kendall's $τ$ .

For each choice of $N$ and $D$ we run 50k replicates, with each replicate consisting of a set of stochastic updates of the original models. The mean $τ$ values over all these replicates are shown in Fig. 3(b) as a function of $N$ for several different $D$ . As expected, the $τ$ curves increase with $N$ , since the ranking becomes more certain with more trials, but the convergence properties vary. The dashed line is the case of a uniform prior ( $D = 0$ ), while the solid lines represent five different non-uniform prior scenarios, with $D = 1$ , 2, 4, 8, and 16. For small $N$ and small $D \leq 4$ we see a clear benefit of the non-uniform prior: already at $N = 1$ , the value of $τ$ starts higher than the uniform case, and remains so until the latter catches up for $N > 5$ . Thus when we have prior data available, we can extract more accurate rankings with just a small number of trials of the updated model, relative to the uniform case. However there is a possibility to over-emphasize the prior: when $D = 8$ and 16, the benefit for small $N$ turns into a disadvantage at larger $N$ . The $τ$ curves dip beneath the $D = 0$ result, indicating that the prior has impeded accurate ranking. Fig. 3(c) shows these trends more clearly by plotting $Δ τ$ , the difference between the $τ$ for each $D$ and the uniform $τ$ with $D = 0$ . So we see that priors have to be used judiciously, with a large enough $D$ to nudge the ranking in the correct direction, but not too large to outweigh the results from the updated models. One of the goals of our future work will be to establish practical guidelines for $D$ in different real-world use cases.

**Figure 3.** (a) Histogram of Kendall $τ$ values comparing the original ranking of synthetic LLM models and 50k replicates of updated models. (b) Mean Kendall $τ$ between the estimated and true ranking for the updated models (50k replicates) as a function of $N$ , the number of trials. The dashed line corresponds to estimates using Bayes@ $N$ with a uniform prior ( $D = 0$ ), while the solid lines are Bayes@ $N$ with a non-uniform prior and different choices of $D$ . The non-uniform prior is based on results from $D$ trials of the original models. (c) Same as panel (b), except showing the difference $Δ τ$ between the non-uniform prior curves and the uniform curve.

Ranking with Uncertainty

In Section 2.5, we described how uncertainty estimates from the Bayesian approach can be used to evaluate the relative performance of two models. Here, we extend these ideas to incorporate uncertainty into the ranking of multiple models. We do this via our biased-coin LLM mimics, which we denote LLM $_{β}$ for $β = 1, \dots, 11$ , described in the previous section. To incorporate a chosen credible interval in the ranking, we order their $μ$ values from highest to lowest, choose the appropriate $z$ threshold (for example $z = 1.645$ for 95% CI in the ranking), and assign two consecutive methods the same ranking if the absolute $z$ -score falls below this threshold.
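The tie-assignment rule can be sketched as follows. This is one reading of the procedure (dense ranks, comparing each model only with the one directly above it), and the model names and numbers in the example are made up.

```python
from math import sqrt


def rank_with_uncertainty(estimates, z_threshold=1.645):
    """Map each model name to a rank (1 = best), sharing ranks when the gap is not significant.

    estimates: dict of name -> (mu, sigma) as returned by the Bayesian estimator.
    """
    ordered = sorted(estimates.items(), key=lambda item: item[1][0], reverse=True)
    ranks = {}
    rank = 0
    previous = None
    for name, (mu, sigma) in ordered:
        if previous is None:
            rank = 1
        else:
            gap = abs(previous[0] - mu)
            spread = sqrt(previous[1] ** 2 + sigma ** 2)
            if spread > 0.0:
                z = gap / spread
            else:
                z = float("inf") if gap > 0.0 else 0.0
            if z >= z_threshold:
                rank += 1  # significant gap: open a new rank
        ranks[name] = rank
        previous = (mu, sigma)
    return ranks


# Hypothetical estimates: B and C are too close to separate at the chosen threshold.
ranks = rank_with_uncertainty({
    "A": (0.73, 0.01),
    "B": (0.62, 0.02),
    "C": (0.61, 0.02),
    "D": (0.45, 0.02),
})
```

Because only consecutive pairs are compared, a long chain of small gaps can place the first and last model of the chain in the same rank even if those two would be distinguishable when compared directly.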

The first row of Table 1 shows the underlying gold standard ranking for all the LLM mimics, since in this case we know the true $\overset{π}{ˉ}$ values. Note the tie between LLM $_{4}$ and LLM $_{5}$ , because their $\overset{π}{ˉ} = 0.3642$ is the same. The second row shows the Bayes@ $80$ ranking without a credible interval (CI) and the third row shows Bayes@ $80$ incorporating the $95\%$ CI. The Bayes@ $80$ ranking without CI aligns with the gold standard, except for two differences: the order of LLM $_{10}$ and LLM $_{9}$ is swapped, and the tie between LLM $_{5}$ and LLM $_{4}$ is not captured, which is expected since this ranking relies solely on $μ$ estimates without accounting for the uncertainty $σ$ . In contrast, the third row, which incorporates the CI, reveals multiple ties across several models. Interestingly, LLM $_{10}$ and LLM $_{9}$ are now indistinguishable at the $95\%$ CI. Although $N = 80$ would be an atypically large number of trials for an actual LLM evaluation, it is insufficient to confidently distinguish the small performance difference ( $\overset{π}{ˉ} = 0.608$ vs. $0.6213$ ) between the two models.

Table 1. Comparison of biased-coin LLM mimic rankings based on the gold standard, Bayes@ $80$ without credible interval (CI), and Bayes@ $80$ with CI.

LLM mimic	LLM $_{11}$	LLM $_{10}$	LLM $_{9}$	LLM $_{8}$	LLM $_{7}$	LLM $_{6}$	LLM $_{5}$	LLM $_{4}$	LLM $_{3}$	LLM $_{2}$	LLM $_{1}$
Gold Standard	1	2	3	5	4	6	7	7	8	9	10
Bayes@ $80$ (w/o CI)	1	3	2	5	4	6	7	8	9	10	11
Bayes@ $80$ (w/ CI)	1	2	2	3	3	4	5	5	5	6	7

To quantify the trials needed to reliably separate models with closely matched performance, we simulated the probability of correctly ranking LLM $_{10}$ above LLM $_{9}$ as a function of the number of trials $N$ , shown in the left panel of Figure 4. At $N = 80$ , the probability of obtaining the correct ranking is $83.7\%$ . The right panel plots the absolute $z$ -score versus $N$ ; at $N = 80$ , $z \approx 1.14$ , corresponding to approximately $87\%$ confidence (though the plots exhibit some noise due to simulation variability). These values closely align with the empirical probabilities in the left panel.

We also determined the minimum sample size $N$ needed to achieve $z$ -scores of $1.645$ and $1.96$ , corresponding to ranking confidence levels of approximately $95\%$ and $97.5\%$ , respectively, for distinguishing between the two models. These thresholds occur at about $N = 199$ and $N = 285$ . At these values, the simulated probability of correctly ranking the models is $94.7\%$ and $96.9\%$ , respectively, which is consistent with expectations given the inherent noise in the results. These results underscore the computational cost of distinguishing models whose true performance metrics differ only slightly. In our biased-coin setup, the underlying success probabilities were $\overset{π}{ˉ}_{9} = 0.608$ and $\overset{π}{ˉ}_{10} = 0.6213$ , yet reliably establishing this distinction requires nearly $200$ trials. Such large sample requirements highlight the importance of considering both uncertainty and convergence rates when interpreting ranking-based evaluations.

Experiments

In this section, we empirically validate our proposed evaluation methods using real-world datasets, focusing on ranking LLMs for mathematical reasoning tasks. We employ bootstrapping to compute the expected value of each evaluation score at a given $N$ . First, we present rankings of LLMs on the AIME'24, AIME'25, BrUMO'25, and HMMT'25 datasets without accounting for variance, based solely on evaluation scores (with ties occurring when scores are identical). Subsequently, we demonstrate how incorporating uncertainty in these scores can alter rankings across different datasets. Building on the discussion in Section 2.7, we adopt the ranking derived from avg@ $80$ (equivalently, Pass@ $1$ evaluated on the same 80 trials) or Bayes@ $80$ (uniform-prior Bayesian estimator) as our gold standard for comparing current LLMs, noting their equivalence in rankings (as proven in Section 2.6). For each $N$ from $1$ to $80$ (with Pass@ $k$ and similar methods starting from $N = k$ to avoid computation with insufficient samples), we compare the rankings produced by various evaluation methods against this gold standard, reporting the average Kendall's $τ$ over $10^{4}$ bootstrapped resamples to estimate the expected rank correlation at each step (assuming independence among questions and trials).

Convergence to Gold Standard

To assess the ability of different evaluation methods to compare the performance of different LLMs, we plot the average Kendall's $τ$ against the gold standard as a function of the number of trials $N$ in Figure 5, combining results from AIME'25 (panel a), AIME'24 (panel b), HMMT'25 (panel c), and BrUMO'25 (panel d). Across all datasets, the Bayes@ $N$ and avg@ $N$ curves overlap completely (so we only plot Bayes@ $N$ ) and demonstrate the fastest convergence to high $τ$ values, indicating robust alignment with the gold standard even in low-sample regimes. On all four datasets, Bayes@ $N$ reaches $τ > 0.90$ by $N = 10$ . On AIME'24, HMMT'25, and BrUMO'25 it then approaches $τ \approx 1$ at $N \approx 80$ ; AIME'25 is the exception, with the curve leveling off at $τ \approx 0.95$ at $N = 80$ .

In contrast, Pass@ $k$ variants ( $k = 2, 4, 8$ ) and related metrics (e.g., Pass^ $k$ , G-Pass@ $k_{\tilde{τ}}$ with $\tilde{τ} = 0.5$ , mG-Pass@ $k$ ) start with lower Kendall's $τ$ than Bayes@ $N$ and converge more slowly on all four datasets. At every $N$ , Bayes@ $N$ shows higher agreement with the gold standard. These findings align with our biased-coin simulations in Section 2.7, demonstrating that the Bayesian method best satisfies the gold-standard criteria (low uncertainty, minimal ties, and rapid convergence) across diverse mathematical reasoning benchmarks.

Rankings With Credible Intervals

Following the methodology of Section 2.7.2, we compare model rankings across four datasets (AIME'25, AIME'24, HMMT'25, and BrUMO'25) using Bayes@ $80$ as the gold standard (see Figure 5). Table 2 summarizes these comparisons by reporting, for each dataset, two versions of the ranking: the rank with a $95\%$ CI and the rank without CI. The "w/ CI" rank accounts for uncertainty in the Bayes@ $80$ scores and therefore allows models with overlapping CIs to share the same rank; the "w/o CI" rank is the strict ordering determined by the point estimates of Bayes@ $80$ for that dataset.

Table 2 indicates that point-estimate rankings diverge from those accounting for credible intervals. Qwen Qwen3-30B-A3B-Thinking-2507 and Qwen Qwen3-4B-Thinking-2507 consistently secure the top positions across all four datasets; specifically, the dominance of the 30B model is statistically distinguishable at the $95\%$ CI level in every case. Conversely, the relative ordering of the remaining models varies by dataset.

When incorporating $95\%$ CIs, we observe that while all four datasets exhibit five tied groups, the extent of ambiguity varies significantly. AIME'25 yields the fewest distinct ranks (up to 11), followed by AIME'24 (up to 13), and both HMMT'25 and BrUMO'25 (up to 14). This compression of ranks indicates greater uncertainty in the Bayes@ $80$ gold standard for AIME'25 (due to more extensive ties) compared to the others under our current trial budget. Intuitively, this higher uncertainty in AIME'25's gold-standard scores implies that additional trials would be required for that dataset to produce a statistically stable ranking; conversely, we can be more confident in the estimated gold standards for AIME'24, HMMT'25, and BrUMO'25 given the current number of trials. This distinction also explains why AIME'25 reaches a Kendall's $τ$ of $0.95$ at $N = 80$ , whereas the other three datasets converge to $\sim 1$ at the same sample size in Figure 5.
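To make the idea of shared ranks concrete, the following sketch shows one way overlapping intervals could be turned into a ranking with ties. It is only an illustration under stated assumptions: the exact tie rule is the one defined in Section 2.7.2, whereas the rule used here (a model joins the current group if its interval overlaps that of the group's top model), the normal-approximation interval $μ \pm 1.96 σ$ , the function name `ranks_with_ci`, and the numbers are all hypothetical.

```python
def ranks_with_ci(estimates, z=1.96):
    """Assign dense ranks; a model shares the current rank while its interval
    [mu - z*sigma, mu + z*sigma] overlaps the interval of the group's top model.

    estimates: dict mapping model name -> (posterior mean, posterior std).
    """
    ordered = sorted(estimates, key=lambda m: estimates[m][0], reverse=True)
    ranks = {}
    rank = 0
    leader_low = None  # lower interval bound of the current group's top model
    for model in ordered:
        mu, sigma = estimates[model]
        upper = mu + z * sigma
        if leader_low is None or upper < leader_low:
            rank += 1                      # no overlap: open a new rank group
            leader_low = mu - z * sigma
        ranks[model] = rank
    return ranks


# Hypothetical posterior summaries (mean, standard deviation).
example = {
    "model_a": (0.80, 0.01),
    "model_b": (0.70, 0.02),
    "model_c": (0.68, 0.02),
    "model_d": (0.50, 0.01),
}
example_ranks = ranks_with_ci(example)
```

In this toy input, `model_b` and `model_c` have overlapping intervals and share rank 2, which mirrors how the "w/ CI" columns compress the strict "w/o CI" ordering.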

Table 2. Rankings for four datasets. Models are listed in the order of their gold-standard ranking (Bayes@ $80$ point estimates, i.e., without uncertainty) for AIME'25. Each dataset column gives the rank with a $95\%$ CI (left) and the rank without CI (right).

Model	AIME'25 w/ CI	AIME'25 w/o CI	AIME'24 w/ CI	AIME'24 w/o CI	HMMT'25 w/ CI	HMMT'25 w/o CI	BrUMO'25 w/ CI	BrUMO'25 w/o CI
Qwen Qwen3-30B-A3B-Thinking-2507	1	1	1	1	1	1	1	1
Qwen Qwen3-4B-Thinking-2507	2	2	2	2	2	2	2	2
gpt-oss gpt-oss-20b-high	3	3	3	5	3	4	6	11
gpt-oss gpt-oss-20b-medium	3	4	3	3	2	3	7	12
Microsoft Phi-4-reasoning-plus	3	5	3	4	3	5	3	5
NVIDIA AceReason-Nemotron-1.1-7B	4	6	5	9	4	6	3	4
Microsoft Phi-4-reasoning	5	7	5	10	5	8	4	7
gpt-oss gpt-oss-20b-low	5	8	6	12	11	17	11	17
OpenThinker OpenThinker2-32B	5	9	4	8	5	7	2	3
Light-R1 Light-R1-14B-DS	5	10	4	6	6	11	4	8
FuseO1 FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B	5	11	4	7	6	9	3	6
NVIDIA NVIDIA-Nemotron-Nano-9B-v2	6	12	6	11	6	10	5	10
LIMO LIMO-v2	6	13	7	13	7	12	5	9
EXAONE EXAONE-4.0-1.2B	7	14	8	14	7	13	10	15
OpenR1 OpenR1-Distill-7B	7	15	9	15	10	16	8	13
OpenThinker OpenThinker3-1.5B	8	16	10	16	8	14	9	14
NVIDIA OpenReasoning-Nemotron-1.5B	8	17	11	17	9	15	10	16
DeepSeek DeepSeek-R1-Distill-Qwen-1.5B	9	18	12	19	12	18	13	19
Sky-T1-32B-Flash Sky-T1-32B-Flash	10	19	12	18	13	19	12	18
Bespoke-Stratos Bespoke-Stratos-7B	11	20	13	20	14	20	14	20

Convergence

In this section, we investigate the convergence of model rankings, building on the showcase figure (Figure 1). We define convergence@ $n$ as the smallest trial $n$ at which the ranking induced by the first $n$ trials matches the gold standard ranking from all 80 trials (without bootstrapping) and remains unchanged thereafter.
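The definition of convergence@ $n$ can be stated as a short procedure. The sketch below is illustrative only; the function name `convergence_at_n` and the representation of a ranking as an ordered list of model names are assumptions for the example.

```python
def convergence_at_n(rankings, gold):
    """Smallest n such that the ranking from the first n trials equals `gold`
    and every later ranking also equals `gold`; None if this never happens.

    rankings[i] is the ranking induced by the first i + 1 trials.
    """
    converged_at = None
    # Walk backwards from the last trial while the ranking still matches gold.
    for i in range(len(rankings) - 1, -1, -1):
        if list(rankings[i]) != list(gold):
            break
        converged_at = i + 1
    return converged_at


gold_ranking = ["A", "B", "C"]
trajectory = [
    ["B", "A", "C"],  # n = 1
    ["A", "B", "C"],  # n = 2 (matches, but does not persist)
    ["B", "A", "C"],  # n = 3
    ["A", "B", "C"],  # n = 4
    ["A", "B", "C"],  # n = 5
]
n_star = convergence_at_n(trajectory, gold_ranking)  # 4
```

Note that the match at $n = 2$ does not count, because the ranking changes again at $n = 3$ ; a trajectory whose final ranking differs from the gold standard returns `None`, which corresponds to non-convergence within the trial budget.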

Lower convergence@ $n$ values indicate that fewer trials are sufficient to achieve stable rankings. As detailed in the caption of Figure 1, the figure displays the probability mass functions (PMFs) of convergence@ $n$ for each method across the datasets. These PMFs are empirically estimated by generating $10^{5}$ column-wise bootstrap replicates through resampling the $N_{max}$ trials, then for each replicate, cumulatively evaluating the ranking at every $N$ (from 1 to 80) and identifying the minimal $n$ where the ranking stabilizes to the gold standard. This process captures the distribution of convergence points under repeated sampling, reflecting the inherent uncertainty in finite-sample rankings due to stochastic trial outcomes.

This bootstrapping approach provides a distribution over possible convergence points ( $n$ ), offering insights into the variability and reliability of each evaluation method: Pass@ $k$ (for $k = 2, 4, 8$ ) versus our Bayes@ $N$ . A lower mean convergence@ $n$ signifies more cost-effective convergence, while failure to converge within $80$ trials (as seen in AIME'25) indicates more trials are needed to confidently rank LLMs or we must include CI for a reliable ranking.

The key takeaways from Figure 1, as summarized in its caption, highlight the advantages of Bayes@ $N$ : it converges reliably on all datasets except AIME'25, often with fewer trials than Pass@ $k$ . For instance, on HMMT'25 and BrUMO'25, Bayes@ $N$ achieves mean convergence at approximately $44.2$ and $27.1$ trials, respectively, compared to around $69.5$ and $48.5$ for the best-performing Pass@ $k$ scores. The right panel of the figure further illustrates this through an example ranking from a bootstrap replicate, emphasizing differences in convergence for AIME'25 and BrUMO'25. See Appendix 17 (Figure 11) for the corresponding cumulative distribution functions (CDFs).

Worst-case scenarios

To further distinguish the Bayes@ $N$ framework from avg@ $N$ , we analyze the worst-case bootstrap replicates, i.e., those that either require the maximum number of trials to stabilize the rankings or fail to converge. For 11 LLMs, Figure 8 shows these trajectories as competition rankings, with each line tracing a model’s rank as trials accumulate; convergence is defined as the point at which the ranking order remains unchanged for all subsequent trials. In AIME'24 the ranking converges at trial 75, in HMMT'25 at trial 78, and in BrUMO'25 at trial 68, whereas in AIME'25 no convergence is observed within 80 trials, underscoring persistent instability and the need for additional trials or Bayes@ $N$ 's credible intervals. When a ranking does not converge within the trial budget (as for AIME'25 in Figure 1) only Bayes@ $N$ can be used to quantify uncertainty and estimate the minimum $N$ required for a reliable ranking (see Section 2.7.2).

This situation becomes even more severe as more models are included. As shown in Figure 9, when the number of models is increased to $L = 20$ , none of the datasets exhibit convergence. To examine convergence as a function of $L$ more systematically, we consider a pool of 20 LLMs (Table 7) and construct 50 subsets of 5 models, 20 subsets of 10 models, and 20 subsets of 15 models. For each subset, we generate $10^{5}$ bootstrap replicates to estimate convergence@ $n$ . Figure 6 reports the resulting convergence@ $n$ values across all subsets and replicates, showing that as the number of models increases, evaluation methods such as avg@ $N$ and the Pass@ $k$ family become unreliable for estimating model abilities and producing stable rankings.

**Figure 6.** Convergence@ $n$ without CI. Mean convergence@ $n$ across model combinations for AIME'24, AIME'25, HMMT'25, and BrUMO'25. **Top**: 50 combinations of 5 models. **Bottom-left**: 20 combinations of 10 models. **Bottom-right**: 20 combinations of 15 models. Color indicates the mean convergence@ $n$ over $10^{5}$ bootstrap replicates (green: fast convergence; red: slow convergence).

Rubric-Aware Categorical Evaluation

While evaluation is often reduced to binary correctness, this simplification discards valuable signals that capture other aspects of model behavior. For instance, LLM outputs can be assessed not only on correctness but also on whether they are well-structured, coherent, or exhibit step-by-step reasoning in mathematical tasks. In practice, evaluators could record richer dimensions such as format compliance, calibration of confidence, degenerate outputs, out-of-distribution (OOD) behavior, and verifier scores. This limitation is especially important for reasoning models, where overthinking [44] inflates token usage without corresponding gains in reliability. Bayes@ $N$ provides a principled way to capture these richer outcomes. By treating per-item results as categorical rather than binary, the approach aligns more closely with actual goals while preserving statistical rigor and transparency.

Concretely, for each question and trial, a set of base signals is logged (e.g., correctness, presence of a boxed final answer, length, and perplexity features). These base signals are then augmented with probabilistic labels from a lightweight reward model (e.g., calibrated probabilities of correct, wrong, or off-task) [45]. From these signals, rubric variables are defined (Table 4) and different categorical schemata are instantiated (Table 5), mapping each attempt into one of $C + 1$ categories. Under Bayes@ $N$ , the resulting category counts induce a Dirichlet posterior, and a rubric is specified by the weight vector $w$ in Algorithm 1. Different choices of schema and $w$ encode different evaluation preferences (e.g., stricter compliance, stronger penalties for confidently wrong answers, or efficiency-adjusted scoring). This procedure yields posterior means and credible intervals for each rubric.

Figure 10 summarizes aggregated results across tasks. The leader, Qwen Qwen3-30B-A3B-Thinking-2507, ranks first under all selected schemata, but the gap to rank 2 depends on the rubric: it is largest under Conf-Wrong Penalty and smallest under Verifier-Only. Mid-pack reorderings are rubric-sensitive: under Verifier Prob, OpenThinker OpenThinker2-32B edges out gpt-oss gpt-oss-20b-medium; under calibration-heavy schemata (e.g., Conf-Calibrated and Format+Confidence), gpt-oss gpt-oss-20b-high overtakes OpenThinker OpenThinker2-32B; and OOD Robustness narrows the gap between ranks 2 and 3. Several schemata (Format Aware, Length-Robust, and Strict Compliance) agree closely, indicating that once correctness is accounted for, formatting and length rarely flip the top ranks. In contrast, calibration-focused schemata emphasize and penalize confidently wrong behavior, and efficiency-oriented schemata favor concision. The lower tier is stable across schemata (EXAONE EXAONE-4.0-1.2B, OpenThinker OpenThinker3-1.5B, NVIDIA OpenReasoning-Nemotron-1.5B, Sky-T1-32B-Flash Sky-T1-32B-Flash, DeepSeek DeepSeek-R1-Distill-Qwen-1.5B), suggesting that rubric choice primarily reshuffles the middle while preserving the extremes. Overall, the categorical schemata surface complementary facets (format compliance, calibration, efficiency, OOD robustness, and verifier alignment), making rubric-dependent differences explicit and enabling compute-efficient, uncertainty-aware comparisons aligned with evaluation goals. For a comprehensive discussion of the categorical Bayesian evaluation framework, including base signals, schema definitions, and their impact on model rankings, see Appendix 13.

Table 3. Comparison of the Bayesian framework and other evaluation methods.

Methods ( $N$ trials)	Convergence	Credible interval	Prior knowledge	Categorical
Pass@ $k$ and alternatives	No	No	No	No
avg@ $N$	Yes	Limited (via bootstrap/binomial CIs)	No	No
Bayes@ $N$	Yes (Sec. 3.3, Figure 1, Figure 6)	Yes (Figure 4, Table 1, Table 2)	Yes (Sec. 2.7.1)	Yes (Sec. 3.4)

Related Work

Functional-correctness evaluation with Pass@ $k$ became standard in code generation with HumanEval (OpenAI Codex): generate $k$ samples, count a task as solved if any sample passes the unit tests, and estimate the overall rate with an unbiased estimator that requires producing $n \geq k$ samples per task [24]. Although Pass@ $k$ was initially introduced in the context of coding, it later became the de facto choice to evaluate LLMs not only on math reasoning tasks [46, 47, 48, 49, 50, 51, 52, 53, 27, 54] but also on safety evaluations spanning agent red-teaming, jailbreaks, and backdoor analyses [55, 56, 57, 58, 59, 60]. For a broader review of these metrics and their variants, see Appendix 15. Beyond standard Pass@ $k$ , Pass^ $k$ quantifies reliability across $k$ i.i.d. trials for agents, while the generalized G-Pass@ $k_{τ}$ continuum (and its area-under- $τ$ summary mG-Pass@ $k$ ) jointly assess potential and stability in reasoning outputs [61, 27].
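For reference, the unbiased Pass@ $k$ estimator mentioned above is commonly written as $1 - \binom{n - c}{k} / \binom{n}{k}$ , where $c$ is the number of correct samples among $n$ . The snippet below is a small sketch of that standard formula; the function name `pass_at_k` and the example counts are illustrative and not taken from this paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased per-task Pass@k estimate from n samples with c correct (n >= k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case any draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Benchmark-level score: average the per-task estimates (hypothetical counts).
correct_counts = [3, 0, 10]
scores = [pass_at_k(n=10, c=c, k=2) for c in correct_counts]
benchmark_pass_at_2 = sum(scores) / len(scores)
```

With $k = 1$ the estimator reduces to $c / n$ , i.e., the average accuracy on that task.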

Efforts like HELM advance holistic, transparent evaluation across scenarios and metrics [5], while practice guidelines distill reproducibility pitfalls and prescribe multi-run, uncertainty-aware reporting with fixed prompts, decoding, and dataset [62]. The LM Evaluation Harness offers standardized, reproducible frameworks to implement these recommendations [62]. It supports uncertainty reporting through binomial-style uncertainty estimates for binary mean metrics and bootstrap estimates for others.

The last category of related work focuses on measuring uncertainty in LLM evaluation. These works converge on interval-aware, small-sample-valid reporting rather than CLT/Wald error bars. Bowyer et al. show that CLT-based intervals miscalibrate on small benchmarks and advocate small- $n$ -appropriate frequentist or Bayesian intervals for reliable comparisons [42]. A Bayesian alternative models capability as a latent success probability and reports posterior uncertainty that remains informative with limited trials, yielding more stable rankings [34]. In judge-based settings, Judging LLMs on a Simplex places model and judge behavior on the probability simplex, enabling uncertainty-aware comparisons and highlighting how distributional structure matters for evaluation [63]. Beyond bespoke LLM metrics, prediction-powered inference supplies general procedures for valid confidence intervals that leverage model predictions to reduce labeled-sample requirements [64]. Finally, in adjacent retrieval evaluation with LLM-generated assessments, Oosterhuis et al. construct reliable confidence intervals and demonstrate that calibrated uncertainty, rather than point estimates, should guide decisions, reinforcing this shift for LLM evaluation more broadly [65].

Conclusion: Strengths, Limitations & Future Directions

The overall benefits of the Bayesian framework are summarized in Table 3: it provides fast convergence, analytical uncertainty estimates, and the incorporation of prior knowledge and categorical results. However, it is worth noting that our approach quantifies statistical uncertainty from finite samples; it does not fix dataset bias, distribution shift, or rubric misspecification. Results therefore depend on the chosen benchmark, prompts, and inference settings (hardware). Although we have validated our approach with simulations in which biased coins mimic LLMs, together with experiments using actual LLMs (up to $N_{max} = 80$ trials across four tasks and 20 models), more extensive evaluations may be constrained by computing and academic budgets.

The focus of the current work was the simplest version of the Bayesian approach, using a uniform prior, which provides a conservative and reproducible starting point. But the theory allows for more complex, informative priors, and this opens up a rich vein of future directions that should be systematically explored: for example, priors from past runs, domain- or task-conditioned priors, and expert-elicited priors. These have the potential to accelerate convergence even further, but must be chosen and reported carefully. Clear guidance and tools for prior elicitation will hopefully ensure that gains in sample efficiency do not come at the cost of hidden bias.

Ethics Statement

This research relies only on publicly available, non-personal benchmarks; no human subjects, user data, or PII are involved. Potential misuse includes cherry-picking priors, rubrics, or samples to exaggerate performance. To prevent this, use of Bayes@ $N$ with user-defined priors requires clear documentation and reporting of posterior credible intervals.

Reproducibility Statement

To ensure reproducibility, detailed implementation instructions are provided in Appendix 16.

Acknowledgments

This research was supported in part by NSF awards 2117439, 2112606, and 2320952.

Appendix

Derivation of Bayesian Estimator and Uncertainty

As described in the main text, the Bayesian framework is built on two quantities. The first is $μ (R)$ , the average of $\bar{π}$ over the joint posterior for all the questions:

μ (R) = \int_{Δ} d π_{1} \dots \int_{Δ} d π_{M} \, \bar{π} \prod_{α = 1}^{M} P (π_{α} ∣ R_{α}),

(2)

where the integration region $Δ$ is the probability simplex defined as the set of all possible $(C + 1)$ -dimensional vectors $p$ with $p_{k} \geq 0$ such that $\sum_{k = 0}^{C} p_{k} = 1$ . The second is the variance $σ^{2} (R)$ associated with our Bayesian estimator,

σ^{2} (R) = \int_{Δ} d π_{1} \dots \int_{Δ} d π_{M} \, (\bar{π} - μ (R))^{2} \prod_{α = 1}^{M} P (π_{α} ∣ R_{α}) .

(3)

Our derivation of closed-form expressions for $μ$ and $σ$ builds on the generalized ( $C > 1$ ) and original ( $C = 1$ ) Laplace rule of succession theory from [41], recovering those results in the special case of a single question ( $M = 1$ ). We start with Bayes' rule for each row of $R$ :

P (π_{α} ∣ R_{α}) = \frac{P ( R _{α} ∣ π _{α} ) P ( π _{α} )}{P ( R _{α} )} .

(4)

The likelihood $P (R_{α} ∣ π_{α})$ is a $(C + 1)$ -category multinomial distribution over $N$ trials, with the probability mass function:

P (R_{α} ∣ π_{α}) = \frac{N !}{n_{α 0} ! \, n_{α 1} ! \dots n_{α C} !} \prod_{k = 0}^{C} (π_{α k})^{n_{α k}},

(5)

where $n_{α k} = \sum_{i = 1}^{N} δ_{k, R_{α i}}$ , $n_{α}$ is the vector with elements $n_{α k}$ , and $δ_{i, j}$ is the Kronecker delta.

The prior $P (π_{α})$ is chosen as the conjugate prior of the multinomial, a Dirichlet distribution $P (π_{α}) \sim Dir (n_{α}^{0})$ , with concentration parameter vector $n_{α}^{0} = (n_{α 0}^{0}, \dots, n_{α C}^{0})$ [35]. A uniform prior (no prior knowledge) sets $n_{α k}^{0} = 1$ for all $k$ . Prior information from an earlier $M \times D$ matrix $R^{0}$ (with $R_{α i}^{0}$ as the category for the $i$ -th trial of the $α$ -th question) can be incorporated as:

n_{α k}^{0} = 1 + \sum_{i = 1}^{D} δ_{k, R_{α i}^{0}} .

(6)

The Dirichlet prior is:

P (π_{α}) = \frac{Γ (1 + C + D)}{\prod_{k = 0}^{C} Γ (n_{α k}^{0})} \prod_{k = 0}^{C} (π_{α k})^{n_{α k}^{0} - 1},

(7)

where $\sum_{k = 0}^{C} n_{α k}^{0} = 1 + C + D$ .

The normalization constant $P (R_{α})$ is:

P (R_{α}) = \int_{Δ} d p P (R_{α} ∣ p) P (p),

(8)

and since the Dirichlet is the conjugate prior, the posterior is $P (π_{α} ∣ R_{α}) \sim Dir (ν_{α})$ , with $ν_{α} = n_{α} + n_{α}^{0}$ . The posterior distribution is:

P (π_{α} ∣ R_{α}) = \frac{Γ (T)}{\prod_{k = 0}^{C} Γ (ν_{α k})} \prod_{k = 0}^{C} (π_{α k})^{ν_{α k} - 1},

(9)

where $T \equiv \sum_{k = 0}^{C} ν_{α k} = 1 + C + D + N$ .

The moment generating function $Φ (t) = ⟨ \exp (\bar{π} t) ⟩$ is:

Φ (t) = \int_{Δ} d π_{1} \dots \int_{Δ} d π_{M} \exp (t \bar{π}) \prod_{α = 1}^{M} P (π_{α} ∣ R_{α}) = \prod_{α = 1}^{M} \int_{Δ} d π_{α} \exp \left( \frac{t}{M} \sum_{k = 0}^{C} w_{k} π_{α k} \right) P (π_{α} ∣ R_{α}) = e^{t w_{0}} \prod_{α = 1}^{M} \int_{Δ} d π_{α} \exp \left( t \sum_{k = 1}^{C} s_{k} π_{α k} \right) P (π_{α} ∣ R_{α}),

(10)

where $s_{k} \equiv (w_{k} - w_{0}) / M$ , and $π_{α 0} = 1 - \sum_{k = 1}^{C} π_{α k}$ .

Each integral is the moment-generating function for a Dirichlet distribution, expressed via the confluent Lauricella hypergeometric function $Ψ^{[C]}$ :

Φ (t) = e^{t w_{0}} \prod_{α = 1}^{M} Ψ^{[C]} (ν_{α 1}, \dots, ν_{α C}; T; t s_{1}, \dots, t s_{C}),

(11)

where

Ψ^{[C]} (ν_{α 1}, \dots, ν_{α C}; T; t s_{1}, \dots, t s_{C}) = \sum_{m_{1} = 0}^{\infty} \dots \sum_{m_{C} = 0}^{\infty} \frac{(ν_{α 1})_{m_{1}} \dots (ν_{α C})_{m_{C}} (t s_{1})^{m_{1}} \dots (t s_{C})^{m_{C}}}{(T)_{m} \, m_{1} ! \dots m_{C} !},

(12)

with $m = m_{1} + \dots + m_{C}$ , and $(x)_{n} = x (x + 1) \dots (x + n - 1)$ is the Pochhammer symbol.

The moments are:

μ = Φ^{'} (0), σ^{2} = Φ^{''} (0) - (Φ^{'} (0))^{2} .

(13)

Expanding $Ψ^{[C]}$ to $O (t^{2})$ :

Ψ^{[C]} = 1 + \frac{t}{T} \sum_{j = 1}^{C} ν_{α j} s_{j} + \frac{t^{2}}{2 T (T + 1)} \sum_{j = 1}^{C} ν_{α j} (ν_{α j} + 1) s_{j}^{2} + \frac{t^{2}}{T (T + 1)} \sum_{ℓ = 1}^{C} \sum_{m = ℓ + 1}^{C} ν_{α ℓ} ν_{α m} s_{ℓ} s_{m} + O (t^{3}) .

(14)

Substituting into equation 11 and computing derivatives yields:

\begin{aligned} μ &= w_{0} + \frac{1}{M T} \sum_{α = 1}^{M} \sum_{j = 0}^{C} ν_{α j} (w_{j} - w_{0}), \\ σ^{2} &= \frac{1}{M^{2} (T + 1)} \sum_{α = 1}^{M} \left\{ \sum_{j = 0}^{C} \frac{ν_{α j}}{T} (w_{j} - w_{0})^{2} - \left( \sum_{j = 0}^{C} \frac{ν_{α j}}{T} (w_{j} - w_{0}) \right)^{2} \right\} . \end{aligned}

(15)

The algorithm summarizing this calculation is shown in Algorithm 1 in the main text.
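The closed-form expressions in equation 15 translate directly into code. The following is a minimal pure-Python sketch written from the formulas above, not a copy of Algorithm 1 or of the released implementation; the function name `bayes_at_n`, the list-of-lists input format, and the example data are assumptions.

```python
from math import sqrt


def bayes_at_n(R, w, R0=None):
    """Posterior mean and standard deviation from equation 15.

    R  : M rows, each a list of N category labels in {0, ..., C}.
    w  : list of C + 1 rubric weights.
    R0 : optional M rows of D prior labels (same category set); None means D = 0.
    """
    M, N = len(R), len(R[0])
    D = len(R0[0]) if R0 else 0
    K = len(w)                      # K = C + 1 categories
    T = K + D + N                   # T = 1 + C + D + N
    mean_sum = 0.0
    var_sum = 0.0
    for alpha in range(M):
        nu = [1] * K                # uniform-prior pseudo-counts
        for label in R[alpha]:
            nu[label] += 1
        if R0:
            for label in R0[alpha]:
                nu[label] += 1
        first = sum(nu[j] / T * (w[j] - w[0]) for j in range(K))
        second = sum(nu[j] / T * (w[j] - w[0]) ** 2 for j in range(K))
        mean_sum += first
        var_sum += second - first ** 2
    mu = w[0] + mean_sum / M
    # max() guards against tiny negative values caused by rounding.
    sigma = sqrt(max(0.0, var_sum) / (M ** 2 * (T + 1)))
    return mu, sigma


# Binary example (hypothetical data): one question, 10 trials, 7 correct.
R_example = [[1, 1, 1, 0, 1, 1, 0, 1, 1, 0]]
mu_example, sigma_example = bayes_at_n(R_example, w=[0, 1])
```

For this single-question binary example the posterior mean is $(1 + 7) / (2 + 10) = 2/3$ , which is the classical Laplace rule of succession.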

Proof of Equivalence of Bayesian and Average Rankings for Uniform Prior

For Bayesian estimators using a uniform prior (where $D = 0$ , $T = 1 + C + N$ , $ν_{α k} = 1 + n_{α k}$ ), the expression for the mean $μ$ from equation 15 simplifies as:

μ = w_{0} + \frac{1}{M (1 + C + N)} \sum_{α = 1}^{M} \sum_{j = 0}^{C} (1 + n_{α j}) (w_{j} - w_{0}) = A + \frac{1}{M (1 + C + N)} \sum_{α = 1}^{M} \sum_{j = 0}^{C} w_{j} n_{α j},

(16)

where the constant $A$ is given by

A = \frac{1}{1 + C + N} \sum_{j = 0}^{C} w_{j},

(17)

and $\sum_{j = 0}^{C} n_{α j} = N$ . Here, $μ$ relates to a naive weighted average accuracy $a$ over the number of answers in each category,

a = \frac{1}{M N} \sum_{α = 1}^{M} \sum_{j = 0}^{C} w_{j} n_{α j},

(18)

via

μ = A + \frac{N}{1 + C + N} a .

(19)

Note that in the binary case where $C = 1$ , $w_{0} = 0$ , $w_{1} = 1$ , the value of $a$ is just regular average accuracy avg@ $N$ . For categorical cases, it is just a weighted generalization of avg@ $N$ .

Since $A$ is the same for every model (it depends only on $C$ , $N$ , and the weights $w$ ) and the prefactor $\frac{N}{1 + C + N}$ is positive, $μ$ is a strictly increasing function of $a$ . Hence, for any two models, $μ > μ^{'}$ if and only if $a > a^{'}$ , so the two methods always produce the same ranking. Additionally, in the limit of a large number of trials, $N \to \infty$ , we see that $A \to 0$ and $μ \approx a$ , as expected.

This equivalence extends to uncertainty quantification. The relationship between the standard deviation of the average ( $σ_{avg@ N}$ ) and the Bayesian standard deviation ( $σ_{Bayes@ N}$ from equation 15) is

σ_{avg@ N} = \frac{1 + C + N}{N} σ_{Bayes@ N} .

(20)

The Bayesian expression for $σ_{Bayes@ N}$ is valid for all $M$ and $N$ , providing a reliable method to compute uncertainty in avg@ $N$ without relying on the Central Limit Theorem.
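As a quick numerical check of equations 15 to 20, consider a worked example with made-up counts (added here purely for illustration): the binary case $C = 1$ , $w = (0, 1)$ , with a single question ( $M = 1$ ), $N = 10$ trials, and $n_{1} = 7$ correct answers, so that $T = 12$ and $ν_{1} = 8$ .

```latex
% Illustrative counts only: C = 1, w = (0, 1), M = 1, N = 10, n_1 = 7.
\begin{aligned}
a &= \frac{7}{10} = 0.7, \qquad
A = \frac{w_0 + w_1}{1 + C + N} = \frac{1}{12}, \\
\mu &= A + \frac{N}{1 + C + N}\, a
     = \frac{1}{12} + \frac{10}{12} \cdot 0.7
     = \frac{8}{12} \approx 0.667, \\
\sigma_{\mathrm{Bayes@}N}^{2}
    &= \frac{1}{T + 1}\left[\frac{8}{12} - \left(\frac{8}{12}\right)^{2}\right]
     = \frac{1}{13} \cdot \frac{2}{9} = \frac{2}{117},
    \qquad \sigma_{\mathrm{Bayes@}N} \approx 0.131, \\
\sigma_{\mathrm{avg@}N}
    &= \frac{1 + C + N}{N}\, \sigma_{\mathrm{Bayes@}N}
     = \frac{12}{10} \cdot 0.131 \approx 0.157 .
\end{aligned}
```

The value $μ = 8/12$ agrees with evaluating equation 15 directly ( $ν_{1} / T = 8/12$ ), and the affine map from $a$ to $μ$ is what preserves the ordering of models.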

Runtime

To see the asymptotic runtime and memory scaling, let:

$M$ : number of problems (rows of $R$ ).
$N$ : number of trials per problem (columns of $R$ ).
$D$ : number of prior outcomes per problem (columns of $R^{0}$ , which may be $0$ ).
$C + 1$ : number of categories.

From Algorithm 1, the work is:

Two row-wise histograms: $O (M N)$ for $R$ plus $O (M D)$ for $R^{0}$ .
Posterior mean and variance on $ν \in \mathbb{R}^{M \times (C + 1)}$ : $O (M (C + 1))$ .

So the overall time complexity is:

O (M (N + D + C)),

i.e., linear in the number of entries in the result matrices and linear in the number of categories.

The memory footprint is likewise linear:

Store $R$ and (optionally) $R^{0}$ : $O (M N + M D)$ .
Store per-row category counts and derived arrays ( $ν$ , $ν / T$ ): $O (M (C + 1))$ .

Note that the evaluation consists of tallying counts and then plugging them into closed-form expressions for $μ$ and $σ$ ; no iterative optimization or Monte Carlo sampling is required.

Categorical Evaluation

Rubric-aware Bayes@N Evaluation of Reasoning Models

As discussed in Sections 2.3 and 3.4, for each question $α \in \{ 1, \dots, M \}$ , every attempt yields base signals such as has_box, is_correct, token_ratio, prompt_bpt, completion_bpt, and verifier probabilities compass_context_A, compass_context_B, and compass_context_C for correct, wrong, and invalid/off-task. Using thresholds and Boolean criteria, each attempt is mapped into one of $C + 1$ categories under a chosen schema (e.g., Format Aware, Conf-Wrong Penalty, Efficiency-Adjusted; Table 5). We instantiate categorical schemata and update posterior means via Dirichlet-multinomial inference, yielding metrics that preserve correctness while explicitly reflecting formatting, calibration, and efficiency.

Base signals

All signals are directly obtainable from common LLM inference stacks such as Hugging Face transformers [66] and vLLM [16], via per-step scores/log-probs and termination metadata, and require no model-specific instrumentation; the verifier probabilities compass_context_A, compass_context_B, and compass_context_C are defined in Section 13.1.2.

has_box: 1 if a final boxed answer is present; else 0.
is_correct: 1 if the answer is correct; else 0.
token_ratio: completion tokens normalized by 32,768.
repeated_pattern: 0 if finish_reason is stop; else 1 (degenerate output).
prompt_bpt: negative average prompt log-prob in bits/token (lower is better).
completion_bpt: negative average completion log-prob in bits/token (lower is better).
compass_context_A: verifier contextual probability of correct.
compass_context_B: verifier contextual probability of wrong.
compass_context_C: verifier contextual probability of irrelevant/off-task.

Reward models in evaluation.

While reward models are most familiar from fine-tuning (e.g., RLHF), we use one as a lightweight verifier to supply per-attempt label probabilities for

\{ \text{compass\_context\_A}, \text{compass\_context\_B}, \text{compass\_context\_C} \} = \{ \text{correct}, \text{wrong}, \text{invalid/off-task} \}

in evaluation. Concretely, we employ OpenCompass CompassVerifier-3B to produce probabilities and then apply contextual calibration to obtain a more robust, prompt-stable label distribution: we evaluate next-token scores for the candidate labels at a fixed answer slot, subtract a content-free baseline logit $b_{y}$ from the task logit $s_{y}$ for each label $y$ , and apply temperature scaling to yield calibrated probabilities

p (y ∣ x) = \mathrm{softmax}_{y} \left( \frac{s_{y} - b_{y}}{T} \right) .

This helps us mitigate saturation and the entanglement of formatting and confidence seen with last-token probabilities, and improves probability calibration for downstream rubric scoring.
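The calibration step can be illustrated with a few lines of code. This is a sketch of the formula only: the function name `calibrated_probs`, the dictionary-based inputs, and the example logits are hypothetical, and obtaining the task and content-free logits from the verifier is outside the scope of the snippet.

```python
from math import exp


def calibrated_probs(task_logits, baseline_logits, temperature=1.0):
    """Softmax over labels of (s_y - b_y) / T, computed in a numerically stable way.

    task_logits, baseline_logits: dicts mapping each label y to s_y and b_y.
    """
    labels = list(task_logits)
    shifted = [
        (task_logits[y] - baseline_logits[y]) / temperature for y in labels
    ]
    top = max(shifted)                       # subtract the max before exponentiating
    weights = [exp(v - top) for v in shifted]
    total = sum(weights)
    return {y: wgt / total for y, wgt in zip(labels, weights)}


# Hypothetical logits: the raw scores favour label A, but so does the
# content-free baseline, so the calibrated distribution is uniform.
s = {"A": 2.0, "B": 1.0, "C": 0.0}
b = {"A": 1.0, "B": 0.0, "C": -1.0}
probs = calibrated_probs(s, b, temperature=1.0)
```

The example shows the intent of the baseline subtraction: a preference for a label that is already present without any content is removed before the probabilities are formed.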

Selected categorical schema.

We define 12 schemata (Table 5) using the rubric variables (Table 4) derived from the base signals; here are two illustrative definitions (the others follow analogously):

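Format Aware: \[ \text{cat} = \begin{cases} 0 & \text{invalid} \\ 1 & \text{wrong} \land \text{unboxed} \\ 2 & \text{wrong} \land \text{boxed} \\ 3 & \text{correct} \land \text{unboxed} \\ 4 & \text{correct} \land \text{boxed} \end{cases} \]
Conf-Wrong Penalty: \[ \text{cat} = \begin{cases} 0 & \text{invalid} \\ 1 & \text{wrong\_high\_conf} \\ 2 & \text{wrong} \land \text{low\_conf} \\ 3 & \text{correct} \end{cases} \]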

Rubric weights $w$ are chosen to reflect evaluation preferences. For example, Format Aware might use $w = [0, 0, 1, 2, 3]$ to mildly reward boxed formatting on top of correctness, while penalties for confidently wrong answers are introduced through the choice of schema (e.g., Conf-Wrong Penalty); Efficiency-Adjusted can downweight verbose outputs among both correct and wrong categories.
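As an illustration of how an attempt is mapped to a category, the sketch below implements the Format Aware schema using the rubric variables of Table 4 (invalid, correct, boxed). The function names, the dictionary representation of an attempt, and the example attempts are assumptions for this example; only the decision rules come from the tables.

```python
from collections import Counter


def is_invalid(attempt):
    """Category 0 rule: degenerate repetition or high off-task probability."""
    return attempt["repeated_pattern"] == 1 or attempt["compass_context_C"] >= 0.5


def format_aware_category(attempt):
    """Format Aware: 0 invalid, 1 wrong/unboxed, 2 wrong/boxed,
    3 correct/unboxed, 4 correct/boxed."""
    if is_invalid(attempt):
        return 0
    correct = attempt["is_correct"] >= 0.5
    boxed = attempt["has_box"] >= 0.5
    return 1 + 2 * int(correct) + int(boxed)


# Hypothetical attempts for a single question.
attempts = [
    {"has_box": 1, "is_correct": 1, "repeated_pattern": 0, "compass_context_C": 0.05},
    {"has_box": 0, "is_correct": 1, "repeated_pattern": 0, "compass_context_C": 0.10},
    {"has_box": 1, "is_correct": 0, "repeated_pattern": 0, "compass_context_C": 0.20},
    {"has_box": 0, "is_correct": 0, "repeated_pattern": 0, "compass_context_C": 0.10},
    {"has_box": 1, "is_correct": 1, "repeated_pattern": 1, "compass_context_C": 0.05},
]
categories = [format_aware_category(a) for a in attempts]
counts = Counter(categories)          # category counts n_k for this question

# Naive weighted average under the example weights w = [0, 0, 1, 2, 3].
w = [0, 0, 1, 2, 3]
weighted_avg = sum(w[c] for c in categories) / len(categories)
```

The per-question category counts are exactly the quantities $n_{α k}$ that enter the Dirichlet posterior, so the same labels can be passed to the closed-form estimator with any weight vector $w$ .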

Exact Match: Correctness only; ignores formatting, confidence, and length.
Format Aware: Rewards boxed, well-formatted answers; distinguishes boxed/unboxed even when wrong.
Conf-Calibrated: Penalizes confidently wrong; grades correct answers by confidence (low/mid/high).
OOD Robustness: Separates in-distribution vs. OOD prompts; checks correctness under both.
Strict Compliance: Requires boxed final answers; unboxed-correct is treated as non-compliant.
Conf-Wrong Penalty: Heavier penalty for wrong answers at high confidence; lighter when uncertain.
Verifier-Only: Uses verifier signals alone to rank; relies on the model-agnostic probabilities of the verifier.
Format+Confidence: Balanced composite over (boxed/unboxed) $\times$ (low/high confidence) for both wrong and correct; emphasizes boxed, high-confidence correctness and penalizes confidently wrong.
Length-Robust: Isolates correctness irrespective of verbosity; does not penalize length.
Verifier Prob: Probes agreement with the verifier: flags wrong with high verifier A as inconsistent and distinguishes under/over-confidence on correct.
Efficiency-Adjusted: Rewards short, correct completions; penalizes verbose outputs (especially when wrong).
Concision-High-Conf: Prefers concise, high-confidence correct answers; downweights verbose correctness.

Table 4. Rubric variables, decision formulas, and brief descriptions used to map each model attempt into discrete categories. Thresholds ( $τ_{\text{high}}, τ_{\text{low\_wrong}}, τ_{\text{prompt}}$ ) and length quantiles ( len_p33, len_p66 ) are computed per dataset from observed bits-per-token and token-ratio statistics. Category $0$ is reserved for invalid outputs (degenerate repetition or high verifier compass_context_C), and compass_context_A, compass_context_B, compass_context_C denote calibrated verifier probabilities for correct, wrong, and off-task, respectively.

lll@ Rubric variables	Formula	Description
invalid	$(repeated_pattern = 1) \lor (co m p a ss_co n t e x t_C \geq 0.5)$	Category 0 reserved for invalid.
correct	$(is_correct \geq 0.5)$	Boolean mask of correctness.
wrong	$(is_correct < 0.5)$	Complement of correct.
high_conf	$(completion_bpt \leq τ_{high})$	Confidence proxy
low_conf	$(completion_bpt > τ_{high})$	Complement of high_conf.
wrong_high_conf	wrong $\land$ $(completion_bpt \leq τ_{low_wrong})$	Penalize confidently wrong.
ood	$(prompt_bpt \geq τ_{prompt})$	Out-of-distribution prompt.
ind	$(prompt_bpt < τ_{prompt})$	In-distribution prompt.
economical	$(token_ratio \leq len_p33)$	Short completions.
moderate	$(len_p33 < token_ratio \leq len_p66)$	Medium-length completions.
verbose	$(token_ratio > len_p66)$	Long completions.
boxed	$(has_box \geq 0.5)$	Answer is boxed.
unboxed	$(has_box < 0.5)$	Answer is not boxed.
A_high	$(co m p a ss_co n t e x t_A \geq 0.6)$	Verifier confidence high.
$τ_{high}$	40th percentile of $completion_bpt$
$τ_{low_wrong}$	60th percentile of $completion_bpt$ among wrong items
$τ_{prompt}$	90th percentile of $prompt_bpt$
$len_p33, len_p66$	33rd and 66th percentiles of $token_ratio$
$corr_p33, corr_p66$	33rd and 66th percentiles of $completion_bpt$ among correct items
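The thresholds and masks in Table 4 can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the record fields mirror the variable names in the table, the sample values are made up, and the nearest-rank percentile rule is an assumption, since the text does not state which interpolation is used. Only a subset of the rubric variables is shown.

```python
def nearest_rank_percentile(values, q):
    """q-th percentile by the nearest-rank rule (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q * n / 100)
    return ordered[rank - 1]

# Hypothetical attempts; field names follow Table 4, values are illustrative only.
attempts = [
    {"is_correct": 1, "completion_bpt": 0.20, "token_ratio": 0.5, "has_box": 1},
    {"is_correct": 0, "completion_bpt": 0.90, "token_ratio": 2.0, "has_box": 0},
    {"is_correct": 1, "completion_bpt": 0.40, "token_ratio": 1.0, "has_box": 1},
    {"is_correct": 0, "completion_bpt": 0.30, "token_ratio": 1.5, "has_box": 1},
    {"is_correct": 1, "completion_bpt": 0.60, "token_ratio": 0.8, "has_box": 0},
]

# Per-dataset thresholds: 40th percentile of bits-per-token, 33rd/66th of token ratio.
tau_high = nearest_rank_percentile([a["completion_bpt"] for a in attempts], 40)
len_p33 = nearest_rank_percentile([a["token_ratio"] for a in attempts], 33)
len_p66 = nearest_rank_percentile([a["token_ratio"] for a in attempts], 66)

def rubric_flags(a):
    """Boolean rubric variables for one attempt (subset of Table 4)."""
    ratio = a["token_ratio"]
    return {
        "correct": a["is_correct"] >= 0.5,
        "high_conf": a["completion_bpt"] <= tau_high,
        "boxed": a["has_box"] >= 0.5,
        "economical": ratio <= len_p33,
        "moderate": len_p33 < ratio <= len_p66,
        "verbose": ratio > len_p66,
    }

flags = [rubric_flags(a) for a in attempts]
```

The complements (wrong, low_conf, unboxed) follow by negating the corresponding flag, and exactly one of the three length flags is true for every attempt.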

Table 5. Definitions of the twelve categorical evaluation schemata used in our Dirichlet–multinomial framework. Each schema specifies decision rules over correctness, formatting (boxed/unboxed), confidence (via $completion_bpt$ ), prompt distribution (in-distribution vs. OOD), output economy (via $token_ratio$ ), and verifier signals $(A, B, C)$ . These rules map every attempt into $C + 1$ discrete categories, enabling posterior means and credible intervals for any chosen weight vector $w$ .

Categorical Schema	Rubric
Exact Match	0 invalid; 1 wrong; 2 correct
Format Aware	0 invalid; 1 wrong $\land$ unboxed; 2 wrong $\land$ boxed; 3 correct $\land$ unboxed; 4 correct $\land$ boxed
Conf-Calibrated	0 invalid; 1 wrong $\land$ low_conf; 2 wrong_high_conf; 3 correct $\land$ low_conf; 4 correct $\land$ mid; 5 correct $\land$ high_conf
OOD Robustness	0 invalid; 1 ood $\land$ wrong; 2 ind $\land$ wrong; 3 ood $\land$ correct; 4 ind $\land$ correct
Strict Compliance	0 invalid; 1 wrong $\lor$ (correct $\land$ unboxed); 2 correct $\land$ boxed
Conf-Wrong Penalty	0 invalid; 1 wrong_high_conf; 2 wrong $\land$ low_conf; 3 correct
Verifier-Only	0 invalid; 1 high C; 2 high B; 3 A_high
Format+Confidence	0 invalid; 1 wrong $\land$ unboxed; 2 wrong $\land$ boxed $\land$ low_conf; 3 wrong $\land$ boxed $\land$ high_conf; 4 correct $\land$ unboxed $\land$ low_conf; 5 correct $\land$ unboxed $\land$ high_conf; 6 correct $\land$ boxed $\land$ low_conf; 7 correct $\land$ boxed $\land$ high_conf
Length-Robust	0 invalid; 1 wrong; 2 correct
Verifier Prob	0 invalid; 1 wrong $\land$ A_high; 2 wrong $\land$ $\neg$ A_high; 3 correct $\land$ $\neg$ A_high; 4 correct $\land$ A_high
Efficiency-Adjusted	0 invalid; 1 wrong $\land$ economical; 2 wrong $\land$ moderate; 3 wrong $\land$ verbose; 4 correct $\land$ economical; 5 correct $\land$ moderate; 6 correct $\land$ verbose
Concision-High-Conf	0 invalid; 1 wrong; 2 correct $\land$ verbose; 3 correct $\land$ moderate; 4 correct $\land$ economical; 5 correct $\land$ economical $\land$ high_conf
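To make the mapping from rubric variables to category indices concrete, the following sketch encodes two of the schemata from Table 5. The function names and argument layout are hypothetical; only the category numbering is taken from the table.

```python
LENGTH_BUCKETS = ("economical", "moderate", "verbose")

def format_aware(invalid, correct, boxed):
    """Format Aware: 0 invalid; 1 wrong/unboxed; 2 wrong/boxed;
    3 correct/unboxed; 4 correct/boxed."""
    if invalid:
        return 0
    return 1 + 2 * int(correct) + int(boxed)

def efficiency_adjusted(invalid, correct, length_bucket):
    """Efficiency-Adjusted: 0 invalid; 1-3 wrong (economical, moderate, verbose);
    4-6 correct (economical, moderate, verbose)."""
    if invalid:
        return 0
    return 1 + 3 * int(correct) + LENGTH_BUCKETS.index(length_bucket)

# One row of a results matrix R for a single problem with three attempts
# (attempt values are illustrative only).
row = [
    format_aware(invalid=False, correct=True, boxed=True),
    format_aware(invalid=False, correct=False, boxed=True),
    format_aware(invalid=True, correct=False, boxed=False),
]
```

Each schema therefore yields an integer label per attempt, which is the only input the Bayesian estimator needs besides the weight vector.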

Domain-agnostic rubric-aware Bayes@N

The Bayesian construction is intentionally domain-agnostic: it applies whenever model outputs can be mapped into a finite set of categories equipped with a rubric. The evaluator specifies

a mapping from raw outputs (and any side information) to categorical labels $R_{α i} \in {0, \dots, C}$ , and
a weight vector $w$ that encodes how those categories are valued.

Given these choices, Bayes@N returns the posterior mean $μ (R)$ as a rubric-aware point estimate, and $σ (R)$ as an uncertainty estimate, for any such categorical evaluation.
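As a rough illustration of what such a point estimate and uncertainty look like, the sketch below applies standard Dirichlet-multinomial conjugacy. It assumes an independent uniform Dirichlet prior per problem and independence across problems; the function name `bayes_sketch` is hypothetical, and the code is not claimed to reproduce the released implementation.

```python
import math

def bayes_sketch(R, w):
    """Posterior mean and standard deviation of the average rubric score.

    R: list of rows, one per problem, with labels in {0, ..., C}.
    w: list of C + 1 category weights.
    Assumes a uniform Dirichlet prior (one pseudo-count per category).
    """
    M = len(R)
    mean_sum = 0.0
    var_sum = 0.0
    for row in R:
        nu = [1] * len(w)                 # uniform prior pseudo-counts
        for label in row:
            nu[label] += 1                # add observed counts
        T = sum(nu)                       # equals C + 1 + N
        p = [v / T for v in nu]           # posterior mean category probabilities
        m1 = sum(wk * pk for wk, pk in zip(w, p))
        m2 = sum(wk * wk * pk for wk, pk in zip(w, p))
        mean_sum += m1
        var_sum += (m2 - m1 * m1) / (T + 1)   # variance of w . pi under Dirichlet(nu)
    return mean_sum / M, math.sqrt(var_sum) / M

# Binary example: weights (0, 1) recover a success-probability estimate.
mu_bin, sigma_bin = bayes_sketch([[0, 1, 1, 0, 1], [1, 1, 0, 1, 1]], [0.0, 1.0])

# Graded example: 0 = incorrect, 1 = partial credit, 2 = correct.
mu_graded, sigma_graded = bayes_sketch([[0, 2, 1, 0, 2], [2, 1, 1, 2, 1]], [0.0, 0.5, 1.0])
```

In the binary case each problem contributes the familiar Beta posterior mean (successes + 1) / (N + 2), so the estimate is a smoothed version of average accuracy.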

This viewpoint naturally covers subjective tasks. For instance:

In summarization, each response could be rated ${bad, okay, good, excellent}$ or by multi-criteria scores such as faithfulness, coverage, style, and harmful content. Each discrete level becomes a category index $k$ , and $w_{k}$ reflects the importance of that level or criterion.
In dialogue safety, categories might distinguish ${unsafe, borderline, safe}$ , or finer-grained notions such as policy violations vs. merely over-cautious refusals.

Once the labels are available (from humans or an LLM-as-a-judge), Bayes@N provides Bayesian estimates and credible intervals for any chosen rubric-based score, reusing the same closed-form posterior as in the binary case.

Two aspects are particularly promising for future work in such subjective domains:

Preference-based evaluation with rubrics. When model comparisons are driven by preferences (either from human experts or LLM judges), each comparison can be converted into categorical labels over rubric dimensions (e.g., faithfulness, verbosity, harmfulness). A downstream weight vector $w$ can then fold these dimensions into a single scalar score that reflects application-specific trade-offs.
Transferring prior evidence across related tasks. The optional prior matrix $R_{0}$ in Algorithm 1 lets us encode earlier outcome frequencies as a Dirichlet prior. For example, if a summarization system has been evaluated on a news dataset, the empirical category counts on that dataset can serve as prior counts when evaluating a closely related dataset. This allows stable rubric distributions to be reused across adjacent tasks or benchmark revisions, while still updating with new data.
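The effect of reusing earlier outcomes as prior evidence can be illustrated for a single item. The sketch below treats prior labels as extra pseudo-counts on top of a uniform Dirichlet prior; the function name, the four-level rating scale, and the example labels are all assumptions made for illustration.

```python
def posterior_category_probs(new_labels, prior_labels, num_categories):
    """Posterior mean category probabilities with prior outcomes as pseudo-counts."""
    nu = [1] * num_categories             # uniform Dirichlet base prior
    for label in list(prior_labels) + list(new_labels):
        nu[label] += 1
    total = sum(nu)
    return [v / total for v in nu]

# Hypothetical ratings: 0 = bad, 1 = okay, 2 = good, 3 = excellent.
prior = [2, 2, 3, 2, 1, 2]    # earlier evaluation on a related dataset
new = [2, 3]                  # small new evaluation

without_prior = posterior_category_probs(new, [], 4)
with_prior = posterior_category_probs(new, prior, 4)

w = [0.0, 1 / 3, 2 / 3, 1.0]  # illustrative weights for the four levels
score_without = sum(wk * pk for wk, pk in zip(w, without_prior))
score_with = sum(wk * pk for wk, pk in zip(w, with_prior))
```

With only two new labels, the prior counts dominate the posterior; as more new data arrive, their relative influence shrinks.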

An important limitation in subjective settings is that Bayes@N does not resolve disagreement or bias in the rubric or labeling process itself. The framework assumes a labeling scheme (from humans or an LLM-based judge) and a weight vector $w$ are given; it then provides a statistically principled way to aggregate those labels and quantify uncertainty. Designing good rubrics and calibrating judges remain separate modeling decisions.

`Scorio`

Alongside this paper, we release Scorio, an open-source Python package that implements the evaluation framework presented in this work. Scorio provides a simple, unified API for computing Bayes@ $N$ , avg@ $N$ , Pass@ $k$ , and their credible intervals, enabling researchers to adopt principled Bayesian evaluation with minimal effort. The package is available on PyPI and its documentation is hosted at https://scorio.readthedocs.io.

Installation.

Scorio can be installed via:

pip install scorio

Basic usage.

All evaluation functions operate on a results matrix $R \in {0, \dots, C}^{M \times N}$ , where $M$ is the number of problems, $N$ is the number of trials per problem, and $C + 1$ is the number of outcome categories. An optional weight vector $w$ of length $C + 1$ maps each category to a score. Listing 2 shows binary evaluation using both the Bayesian estimator and Pass@ $k$ .

Listing 2. Binary evaluation with Scorio.

import numpy as np
from scorio import eval

# Binary outcomes: M=2 problems, N=5 trials each
R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])

# Bayesian evaluation (binary: w defaults to (0, 1))
mu, sigma = eval.bayes(R)
print(f"Bayes@5: mu={mu:.4f}, sigma={sigma:.4f}")

# Average accuracy
a, sigma_a = eval.avg(R)
print(f"avg@5:   mu={a:.4f}, sigma={sigma_a:.4f}")

# Pass@k
print(f"Pass@1 = {eval.pass_at_k(R, k=1):.4f}")
print(f"Pass@2 = {eval.pass_at_k(R, k=2):.4f}")

Credible intervals.

Each estimator has a companion _ci function that returns the posterior mean, standard deviation, and a credible interval, as shown in Listing 3.

Listing 3. Computing credible intervals.

# 95% credible interval for the Bayesian estimate
mu, sigma, lo, hi = eval.bayes_ci(R, confidence=0.95)
print(f"Bayes@5: {mu:.4f} [{lo:.4f}, {hi:.4f}]")

# 95% credible interval for Pass@1
mu, sigma, lo, hi = eval.pass_at_k_ci(R, k=1)
print(f"Pass@1:  {mu:.4f} [{lo:.4f}, {hi:.4f}]")
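This section does not state how the interval endpoints are derived from the posterior. One common construction, assumed here purely for illustration, is a Gaussian approximation around the posterior mean, clipped to the valid score range (taken to be [0, 1], which holds when all weights lie in that range). The helper name `normal_interval` is hypothetical.

```python
from statistics import NormalDist

def normal_interval(mu, sigma, confidence=0.95, lower=0.0, upper=1.0):
    """Two-sided interval mu +/- z * sigma under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return max(lower, mu - z * sigma), min(upper, mu + z * sigma)

lo95, hi95 = normal_interval(0.5, 0.1)
lo_clip, hi_clip = normal_interval(0.98, 0.05)       # upper end is clipped at 1.0
```

Under this construction, two models are separated at the chosen confidence level when their intervals do not overlap.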

Categorical (rubric-based) evaluation.

For graded outcomes with $C > 1$ categories, a weight vector $w$ specifies the score associated with each category. Listing 4 illustrates evaluation under a three-level rubric ( $C = 2$ ) with partial credit.

Listing 4. Categorical evaluation with a weighted rubric.

# Graded outcomes: 0=incorrect, 1=partial, 2=correct
R = np.array([[0, 2, 1, 0, 2],
              [2, 1, 1, 2, 1]])

# Weight vector: incorrect=0, partial=0.5, correct=1
w = np.array([0.0, 0.5, 1.0])

mu, sigma = eval.bayes(R, w)
print(f"Bayes@5 (graded): mu={mu:.4f}, sigma={sigma:.4f}")

mu, sigma, lo, hi = eval.bayes_ci(R, w, confidence=0.95)
print(f"95% CI: [{lo:.4f}, {hi:.4f}]")

Incorporating prior evidence.

When prior evaluation data are available (e.g., from a previous benchmark or a pilot study), they can be passed as a prior matrix $R^{0}$ to inform the posterior, as described in Section 2.7.1. Listing 5 shows a minimal example in which a short pilot run is reused as prior evidence for a new evaluation.

Listing 5. Using prior evidence.

# Prior outcomes from a pilot study (M=2, D=3 trials)
R0 = np.array([[1, 0, 1],
               [0, 1, 0]])

mu, sigma = eval.bayes(R, w=None, R0=R0)
print(f"Bayes@5 (with prior): mu={mu:.4f}, sigma={sigma:.4f}")

Table 6 summarizes the main Scorio API.

Table 6. Summary of the Scorio evaluation API. All functions accept a results matrix $R \in {0, \dots, C}^{M \times N}$ . Functions with the _ci suffix additionally return a credible interval.

Function	Returns	Description
`bayes(R, w, R0)`	$(μ, σ)$	Bayesian posterior mean and uncertainty
`bayes_ci(R, w, R0)`	$(μ, σ, lo, hi)$	+ credible interval
`avg(R, w)`	$(a, σ_{a})$	Weighted average and uncertainty
`avg_ci(R, w)`	$(a, σ_{a}, lo, hi)$	+ credible interval
`pass_at_k(R, k)`	$p$	Pass@ $k$ estimate
`pass_at_k_ci(R, k)`	$(μ, σ, lo, hi)$	+ credible interval
`pass_hat_k(R, k)`	$p$	Pass^ $k$
`g_pass_at_k_tau(R, k, tau)`	$p$	G-Pass@ $k_{\tilde{τ}}$
`mg_pass_at_k(R, k)`	$p$	mG-Pass@ $k$

The evaluation of LLMs in generative reasoning tasks under test-time scaling (e.g., via repeated sampling [67]) has evolved to address the stochastic nature of inference and the need for robust measures of functional correctness. Early approaches relied on syntactic similarity metrics like BLEU [68] and CodeBLEU [69], which compare generated answers against reference solutions. However, these metrics often fail to capture semantic correctness in reasoning tasks, motivating metrics based on execution or test-based validation [70, 69]. This limitation has shifted focus toward functional evaluation, where the generated solution is checked against a ground truth to verify correctness [70, 71]. In this section, we review key functional metrics, focusing on those that leverage multiple samples to scale performance at inference time. These metrics form the basis for assessing LLM capabilities but often overlook probabilistic uncertainty or consistency across samples, motivating our Bayesian framework.

The Pass@ $k$ metric was originally introduced by [70, 24] for evaluating LLMs trained on code. It measures the probability that at least one of $k$ independently generated samples for a given problem passes all associated unit tests (or, outside code, matches the ground-truth answer or satisfies the logical constraints), offering a practical estimate of a model's potential performance on complex tasks. The unbiased estimator of Pass@ $k$ is computed as:

Pass@ k = E_{problems} [1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}],

(21)

where $n$ is the total number of generated samples and $c$ is the total number of correct solutions within the $n$ trials. This estimator has smaller uncertainty in the limit of $n ≫ k$ , ensuring reliable approximations. However, due to computational costs, $k$ is often comparable to $n$ in practice, which can increase variance and weaken evaluation stability. The Pass@ $k$ metric has been adapted beyond code to evaluate LLMs in various tasks requiring verifiable correctness, such as math, logic, and general reasoning [71, 72, 73, 74].
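The estimator can be written directly with binomial coefficients. The sketch below is a minimal illustration for binary outcomes, with counts taken per problem and $k \leq n$ assumed; the function name is chosen for readability and is not tied to any particular package.

```python
from math import comb

def pass_at_k(R, k):
    """Unbiased Pass@k estimate averaged over problems.

    R: list of rows, one per problem, with binary outcomes (1 = correct).
    """
    total = 0.0
    for row in R:
        n = len(row)
        c = sum(1 for r in row if r == 1)
        # Probability that a random size-k subset contains no correct sample.
        total += 1.0 - comb(n - c, k) / comb(n, k)
    return total / len(R)

R = [[0, 1, 1, 0, 1],
     [1, 1, 0, 1, 1]]
p1 = pass_at_k(R, 1)   # equals average accuracy
p2 = pass_at_k(R, 2)
```

For $k = 1$ the estimator reduces to the per-problem fraction of correct samples, averaged over problems.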

Pass^ $k$ , introduced in [61], extends the Pass@ $k$ metric to capture both the potential performance and the consistency of LLMs in reasoning tasks, where evaluating the reliability and stability of generated solutions is crucial. Pass^ $k$ is defined as the probability that all $k$ trials are correct:

Pass^k = E_{problems} [\frac{\binom{c}{k}}{\binom{n}{k}}],

(22)

where $c$ and $n$ retain the same meanings as in Pass@ $k$ . This metric assumes that all the trials are independent and identically distributed, approximating the binomial distribution with a hypergeometric distribution to account for sampling without replacement. By requiring all $k$ samples to be correct, Pass^ $k$ provides a stringent measure of model consistency and stability.
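A matching sketch for Pass^ $k$ follows, under the same assumptions as before (binary outcomes, per-problem counts, $k \leq n$); the function name is illustrative.

```python
from math import comb

def pass_hat_k(R, k):
    """Pass^k: probability that all k samples drawn without replacement are correct."""
    total = 0.0
    for row in R:
        n = len(row)
        c = sum(1 for r in row if r == 1)
        total += comb(c, k) / comb(n, k)   # comb(c, k) is 0 when c < k
    return total / len(R)

R = [[0, 1, 1, 0, 1],
     [1, 1, 0, 1, 1]]
h1 = pass_hat_k(R, 1)   # coincides with Pass@1
h2 = pass_hat_k(R, 2)
```

On the same data, Pass^ $k$ decreases with $k$ while Pass@ $k$ increases, which is why the two are read as consistency and potential, respectively.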

To introduce flexibility, [27] proposed G-Pass@ $k_{\tilde{τ}}$ , which incorporates a tolerance threshold $\tilde{τ} \in (0.0, 1.0]$ :

G-Pass@ k_{\tilde{τ}} = E_{problems} [\sum_{j = ⌈ τ \cdot k ⌉}^{c} \frac{\binom{c}{j} \cdot \binom{n - c}{k - j}}{\binom{n}{k}}],

(23)

where $⌈ τ \cdot k ⌉$ is the smallest integer greater than or equal to $τ \cdot k$ . This formulation allows up to $k - ⌈ τ \cdot k ⌉$ incorrect solutions, balancing the assessment of potential with consistency. As a special case, Pass@ $k$ corresponds to G-Pass@ $k_{\tilde{τ}}$ in the limit $τ \to 0$ .

Furthermore, [27] introduced mG-Pass@ $k$ , an interpolated metric that integrates G-Pass@ $k_{\tilde{τ}}$ over $τ \in [0.5, 1.0]$ :

mG-Pass@ k = 2 \int_{0.5}^{1.0} G-Pass@ k_{τ} \, d τ \approx \frac{2}{k} \sum_{i = ⌈ 0.5 \cdot k ⌉ + 1}^{k} G-Pass@ k_{i / k},

(24)

providing a more comprehensive measure that jointly reflects performance potential and reasoning stability.
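Both quantities are hypergeometric tail sums and can be sketched for a single problem with $n$ samples, $c$ of them correct. The helper names below are hypothetical, and the small epsilon in the ceiling is an implementation choice to avoid floating-point overshoot in $τ \cdot k$; averaging the per-problem values over problems gives the reported metric.

```python
from math import ceil, comb

def g_pass_from_threshold(n, c, k, j_min):
    """P(at least j_min of k samples drawn without replacement are correct)."""
    j_max = min(c, k)
    hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(j_min, j_max + 1))
    return hits / comb(n, k)

def g_pass_at_k_tau(n, c, k, tau):
    """G-Pass@k_tau for one problem, with tau in (0, 1]."""
    j_min = max(1, ceil(tau * k - 1e-9))
    return g_pass_from_threshold(n, c, k, j_min)

def mg_pass_at_k(n, c, k):
    """Discrete approximation of mG-Pass@k for one problem."""
    start = (k + 1) // 2 + 1               # ceil(0.5 * k) + 1 in integer arithmetic
    return 2.0 / k * sum(g_pass_from_threshold(n, c, k, i) for i in range(start, k + 1))

g_strict = g_pass_at_k_tau(5, 3, 2, 1.0)    # all k correct: same as Pass^k
g_loose = g_pass_at_k_tau(5, 3, 2, 0.01)    # at least one correct: same as Pass@k
mg = mg_pass_at_k(8, 6, 4)
```

The two limiting cases make the interpolation explicit: $τ = 1$ recovers Pass^ $k$ and $τ \to 0$ recovers Pass@ $k$ .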

These extended metrics have been applied to mathematical reasoning benchmarks such as LiveMathBench, MATH, and AIME, where they reveal substantial performance degradation of LLMs under stricter stability requirements.

Experiment Setup and Reproducibility

Metrics

Kendall's Tau:

Kendall's tau ( $τ$ ) [75] is a nonparametric rank correlation coefficient that quantifies the ordinal relationship between two ranked sets by evaluating the consistency in their orderings. For two rankings of $n$ items, it examines all unique pairs $(i, j)$ where $i < j$ :

A pair is concordant if the relative ordering of items $i$ and $j$ is the same in both rankings (both place $i$ before $j$ or vice versa).
A pair is discordant if the relative ordering is different.
Pairs with ties in either ranking are neither concordant nor discordant.

Define $n_{c}$ as the number of concordant pairs, $n_{d}$ as the number of discordant pairs, and $n_{0} = n (n - 1) /2$ as the total number of unique pairs. Let $n_{1}$ represent the number of tied pairs in the first ranking, and $n_{2}$ similarly for the second ranking. The two common variants are the following:

Tau-a: τ_{a} = \frac{n_{c} - n_{d}}{n_{0}} (no adjustment for ties),
Tau-b: τ_{b} = \frac{n_{c} - n_{d}}{\sqrt{(n_{0} - n_{1}) (n_{0} - n_{2})}} (adjusts for ties in both rankings).

(25)

Tau-a assumes no ties and may underestimate correlation when ties occur. Tau-b, which corrects for ties, is better suited for datasets with equivalent rankings.

In our implementation, we use scipy.stats.kendalltau with its default variant='b', which computes $τ_{b}$ efficiently and handles ties appropriately. The coefficient ranges from $- 1$ (perfect disagreement) to $+ 1$ (perfect agreement), with $0$ indicating no association. This metric provides a robust, distribution-free measure for comparing model performance rankings, particularly when ties reflect meaningful equivalences.
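For readers who want to see the pair counting spelled out, the sketch below computes both variants in pure Python from the definitions above. It is an illustrative reference implementation with quadratic cost, not the SciPy routine, and it assumes neither ranking is constant (otherwise the $τ_{b}$ denominator is zero).

```python
from itertools import combinations
from math import sqrt

def kendall_tau(x, y):
    """Return (tau_a, tau_b) for two equally long score or rank lists."""
    n_c = n_d = n_1 = n_2 = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            n_1 += 1                       # pair tied in the first ranking
        if dy == 0:
            n_2 += 1                       # pair tied in the second ranking
        if dx * dy > 0:
            n_c += 1                       # concordant
        elif dx * dy < 0:
            n_d += 1                       # discordant
    n_0 = len(x) * (len(x) - 1) // 2
    tau_a = (n_c - n_d) / n_0
    tau_b = (n_c - n_d) / sqrt((n_0 - n_1) * (n_0 - n_2))
    return tau_a, tau_b

no_ties = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
with_ties = kendall_tau([1, 2, 2, 3], [1, 2, 3, 3])
```

Without ties the two variants coincide; with ties, $τ_{b}$ is larger in magnitude because tied pairs are removed from the normalization.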

Convergence@ $n$ .

For a given bootstrap replicate, we measure convergence in terms of an exact ranking match. At each step $s \in {1, \dots, N_{max}}$ , we compute the ranking induced by the first $s$ trials and compare it to a gold-standard ranking (obtained from all $N_{max}$ trials). We then define

s^{⋆} = \min \{s \leq N_{max} - 1 : the ranking after s trials matches the gold-standard ranking, and remains unchanged after every subsequent trial\},

and refer to $s^{⋆}$ as the convergence@ $n$ value for that replicate. If no such $s^{⋆} \leq N_{max} - 1$ exists, we declare that replicate to exhibit no convergence.
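The definition translates into a short backward scan over the sequence of rankings. In the sketch below, `rankings[s - 1]` is assumed to hold the ranking induced by the first `s` trials, so the last entry is the gold standard; the function name and the list-of-lists representation are illustrative choices.

```python
def convergence_at_n(rankings):
    """Smallest s <= N_max - 1 from which the ranking equals the gold ranking
    at every later step; None signals no convergence."""
    gold = rankings[-1]
    s_star = None
    # Walk backwards from N_max - 1 and stop at the first mismatch.
    for s in range(len(rankings) - 1, 0, -1):
        if rankings[s - 1] != gold:
            break
        s_star = s
    return s_star

stable = convergence_at_n([
    ["A", "B", "C"],   # after 1 trial: matches gold, but changes at the next step
    ["B", "A", "C"],   # after 2 trials
    ["A", "B", "C"],   # after 3 trials: matches and never changes again
    ["A", "B", "C"],
    ["A", "B", "C"],   # gold standard (all trials)
])
unstable = convergence_at_n([["A", "B"], ["B", "A"], ["A", "B"]])
```

Note that an early match does not count if the ranking later deviates, which is why the scan runs from the end.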

Models and Datasets

Datasets.

We evaluate on four math-reasoning test sets: AIME'24 [25], AIME'25 [26], BrUMO'25 [37], and HMMT'25 [36]. AIME is administered by the Mathematical Association of America and consists of two sets of 15 integer-answer problems; we use the 2024 and 2025 problem sets. For HMMT'25, we use the officially posted February 2025 contest set (algebra, geometry, number theory, and combinatorics). For BrUMO'25, we use the published 2025 problem sets from the tournament archive.

Models.

Unless noted otherwise, we run each generator with the provider-recommended chat template (DeepSeek/Qwen style when unspecified) and identical decoding settings (below) to minimize template-induced variance. The base model cohort includes 11 models (8 distinct models + 3 modes (low, medium, and high) of gpt-oss) as follows: Sky-T1-32B-Flash Sky-T1-32B-Flash [76] (reasoning-optimized “flash” variant tied to overthinking-reduction work), Qwen Qwen3-30B-A3B-Thinking-2507 [77] (Qwen3 series, reasoning variant), DeepSeek DeepSeek-R1-Distill-Qwen-1.5B [46] (distilled reasoning model), gpt-oss gpt-oss-20b [78] (OpenAI open-weight reasoning model; we use the default quantization, MXFP4, and, for prompting, rely on OpenAI Harmony, which defines three levels of reasoning effort), LIMO LIMO-v2 [79] (data-efficient reasoning fine-tuned on curated traces), EXAONE EXAONE-4.0-1.2B [80] (hybrid non-reasoning/reasoning modes), NVIDIA OpenReasoning-Nemotron-1.5B [81, 82, 83, 84] (open-weight small reasoning model), OpenThinker OpenThinker2-32B [85] and OpenThinker OpenThinker3-1.5B [85] (trained on OpenThoughts2/3 data recipes).

To investigate how the number of models being compared affects the number of trials required to reach a stable ranking, with and without credible intervals, we extend the evaluation beyond the 11 models above to 20 models in total (17 + 3): Microsoft Phi-4-reasoning and Microsoft Phi-4-reasoning-plus [86] (14B small language models with supervised “teachable” reasoning traces and an RL-enhanced variant), OpenR1 OpenR1-Distill-7B [87] (an open 7B distillation of DeepSeek-R1 using fully public data), FuseO1 FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview [88] (System-II “long-short” reasoning fusion of DeepSeek-R1, QwQ, and Sky-T1-32B-Flash), Light-R1 Light-R1-14B-DS [89] (a Qwen2.5-based long-chain-of-thought model further improved with GRPO-style reinforcement learning), NVIDIA AceReason-Nemotron-1.1-7B [90] (7B NVIDIA Nemotron math/code model trained on OpenMathReasoning/OpenCodeReasoning data), NVIDIA NVIDIA-Nemotron-Nano-9B-v2 [91] (a hybrid Mamba-Transformer “Nano 2” model with controllable reasoning mode), Qwen Qwen3-4B-Thinking-2507 [77] (4B “thinking” variant of Qwen3 with scaled reasoning depth), and Bespoke-Stratos Bespoke-Stratos-7B [92] (Qwen2.5-7B student obtained via DeepSeek-R1-based reasoning distillation on Bespoke-Stratos-17k).

For verification, we additionally use CompassVerifier CompassVerifier-3B [45], a lightweight answer verifier suitable for outcome reward and equivalence checking.

Table 7. Mapping between model IDs, full model names, and the shortened names used in figures and legends. Corresponding subsets are listed in Tables ?, ?, and ?.

ID	Model	Short name
1	DeepSeek DeepSeek-R1-Distill-Qwen-1.5B	DS-R1-Qwen
2	LIMO LIMO-v2	LIMO-v2
3	OpenThinker OpenThinker2-32B	OpenThinker2
4	OpenThinker OpenThinker3-1.5B	OpenThinker3
5	Qwen Qwen3-30B-A3B-Thinking-2507	Qwen3-Thinking
6	Sky-T1-32B-Flash Sky-T1-32B-Flash	Sky-T1-Flash
7	gpt-oss gpt-oss-20b_high	gpt-oss-high
8	gpt-oss gpt-oss-20b_low	gpt-oss-low
9	gpt-oss gpt-oss-20b_medium	gpt-oss-medium
10	EXAONE EXAONE-4.0-1.2B	EXAONE-4.0
11	NVIDIA OpenReasoning-Nemotron-1.5B	OR-Nemotron
12	Microsoft Phi-4-reasoning	Phi-4
13	Microsoft Phi-4-reasoning-plus	Phi-4-plus
14	OpenR1 OpenR1-Distill-7B	OR1-Distill
15	FuseO1 FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview	FuseO1-DS-QwQ-SkyT1
16	Light-R1 Light-R1-14B-DS	Light-R1-DS
17	NVIDIA AceReason-Nemotron-1.1-7B	AR-Nemotron
18	NVIDIA NVIDIA-Nemotron-Nano-9B-v2	NVIDIA-Nemotron
19	Qwen Qwen3-4B-Thinking-2507	Qwen3-4B
20	Bespoke-Stratos Bespoke-Stratos-7B	Bespoke

Prompting.

For most models, we follow the provider-recommended DeepSeek/Qwen-style prompt: "Please reason step by step, and put your final answer within \boxed{}." For gpt-oss gpt-oss-20b, we instead use the OpenAI Harmony prompt template, which provides three levels of reasoning effort. For NVIDIA OpenReasoning-Nemotron-1.5B, we adopt the task-specific prompt: "Solve the following math problem. Make sure to put the answer (and only the answer) inside \boxed{}."

Reproducibility

Sampling setup. All trials use top- $p$ sampling with temperature $0.6$ , $p = 0.95$ , batch size $1$ , and seeds $1234$ - $1313$ . We perform $N = 80$ trials per dataset $\times$ model.

Verifier. We use CompassVerifier CompassVerifier-3B as a reward model. During evaluation, we leverage the model's scores on prompts generated by other models to create categorical schemas. We rely on the Transformers [66] and Accelerate [93] libraries. To maximize throughput, we enable FlashAttention kernels [23] and adopt the DFloat11 format [94].

Serving stack. Token generation is served with vLLM (PagedAttention) [16], and models are loaded in bf16 unless the release requires MXFP4 (e.g., gpt-oss). We record log-probabilities for both the input prompt and generated tokens, and cap max_tokens at $32{,}768$ .

Hardware. All runs execute on clusters with $8 \times$ NVIDIA H200 (141GB).
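The decoding settings above can be collected into one record per trial. The sketch below is illustrative only: the dictionary keys are hypothetical names rather than the arguments of any particular serving library, and the values are those stated in the text.

```python
# Seeds 1234..1313 inclusive give one seed per trial.
SEEDS = list(range(1234, 1314))

def trial_configs():
    """One hypothetical decoding configuration per trial."""
    return [
        {
            "seed": seed,
            "temperature": 0.6,
            "top_p": 0.95,
            "max_tokens": 32768,
            "batch_size": 1,
        }
        for seed in SEEDS
    ]

configs = trial_configs()
num_trials = len(configs)   # matches N = 80 trials per dataset and model
```

Fixing one seed per trial keeps the 80 trials distinct while making each individual run repeatable.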

Computational Cost and Token Statistics

Across all tasks, we evaluated 20 models with 80 trials per model and 30 questions per benchmark, yielding a total of 192,000 independent inference runs. This required 7,445 GPU-hours ( $\sim$ 310 GPU-days) and generated 2.96B tokens (2,963,318,176) in total (see Figure 7 for details).

Task-level computational cost.

Table 8. Task-level computational cost aggregated over 20 models, 80 trials, 4 tasks, and 30 questions per task. Token counts correspond to completion tokens only.

Task	Inference Time (hours)	Completion Tokens (M)
AIME'24	1,699.4	680.0
AIME'25	1,878.4	728.3
HMMT'25	2,216.5	851.2
BrUMO'25	1,650.9	666.9
TOTAL	7,445.2	2,926.4

HMMT'25 is the most expensive benchmark in terms of GPU time (2,217 GPU-hours), while BrUMO'25 is the least expensive (1,651 GPU-hours). Figure 7 provides a complementary visualization of these patterns, showing inference time and completion-token usage across models and tasks.

Token breakdown.

Aggregating across all tasks and models, the total number of tokens (prompt + completion) is 2.96B. The breakdown is:

Prompt tokens: 37M (1.2%)
Completion tokens: 2.93B (98.8%)
Average per query: 15,434 tokens

GPU-hours by model efficiency.

The 20 model configurations varied substantially in computational efficiency:

Most efficient: gpt-oss-20b-low (48.4 GPU-hours for 9,600 queries)
Least efficient: LIMO-v2 (894.3 GPU-hours for 9,600 queries)
Average per query over all models: 139.6 seconds ( $\sim$ 2.3 minutes)
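The headline figures in this section follow from a few lines of arithmetic, reproduced here as a sanity check. All inputs are the values reported above and in Table 8; the variable names are illustrative.

```python
models, trials, questions, tasks = 20, 80, 30, 4

runs = models * trials * questions * tasks           # total inference runs
queries_per_model = trials * questions * tasks        # runs per model configuration

gpu_hours = 1699.4 + 1878.4 + 2216.5 + 1650.9         # Table 8, inference time per task
total_tokens = 2_963_318_176                          # prompt + completion tokens

gpu_days = gpu_hours / 24
seconds_per_query = gpu_hours * 3600 / runs
tokens_per_query = total_tokens / runs

print(f"runs={runs}, GPU-hours={gpu_hours:.1f}, GPU-days={gpu_days:.0f}")
print(f"per query: {seconds_per_query:.1f} s, {tokens_per_query:.0f} tokens")
```

The per-query averages hide large differences between models, as the most and least efficient configurations listed above show.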

**Figure 7.** **Computational cost analysis.** (**Left**) Total inference time in hours aggregated over 80 trials and 30 questions per benchmark (2,400 inference runs per cell). (**Right**) Total number of completion tokens (in thousands) generated across the same runs. Models are ordered by overall performance (best to worst, top to bottom).

Convergence

While Figure 1 shows the probability mass function (PMF) of convergence@ $n$ , Figure 11 shows the corresponding cumulative distribution functions (CDFs), computed from the same bootstrap replicates. Pass@ $4$ and Pass@ $8$ do not converge, so the figure contains no CDFs for them. The distribution of convergence@ $n$ is computed using the result matrices $R$ from the first 11 models (Table 7). Among the $10^{5}$ replicates, Figure 8 shows the worst-case scenarios in which convergence@ $n$ attains its maximum value. As discussed in Section 3.3.1, convergence@ $n$ depends on the number of models $L$ : as $L$ increases, convergence@ $n$ grows. When we extend the pool of LLMs from 11 to 20 models, the worst-case replicate fails to converge on every dataset (see Figure 9).

**Figure 9.** **Worst-case bootstrap rank trajectories**. Each line shows the ranking of a model as trials are added (20 models in total). Convergence is defined as the minimal $N$ after which the ranking remains unchanged. There is at least one *no convergence* replicate among the $10^{5}$ bootstrapped replications.

**Figure 10.** Sensitivity of model rankings to the categorical scoring schema. For each schema variant (x-axis; see Table 5), models are assigned a competition rank (y-axis; 1 = best). Colored trajectories track each model’s rank as the rubric changes, highlighting rank stability and crossovers.

**Figure 11.** CDF of convergence@ $n$ . Complementing the PMFs in Figure 1, these CDFs plot $P (k \leq n)$ for the convergence threshold $k$ across AIME'24, AIME'25, HMMT'25, and BrUMO'25. Steeper and earlier rises indicate faster convergence. Bayes@ $N$ accumulates mass with fewer trials than Pass@2/4/8, and on AIME'24/'25 the Pass curves do not reach 1 by $N_{max} = 80$ . Greater convergence suggests that credible intervals should be reported for the evaluation tasks.

To complement the worst-case trajectories discussed in Section 3.3.1 and shown in Figures 8 and 9, we provide additional details on the construction of the model subsets and the resulting convergence behavior. Table 7 lists the pool of 20 LLMs used in this analysis, together with the shortened identifiers that appear throughout the figures and tables. From this pool we construct 50 subsets of 5 models, 20 subsets of 10 models, and 20 subsets of 15 models, as summarized in Tables ?, ?, and ?. Each row in these tables corresponds to one subset, indicating which models are included and reporting, for each task, the convergence@ $n$ metric computed without a credible interval; each entry is the mean over $10^{5}$ bootstrap replicates. The tables thus make explicit how convergence@ $n$ depends not only on the task but also on the particular mixture of models being compared. Aggregating across all subsets and replicates, Figure 6 visualizes the distribution of convergence@ $n$ as a function of the number of models $L$ , confirming the trend anticipated in the main text. As $L$ grows from 5 to 15 and ultimately to the full set of 20 LLMs, the required number of trials increases and non-convergence becomes common. This indicates that rank-based evaluation with avg@ $N$ and the Pass@ $k$ family becomes increasingly unreliable without accompanying Bayesian uncertainty quantification such as that of Bayes@ $N$ .


References

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Lukasz, Polosukhin, Illia. Attention Is All You Need. Advances in Neural Information Processing Systems. 2017. https://arxiv.org/abs/1706.03762
Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D., Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, Agarwal, Sandhini, Herbert-Voss, Ariel, Krueger, Gretchen, Henighan, Tom, Child, Rewon, Ramesh, Aditya, Ziegler, Daniel, Wu, Jeffrey, Winter, Clemens, Hesse, Chris, Chen, Mark, Sigler, Eric, Litwin, Mateusz, Gray, Scott, Chess, Benjamin, Clark, Jack, Berner, Christopher, McCandlish, Sam, Radford, Alec, Sutskever, Ilya, Amodei, Dario. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems. 2020. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Stack Overflow. Stack Overflow Developer Survey 2025: AI and Developer Tools. 2025. https://survey.stackoverflow.co/2025/ai
Maslej, Nestor, Fattorini, Loredana, Perrault, Raymond, Gil, Yolanda, Parli, Vanessa, Kariuki, Njenga, Capstick, Emily, Reuel, Anka, Brynjolfsson, Erik, Etchemendy, John, Ligett, Katrina, Lyons, Terah, Manyika, James, Niebles, Juan Carlos, Shoham, Yoav, Wald, Russell, Walsh, Toby, Hamrah, Armin, Santarlasci, Lapo, Betts Lotufo, Julia, Rome, Alexandra, Shi, Andrew, Oak, Sukrut. Artificial Intelligence Index Report 2025. arXiv preprint arXiv:2504.07139. 2025. https://arxiv.org/abs/2504.07139
Liang, Percy, Bommasani, Rishi, Lee, Tony, Tsipras, Dimitris, Soylu, Dilara, Yasunaga, Michihiro, Zhang, Yian, Narayanan, Deepak, Wu, Yuhuai, Kumar, Ananya, others. Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110. 2022. https://arxiv.org/abs/2211.09110
Hendrycks, Dan, Burns, Collin, Basart, Steven, Zou, Andy, Mazeika, Mantas, Song, Dawn, Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations (ICLR). 2021. https://arxiv.org/abs/2009.03300
Srivastava, Aarohi, Rastogi, Abhinav, Rao, Abhishek, Shoeb, Abu Awal Md, Abid, Abubakar, Fisch, Adam, Brown, Adam R., Santoro, Adam, Gupta, Aditya, Garriga-Alonso, Adrià, others. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-bench). arXiv preprint arXiv:2206.04615. 2022. https://arxiv.org/abs/2206.04615
Kaplan, Jared, McCandlish, Sam, Henighan, Tom, Brown, Tom B., Chess, Benjamin, Child, Rewon, Gray, Scott, Radford, Alec, Wu, Jeffrey, Amodei, Dario. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. 2020. https://arxiv.org/abs/2001.08361
Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, Buchatskaya, Elena, Cai, Trevor, Rutherford, Eliza, de Las Casas, Diego, Hendricks, Lisa Anne, Welbl, Johannes, Clark, Aidan, Hennigan, Tom, Noland, Eric, Millican, Katie, van den Driessche, George, Damoc, Bogdan, Guy, Aurelia, Osindero, Simon, Simonyan, Karen, Elsen, Erich, Rae, Jack W., Vinyals, Oriol, Sifre, Laurent. Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556. 2022. https://arxiv.org/abs/2203.15556
Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Ichter, Brian, Xia, Fei, Chi, Ed H., Le, Quoc V., Zhou, Denny. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. 2022. https://openreview.net/forum?id=_VjQlMeSB_J
Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, Schulman, John, Hilton, Jacob, Kelton, Fraser, Miller, Luke, Simens, Maddie, Askell, Amanda, Welinder, Peter, Christiano, Paul F., Leike, Jan, Lowe, Ryan. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
Dettmers, Tim, Lewis, Mike, Belkada, Younes, Zettlemoyer, Luke. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv preprint arXiv:2208.07339. 2022. https://arxiv.org/abs/2208.07339
Frantar, Elias, Ashkboos, Saleh, Hoefler, Torsten, Alistarh, Dan. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323. 2022. https://arxiv.org/abs/2210.17323
Han, Song, Pool, Jeff, Tran, John, Dally, William. Learning both Weights and Connections for Efficient Neural Networks. Advances in Neural Information Processing Systems. 2015. https://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network
Hinton, Geoffrey, Vinyals, Oriol, Dean, Jeff. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531. 2015. https://arxiv.org/abs/1503.02531
Kwon, Woosuk, Li, Zhuohan, Zhuang, Siyuan, Sheng, Ying, Zheng, Lianmin, Yu, Cody Hao, Gonzalez, Joseph, Zhang, Hao, Stoica, Ion. Efficient memory management for large language model serving with pagedattention. Proceedings of the 29th symposium on operating systems principles. 2023. https://arxiv.org/abs/2309.06180
Zhang, Tianyi, Yi, Jonah, Xu, Zhaozhuo, Shrivastava, Anshumali. KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization. Advances in Neural Information Processing Systems. 2024. https://proceedings.neurips.cc/paper%5Ffiles/paper/2024/file/05d6b5b6901fb57d2c287e1d3ce6d63c-Paper-Conference.pdf
Zhang, Hailin, Ji, Xiaodong, Chen, Yilin, Fu, Fangcheng, Miao, Xupeng, Nie, Xiaonan, Chen, Weipeng, Cui, Bin. PQCache: Product Quantization-based KVCache for Long Context LLM Inference. Proc. ACM Manag. Data. 2025. https://doi.org/10.1145/3725338
Hariri, Mohsen, Luo, Alan, Chen, Weicong, Zhong, Shaochen, Zhang, Tianyi, Wang, Qifan, Hu, Xia, Han, Xiaotian, Chaudhary, Vipin. Quantize What Counts: More for Keys, Less for Values. The 64th Annual Meeting of the Association for Computational Linguistics (Findings). 2026. https://openreview.net/forum?id=vMIlB97WV1
Hu, Edward J., Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. 2021. https://arxiv.org/abs/2106.09685
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. Deep Reinforcement Learning from Human Preferences. NeurIPS. 2017. https://arxiv.org/abs/1706.03741
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi. The Curious Case of Neural Text Degeneration. ICLR. 2020. https://openreview.net/forum?id=rygGQyrFvH
Dao, Tri. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691. 2023. https://arxiv.org/abs/2307.08691
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. 2021. https://arxiv.org/abs/2107.03374
Mathematical Association of America. American Invitational Mathematics Examination (AIME). 2024. https://maa.org/maa-invitational-competitions/
Mathematical Association of America. American Invitational Mathematics Examination (AIME). 2025. https://maa.org/maa-invitational-competitions/
Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen. Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. 2024. https://arxiv.org/abs/2412.13147
Hochlehnert, Andreas, Bhatnagar, Hardik, Udandarao, Vishaal, Albanie, Samuel, Prabhu, Ameya, Bethge, Matthias. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility. arXiv preprint arXiv:2504.07086. 2025. https://arxiv.org/abs/2504.07086
Dror, Rotem, Baumer, Gili, Shlomov, Segev, Reichart, Roi. The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. https://aclanthology.org/P18-1128/
Yeh, Alexander. More Accurate Tests for the Statistical Significance of Result Differences. COLING. 2000. https://aclanthology.org/C00-2137/
Dodge, Jesse, Gururangan, Suchin, Card, Dallas, Schwartz, Roy, Smith, Noah A. Show Your Work: Improved Reporting of Experimental Results. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. https://aclanthology.org/D19-1224/
Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric P., Zhang, Hao, Gonzalez, Joseph E., Stoica, Ion. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685. 2023. https://arxiv.org/abs/2306.05685
Chen, Guiming Hardy, Chen, Shunian, Liu, Ziche, Jiang, Feng, Wang, Benyou. Humans or LLMs as the Judge? A Study on Judgement Bias. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. https://aclanthology.org/2024.emnlp-main.474/
Xiao, Xiao, Su, Yu, Zhang, Sijing, Chen, Zhang, Chen, Yadong, Liu, Tian. Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges. arXiv preprint arXiv:2504.21303. 2025. https://arxiv.org/abs/2504.21303
Hayden, Dustin, Armitage, Thomas. Straightforward Bayesian A/B testing with Dirichlet posteriors. arXiv preprint arXiv:2508.08077. 2025. https://arxiv.org/abs/2508.08077
Harvard-MIT Mathematics Tournament. HMMT February 2025 Archive (Problems and Solutions). 2025. https://www.hmmt.org/www/archive/282
Brown University Math Olympiad Organizers. Brown University Math Olympiad (BrUMO). 2025. https://www.brumo.org/tournament-info
Dalal, Uri, Segal, Meirav, Ben-Haim, Zvika, Lahav, Dan, Nevo, Omer. Leveraging LLM Inconsistency to Boost Pass@k Performance. arXiv preprint arXiv:2505.12938. 2025. https://arxiv.org/abs/2505.12938
Ross, Brendan Leigh, Vouitsis, Noël, Ghomi, Atiyeh Ashari, Hosseinzadeh, Rasa, Xin, Ji, Liu, Zhaoyan, Sui, Yi, Hou, Shiyi, Leung, Kin Kwan, Loaiza-Ganem, Gabriel, Cresswell, Jesse C. Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems. arXiv preprint arXiv:2506.10060. 2025. https://arxiv.org/abs/2506.10060
Vashurin, Roman, Goloburda, Maiya, Ilina, Albina, Rubashevskii, Aleksandr, Nakov, Preslav, Shelmanov, Artem, Panov, Maxim. Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency. arXiv preprint arXiv:2502.04964. 2025. https://arxiv.org/abs/2502.04964
Jaynes, Edwin T. Probability Theory: The Logic of Science. Cambridge University Press. 2003. https://doi.org/10.1017/CBO9780511790423
Bowyer, Sam, Aitchison, Laurence, Ivanova, Desi R. Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints. arXiv preprint arXiv:2503.01747. 2025. https://arxiv.org/abs/2503.01747
Hariri, Mohsen, Hinczewski, Michael, Ma, Jing, Chaudhary, Vipin. Ranking Reasoning LLMs under Test-Time Scaling. The 64th Annual Meeting of the Association for Computational Linguistics. 2026. https://openreview.net/forum?id=DjRkQvirQL
Chen, Xingyu, Xu, Jiahao, Liang, Tian, He, Zhiwei, Pang, Jianhui, Yu, Dian, Song, Linfeng, Liu, Qiuzhi, Zhou, Mengfei, Zhang, Zhuosheng, Wang, Rui, Tu, Zhaopeng, Mi, Haitao, Yu, Dong. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. arXiv preprint arXiv:2412.21187. 2024. https://arxiv.org/abs/2412.21187
Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen. CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward. arXiv preprint arXiv:2508.03686. 2025. https://arxiv.org/abs/2508.03686
Guo, Daya, Yang, Dejian, Zhang, Haowei, Song, Junxiao, Zhang, Ruoyu, Xu, Runxin, Zhu, Qihao, Ma, Shirong, Wang, Peiyi, Bi, Xiao, others. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. 2025. https://arxiv.org/abs/2501.12948
Shao, Zhihong, Wang, Peiyi, Zhu, Qihao, Xu, Runxin, Song, Junxiao, Bi, Xiao, Zhang, Haowei, Zhang, Mingchuan, Li, Y. K., Wu, Y., Guo, Daya. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. 2024. https://arxiv.org/abs/2402.03300
Tong, Yuxuan, Zhang, Xiwen, Wang, Rui, Wu, Ruidong, He, Junxian. DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving. Advances in Neural Information Processing Systems. 2024. https://proceedings.neurips.cc/paper%5Ffiles/paper/2024/file/0ef1afa0daa888d695dcd5e9513bafa3-Paper-Conference.pdf
Liu, Bingbin, Bubeck, Sebastien, Eldan, Ronen, Kulkarni, Janardhan, Li, Yuanzhi, Nguyen, Anh, Ward, Rachel, Zhang, Yi. TinyGSM: Achieving 80% on GSM8K with Small Language Models. arXiv preprint arXiv:2312.09241. 2023. https://arxiv.org/abs/2312.09241
Hwang, Hyeonbin, Kim, Doyoung, Kim, Seungone, Ye, Seonghyeon, Seo, Minjoon. Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. https://aclanthology.org/2024.findings-emnlp.78/
Yang, Yuqing, Ma, Yan, Liu, Pengfei. Weak-to-Strong Reasoning. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. https://aclanthology.org/2024.findings-emnlp.490/
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393. 2025. https://arxiv.org/abs/2501.19393
Chen, Feng, Raventós, Allan, Cheng, Nan, Ganguli, Surya, Druckmann, Shaul. Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning. arXiv preprint arXiv:2502.07154. 2025. https://arxiv.org/abs/2502.07154
Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun. EXAONE Deep: Reasoning Enhanced Language Models. arXiv preprint arXiv:2503.12524. 2025. https://arxiv.org/abs/2503.12524
Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby Tavor. Effective Red-Teaming of Policy-Adherent Agents. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. https://aclanthology.org/2025.emnlp-main.114/
Hojjat Aghakhani, Wei Dai, Andre Manoel, Xavier Fernandes, Anant Kharkar, Christopher Kruegel, Giovanni Vigna, David Evans, Ben Zorn, Robert Sim. TrojanPuzzle: Covertly Poisoning Code-Suggestion Models. IEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19 - 23, 2024. 2024. https://doi.org/10.1109/SP54263.2024.00140
Liu, Hongyi, Zhong, Shaochen, Sun, Xintong, Tian, Minghao, Hariri, Mohsen, Liu, Zirui, Tang, Ruixiang, Jiang, Zhimeng, Yuan, Jiayi, Chuang, Yu-Neng, Li, Li, Choi, Soo-Hyun, Chen, Rui, Chaudhary, Vipin, Hu, Xia. LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem. arXiv preprint arXiv:2403.00108. 2024. https://arxiv.org/abs/2403.00108
Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, Yuan Hong. An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection. 33rd USENIX Security Symposium (USENIX Security 24). 2024. https://www.usenix.org/conference/usenixsecurity24/presentation/yan
Lakshmi Likhitha Mankali, Jitendra Bhandari, Manaar Alam, Ramesh Karri, Michail Maniatakos, Ozgur Sinanoglu, Johann Knechtel. RTL-Breaker: Assessing the Security of LLMs against Backdoor Attacks on HDL Code Generation. arXiv preprint arXiv:2411.17569. 2024. https://arxiv.org/abs/2411.17569
Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo. How Do Large Language Monkeys Get Their Power (Laws)? Proceedings of the 42nd International Conference on Machine Learning. 2025. https://proceedings.mlr.press/v267/schaeffer25a.html
Yao, Shunyu, Shinn, Noah, Razavi, Pedram, Narasimhan, Karthik. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. 2024. https://doi.org/10.48550/arXiv.2406.12045
Biderman, Stella, Schoelkopf, Hailey, Sutawika, Lintang, Gao, Leo, Tow, Jonathan, Abbasi, Baber, Aji, Alham Fikri, Ammanamanchi, Pawan Sasanka, Black, Sidney, Clive, Jordan, DiPofi, Anthony, Etxaniz, Julen, Fattori, Benjamin, Forde, Jessica Zosa, Foster, Charles, Hsu, Jeffrey, Jaiswal, Mimansa, Lee, Wilson Y., Li, Haonan, Lovering, Charles, Muennighoff, Niklas, Pavlick, Ellie, Phang, Jason, Skowron, Aviya, Tan, Samson, Tang, Xiangru, Wang, Kevin A., Winata, Genta Indra, Yvon, François, Zou, Andy. Lessons from the Trenches on Reproducible Evaluation of Language Models. arXiv preprint arXiv:2405.14782. 2024. https://arxiv.org/abs/2405.14782
Vossler, Patrick, Xia, Fan, Mai, Yifan, Subbaswamy, Adarsh, Feng, Jean. LLMs Judging LLMs: A Simplex Perspective. arXiv preprint arXiv:2505.21972. 2025. https://arxiv.org/abs/2505.21972
Angelopoulos, Anastasios N, Bates, Stephen, Fannjiang, Clara, Jordan, Michael I, Zrnic, Tijana. Prediction-powered inference. Science. 2023. https://www.science.org/doi/10.1126/science.adi6000
Oosterhuis, Harrie, Jagerman, Rolf, Qin, Zhen, Wang, Xuanhui, Bendersky, Michael. Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. https://doi.org/10.1145/3637528.3671883
Wolf, Thomas, Debut, Lysandre, Sanh, Victor, Chaumond, Julien, Delangue, Clement, Moi, Anthony, Cistac, Pierric, Rault, Tim, Louf, Rémi, Funtowicz, Morgan, Davison, Joe, Shleifer, Sam, von Platen, Patrick, Ma, Clara, Jernite, Yacine, Plu, Julien, Xu, Canwen, Le Scao, Teven, Gugger, Sylvain, Drame, Mariama, Lhoest, Quentin, Rush, Alexander M. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771. 2019. https://arxiv.org/abs/1910.03771
Brown, Bradley, Juravsky, Jordan, Ehrlich, Ryan, Clark, Ronald, Le, Quoc V, Ré, Christopher, Mirhoseini, Azalia. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787. 2024. https://arxiv.org/abs/2407.21787
Papineni, Kishore, Roukos, Salim, Ward, Todd, Zhu, Wei-Jing. Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. https://aclanthology.org/P02-1040/
Ren, Shuo, Guo, Daya, Lu, Shuai, Zhou, Long, Liu, Shujie, Tang, Duyu, Sundaresan, Neel, Zhou, Ming, Blanco, Ambrosio, Ma, Shuai. CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. arXiv preprint arXiv:2009.10297. 2020. https://arxiv.org/abs/2009.10297
Kulal, Sumith, Pasupat, Panupong, Chandra, Kartik, Lee, Mina, Padon, Oded, Aiken, Alex, Liang, Percy S. SPoC: Search-based Pseudocode to Code. Advances in Neural Information Processing Systems. 2019. https://arxiv.org/abs/1906.04908
Hendrycks, Dan, Burns, Collin, Kadavath, Saurav, Arora, Akul, Basart, Steven, Tang, Eric, Song, Dawn, Steinhardt, Jacob. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv preprint arXiv:2103.03874. 2021. https://arxiv.org/abs/2103.03874
Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Lukasz, Plappert, Matthias, Tworek, Jerry, Hilton, Jacob, Nakano, Reiichiro, Hesse, Christopher, Schulman, John. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168. 2021. https://arxiv.org/abs/2110.14168
Wang, Xuezhi, Wei, Jason, Schuurmans, Dale, Le, Quoc, Chi, Ed, Narang, Sharan, Chowdhery, Aakanksha, Zhou, Denny. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171. 2022. https://arxiv.org/abs/2203.11171
Lewkowycz, Aitor, Andreassen, Anders, Dohan, David, Dyer, Ethan, Michalewski, Henryk, Ramasesh, Vinay, Slone, Ambrose, Anil, Cem, Schlag, Imanol, Gutman-Solo, Theo, Wu, Yuhuai, Neyshabur, Behnam, Gur-Ari, Guy, Misra, Vedant. Solving Quantitative Reasoning Problems with Language Models. Advances in Neural Information Processing Systems. 2022. https://proceedings.neurips.cc/paper%5Ffiles/paper/2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf
Kendall, M. G. A New Measure of Rank Correlation. Biometrika. 1938. https://doi.org/10.1093/biomet/30.1-2.81
NovaSky Team. Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy. 2025. https://novasky-ai.github.io/posts/reduce-overthinking
Yang, An, Li, Anfeng, Yang, Baosong, Zhang, Beichen, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Gao, Chang, Huang, Chengen, Lv, Chenxu, Zheng, Chujie, Liu, Dayiheng, Zhou, Fan, Huang, Fei, Hu, Feng, Ge, Hao, Wei, Haoran, Lin, Huan, Tang, Jialong, Yang, Jian, Tu, Jianhong, Zhang, Jianwei, Yang, Jianxin, Yang, Jiaxi, Zhou, Jing, Zhou, Jingren, Lin, Junyang, Dang, Kai, Bao, Keqin, Yang, Kexin, Yu, Le, Deng, Lianghao, Li, Mei, Xue, Mingfeng, Li, Mingze, Zhang, Pei, Wang, Peng, Zhu, Qin, Men, Rui, Gao, Ruize, Liu, Shixuan, Luo, Shuang, Li, Tianhao, Tang, Tianyi, Yin, Wenbiao, Ren, Xingzhang, Wang, Xinyu, Zhang, Xinyu, Ren, Xuancheng, Fan, Yang, Su, Yang, Zhang, Yichang, Zhang, Yinger, Wan, Yu, Liu, Yuqiong, Wang, Zekun, Cui, Zeyu, Zhang, Zhenru, Zhou, Zhipeng, Qiu, Zihan. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388. 2025. https://arxiv.org/abs/2505.09388
OpenAI. gpt-oss-120b & gpt-oss-20b Model Card. arXiv preprint arXiv:2508.10925. 2025. https://arxiv.org/abs/2508.10925
Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu. LIMO: Less is More for Reasoning. arXiv preprint arXiv:2502.03387. 2025. https://arxiv.org/abs/2502.03387
Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun. EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes. arXiv preprint arXiv:2507.11407. 2025. https://arxiv.org/abs/2507.11407
Shubham Toshniwal, Ivan Sorokin, Aleksander Ficek, Ivan Moshkov, Igor Gitman. GenSelect: A Generative Approach to Best-of-N. 2nd AI for Math Workshop @ ICML 2025. 2025. https://openreview.net/forum?id=8LhnmNmUDb
Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, Igor Gitman. AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset. arXiv preprint arXiv:2504.16891. 2025. https://arxiv.org/abs/2504.16891
Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg. OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique. arXiv preprint arXiv:2507.09075. 2025. https://arxiv.org/abs/2507.09075
Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, Boris Ginsburg. OpenCodeReasoning: Advancing Data Distillation for Competitive Coding. arXiv preprint arXiv:2504.01943. 2025. https://arxiv.org/abs/2504.01943
Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt. OpenThoughts: Data Recipes for Reasoning Models. arXiv preprint arXiv:2506.04178. 2025. https://arxiv.org/abs/2506.04178
Abdin, Marah, Agarwal, Sahaj, Awadallah, Ahmed, Balachandran, Vidhisha, Behl, Harkirat, Chen, Lingjiao, de Rosa, Gustavo, Gunasekar, Suriya, Javaheripi, Mojan, Joshi, Neel, Kauffmann, Piero, Lara, Yash, Mendes, Caio César Teodoro, Mitra, Arindam, Nushi, Besmira, Papailiopoulos, Dimitris, Saarikivi, Olli, Shah, Shital, Shrivastava, Vaishnavi, Vineet, Vibhav, Wu, Yue, Yousefi, Safoora, Zheng, Guoqing. Phi-4-reasoning Technical Report. arXiv preprint arXiv:2504.21318. 2025. https://arxiv.org/abs/2504.21318
Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1. 2025. https://github.com/huggingface/open-r1
Wan, Fanqi, Zhong, Longguang, Yang, Ziyi, Chen, Ruijun, Quan, Xiaojun. FuseChat: Knowledge Fusion of Chat Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. https://aclanthology.org/2025.emnlp-main.1096/
Wen, Liang, Cai, Yunke, Xiao, Fenrui, He, Xin, An, Qi, Duan, Zhenyu, Du, Yimin, Liu, Junchen, Tang, Lifu, Lv, Xiaowei, Zou, Haosheng, Deng, Yongchao, Jia, Shousheng, Zhang, Xiangzheng. Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025. https://aclanthology.org/2025.acl-industry.24/
Liu, Zihan, Yang, Zhuolin, Chen, Yang, Lee, Chankyu, Shoeybi, Mohammad, Catanzaro, Bryan, Ping, Wei. AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy. arXiv preprint arXiv:2506.13284. 2025. https://arxiv.org/abs/2506.13284
Basant, Aarti, Khairnar, Abhijit, Paithankar, Abhijit, Khattar, Abhinav, Renduchintala, Adithya, Malte, Aditya, Bercovich, Akhiad, Hazare, Akshay, Rico, Alejandra, Ficek, Aleksander, others. NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model. arXiv preprint arXiv:2508.14444. 2025. https://arxiv.org/abs/2508.14444
Bespoke Labs. Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation. 2025. https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation
Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. 2022. https://github.com/huggingface/accelerate
Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava. 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float. Advances in Neural Information Processing Systems (NeurIPS 2025). 2025. https://arxiv.org/abs/2504.11651