Ranking Reasoning LLMs under Test-Time Scaling

TL;DR: Test-time scaling samples a reasoning model many times per question, so a leaderboard is no longer one score per model but a stack of $0/1$ outcomes, one per model–question–trial. We turned that stack into a response tensor and asked which of the many ways to rank models you should trust. Across $20$ models and four Olympiad-style math benchmarks at $N=80$ trials, $72$ ranking methods, spanning $\mathrm{Bayes}@N$, Bradley–Terry, Elo, IRT, voting rules, and PageRank, mostly agree: mean Kendall's $\tau_b$ with the accuracy-based gold standard is $0.93$–$0.95$, and $19$–$34$ of them reproduce the exact same order. The choice of method only starts to matter when the budget collapses to a single trial. There, a greedy-decode prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) cuts the standard deviation of $\tau_b$ by $16$–$52\%$, but can bias the order when greedy and sampled decoding disagree. All of it ships in scorio.

Full-trial rankings of many methods against the gold standard on an easy and a hard benchmark

The question left over from Bayes@N

The previous paper settled a scoring question: replace Pass@$k$ with a posterior estimate, $\mathrm{Bayes}@N$, and you get stable numbers and honest uncertainty from far fewer samples. But a score is not a leaderboard. The moment we had the full dataset in hand ($20$ reasoning models, four benchmarks, $80$ stochastic trials for every model–question pair), the open question changed shape. It was no longer "how do I score one model?" It was "how do I order twenty of them, and does it even matter which method I use?"

That question has an embarrassing number of answers. Statistics has been ranking things for two centuries: Bradley–Terry and Elo for head-to-head games, Borda and Copeland for votes, PageRank and HodgeRank for graphs, item response theory for tests. Each was designed for a different world, and each will happily produce a leaderboard from the same data. I expected them to disagree, and to spend the paper arguing about which one is right.

They mostly don't disagree. When you feed all $80$ trials to $72$ different ranking methods, the orderings pile up on top of each other. On the easier benchmark, several methods land on exactly the same order (the left panel of the figure above). The interesting behavior is not at full budget. It shows up when you can only afford one trial per question, which is exactly the regime a lab actually lives in.

Use it in practice

scorio installs with `pip install scorio`. It scores an outcome matrix (the Bayes@N side) and, new here, ranks models directly from a response tensor. Every ranking method takes the same tensor $\mathbf{R}$ of shape `(L, M, N)` and returns a $1$-indexed ranking where lower is better.

import numpy as np
from scorio import rank
 
# Response tensor: L=20 models, M=30 questions, N=80 trials, entries in {0, 1}
R = np.load("responses.npy")            # shape (20, 30, 80)
 
# Accuracy-based ranking (the gold-standard target)
ranking = rank.avg(R)
 
# Bayesian posterior-mean ranking (order-equivalent to avg@N under a uniform prior)
ranking, scores = rank.bayes(R, return_scores=True)

Swapping the method is a one-liner, which is what makes comparing them cheap:

Function	Family	What it computes
`rank.avg(R)`	pointwise	Mean-accuracy ranking (the gold-standard target)
`rank.bayes(R, R0=…, w=…, quantile=…)`	Bayesian metric	Posterior-mean or lower-bound ranking; optional greedy prior and categorical weights
`rank.bradley_terry(R)`	paired-comparison	Latent-strength ranking from pairwise win counts
`rank.elo(R)` / `rank.trueskill(R)`	rating	Sequential / Bayesian rating over the induced match stream
`rank.borda(R)` / `rank.copeland(R)`	voting	Treat each question as a voter and aggregate
`rank.pagerank(R)` / `rank.hodge_rank(R)`	graph / spectral	Rank from the pairwise comparison graph
`rank.rasch_mml_credible(R)`	IRT	Latent-ability estimate with a conservative posterior bound

from scorio import rank, utils
 
gold = rank.bayes(R)     # Bayes_U@80, the accuracy-based reference
methods = {
    "avg":           rank.avg(R),
    "bradley_terry": rank.bradley_terry(R),
    "borda":         rank.borda(R),
    "pagerank":      rank.pagerank(R),
    "rasch_mml":     rank.rasch_mml_credible(R),
}
for name, r in methods.items():
    tau = utils.compare_rankings(r, gold)   # rank agreement (Kendall's tau_b)
    print(f"{name:14s} tau_b = {tau:.3f}")

A benchmark is a tensor, and every ranking method is a projection of it

Under test-time scaling, a benchmark stops being a table of one score per model. Every model–question pair is attempted $N$ times, so the raw evidence is a three-dimensional grid: model, question, trial, each cell a $1$ if that attempt was correct. Call it the response tensor. A single-run benchmark is just the $N=1$ slice of it.

Once you see the data this way, the zoo of ranking methods becomes less intimidating. They are not competing theories of the world; they are different ways of flattening the same tensor. Average accuracy and IRT read it pointwise, as a per-question solve rate. Bradley–Terry, Elo, and the voting rules read it pairwise, as counts of which model beat which. Plackett–Luce reads it setwise, as the set of models that solved each question–trial. What a method keeps or throws away in that flattening step is the whole story of why two leaderboards can differ.

Let $\mathcal{L}=\{1,\dots,L\}$ index models and $\mathcal{Q}=\{1,\dots,M\}$ questions, with $N$ i.i.d. trials each. We observe binary outcomes

$$R_{lmn}\in\{0,1\}, \qquad \mathbf{R}\in\{0,1\}^{L\times M\times N},$$

where $R_{lmn}=1$ if model $l$ solves question $m$ on trial $n$. The natural pointwise summary is the per-question solve rate and its mean,

$$\widehat{p}_{lm} := \frac{1}{N}\sum_{n=1}^N R_{lmn}, \qquad \widehat{p}_{l} := \frac{1}{M}\sum_{m=1}^M \widehat{p}_{lm}.$$
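As a concrete reading of these definitions, here is a minimal NumPy sketch on simulated data. The sizes and the `true_p` solve probabilities are made up for illustration; nothing here calls scorio.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 4, 6, 8  # toy sizes: models, questions, trials

# Hypothetical per-model solve probabilities, used only to simulate outcomes
true_p = np.array([0.8, 0.6, 0.4, 0.2])
R = (rng.random((L, M, N)) < true_p[:, None, None]).astype(np.int8)

k = R.sum(axis=2)          # successes per model-question pair, shape (L, M)
p_lm = k / N               # per-question solve rate
p_l = p_lm.mean(axis=1)    # mean accuracy per model, shape (L,)

single_run = R[:, :, :1]   # the N=1 slice: a conventional single-run benchmark
```

Because every cell has the same number of trials, `p_l` coincides with the plain mean of the tensor over the question and trial axes.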

Because there is no universal ground truth for ranking methods, we score them against two targets. The first is an accuracy-based gold standard: the full-budget ordering $\mathrm{Bayes}_{\mathcal{U}}@80$, the Bayesian posterior mean with a uniform prior over all $80$ trials, which is order-equivalent to $\mathrm{avg}@80$ and allows ties. The second is a self-consistency target: the method's own full-trial ordering, method@80, which asks whether a method computed from one trial already agrees with the same method computed from eighty. Agreement is measured with Kendall's $\tau_b$ (tie-aware), so $\tau_b=1$ means an identical order.
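To make the tie-aware part concrete, here is a small sketch using SciPy's `kendalltau`, whose default variant is $\tau_b$. The two rankings are invented for illustration.

```python
from scipy.stats import kendalltau

# Two 1-indexed rankings of five models (lower is better); tied models share a rank
gold = [1, 2, 2, 4, 5]
other = [1, 3, 2, 4, 5]   # same order, except it breaks the tie between models 2 and 3

tau_b, _ = kendalltau(gold, other)   # variant="b" is SciPy's default
same, _ = kendalltau(gold, gold)     # a ranking compared with itself
```

In this toy case no pair of models is reversed, yet `tau_b` stays below $1$ because `other` separates a pair that `gold` leaves tied. Only the self-comparison reaches $1$.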

The three representations

Every method we study consumes $\mathbf{R}$ but operates on a projection of it.

Pointwise (model–question). Methods work on the matrix $\widehat{\mathbf{P}}=[\widehat{p}_{lm}]\in[0,1]^{L\times M}$ or its row means. Mean accuracy, inverse-difficulty weighting, and IRT-style models live here; when $N>1$ the trial axis is a stack of repeated Bernoulli observations with sufficient statistic $k_{lm}:=\sum_n R_{lmn}$, giving a binomial-response model. Evaluation metrics such as Pass@$k$ and $\mathrm{Bayes}@N$ additionally use the per-question trial multiset before averaging over questions.

Pairwise (win / tie). For an ordered pair $(i,j)$ define

$$W_{ij} := \sum_{m,n} \mathbf{1}\{R_{imn}=1, R_{jmn}=0\}, \qquad T_{ij} := \sum_{m,n} \mathbf{1}\{R_{imn}=R_{jmn}\},$$

so that $W_{ij}+W_{ji}+T_{ij}=MN$ for every $i\neq j$. In our fully-observed setting the comparison graph is complete, unlike Chatbot Arena, where the graph is sparse and evolving. Bradley–Terry and its tie extensions, Borda and Copeland, and graph/spectral methods (PageRank, Rank Centrality, HodgeRank, Nash averaging) all consume $(W_{ij},T_{ij})$, typically via the tied-split win rate $\widehat{P}_{i\succ j}=(W_{ij}+\tfrac12 T_{ij})/(W_{ij}+W_{ji}+T_{ij})$. Elo and TrueSkill instead replay the underlying stream of question–trial "matches."
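The win and tie counts above reduce to two matrix products once the question and trial axes are flattened. A minimal NumPy sketch on random data (not scorio's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 4, 6, 8
R = rng.integers(0, 2, size=(L, M, N))

flat = R.reshape(L, M * N)                      # each column is one question-trial "match"
W = flat @ (1 - flat).T                         # W[i, j]: i solved it and j did not
T = flat @ flat.T + (1 - flat) @ (1 - flat).T   # T[i, j]: both solved or both failed

P = (W + 0.5 * T) / (W + W.T + T)               # tied-split win rate
```

The identity $W_{ij}+W_{ji}+T_{ij}=MN$ holds entrywise, so the denominator never vanishes, and `P[i, j] + P[j, i]` is always $1$.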

Setwise (winner sets). For each question–trial $(m,n)$ the members of the winner set $U_{mn}:=\{l:R_{lmn}=1\}$ are tied with one another and ranked above the complement. Plackett–Luce and Davidson–Luce operate on the collection $\{(U_{mn}, \mathcal{L}\setminus U_{mn})\}$, discarding the all-solved and none-solved events that carry no ranking information.
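A sketch of the setwise projection, again on random data. Models are $0$-indexed here, and the `events` list is just one convenient container, not a format any particular library expects.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 4, 6, 8
R = rng.integers(0, 2, size=(L, M, N))

solved = R.sum(axis=0)                      # number of models solving each (m, n), shape (M, N)
informative = (solved > 0) & (solved < L)   # drop none-solved and all-solved events

# One (winners, losers) pair per informative question-trial
events = [
    (frozenset(np.flatnonzero(R[:, m, n] == 1)),
     frozenset(np.flatnonzero(R[:, m, n] == 0)))
    for m, n in zip(*np.nonzero(informative))
]
```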

A consequence worth stating plainly: even as $M$ or $N$ grows, these methods need not converge to a single limiting order. Probabilistic paired-comparison models can emphasize different aspects of performance than an expected-accuracy metric, which is why "compute more trials" does not by itself make the choice of ranking method moot.

With 80 trials, the ranking method barely matters

This is the reassuring half of the result, and the one I did not expect. When every method gets the full $N=80$, they agree with the accuracy-based gold standard, and largely with each other. The mean Kendall's $\tau_b$ between $\mathrm{Bayes}_{\mathcal{U}}@80$ and the other $71$ methods is $0.93$–$0.95$ per benchmark, the median is $0.95$–$0.99$, and a large block of methods reproduces the exact same ordering.

Benchmark	Mean $\tau_b$	Median $\tau_b$	Min $\tau_b$	#($\tau_b=1$)	#($\tau_b\ge 0.95$)
AIME'24	0.941	0.989	0.682	20	40
AIME'25	0.934	0.947	0.771	19	29
HMMT'25	0.950	0.989	0.758	34	44
BrUMO'25	0.954	0.968	0.789	26	49
Combined	0.962	0.989	0.748	22	53

Statistics are over the other $71$ methods, all computed from the full $80$ trials. The stragglers are a handful of voting rules (minimax and Nanson variants) and difficulty-weighted baselines.

The takeaway is a default: if you can afford a large trial budget, pick the simple, interpretable option. $\mathrm{Bayes}_{\mathcal{U}}@N$ is exactly average accuracy in ranking terms, plus uncertainty for free. The exotic machinery neither helps nor hurts once the data is plentiful.
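The order-equivalence can be checked in one line, assuming $\mathrm{Bayes}_{\mathcal{U}}@N$ places an independent uniform $\mathrm{Beta}(1,1)$ prior on each model–question solve rate (my reading of "uniform prior" here, not a statement about scorio's internals). The posterior mean for one cell is then $(1+k_{lm})/(N+2)$ with $k_{lm}=\sum_n R_{lmn}$, and averaging over questions gives

$$
\frac{1}{M}\sum_{m=1}^{M}\frac{1+k_{lm}}{N+2} \;=\; \frac{1+N\,\widehat{p}_{l}}{N+2}.
$$

With $M$ and $N$ shared by all models, the right-hand side is a strictly increasing function of $\widehat{p}_l$, so it sorts models exactly as mean accuracy does and preserves ties.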

The choice only bites at one trial

Cut the budget to a single trial per question and the methods separate. We subsample one of the $80$ trials, recompute every ranking, repeat over all $80$ single-trial draws, and report the mean $\tau_b \pm$ its standard deviation. (Pass@$k$ needs at least two trials, so $69$ methods remain at $N=1$.) Now the best method depends on which target you name: agreement with the accuracy gold standard, or self-consistency with a method's own full-budget order.
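The subsampling loop is short enough to sketch. The version below runs on simulated data and uses plain mean accuracy as the only ranking method; `accuracy_ranking`, `tau_b`, and `true_p` are names introduced for this example, and swapping in another method means replacing one function.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

rng = np.random.default_rng(0)
L, M, N = 6, 30, 80
true_p = np.linspace(0.3, 0.8, L)   # hypothetical model strengths for the simulation
R = (rng.random((L, M, N)) < true_p[:, None, None]).astype(int)

def accuracy_ranking(R):
    """1-indexed ranking by mean accuracy; lower is better, ties share a rank."""
    return rankdata(-R.mean(axis=(1, 2)), method="min")

def tau_b(a, b):
    tau, _ = kendalltau(a, b)       # SciPy's default variant is tau_b
    return tau

gold = accuracy_ranking(R)          # full-budget reference ordering

# One ranking per single-trial slice, each compared with the reference
taus = np.array([tau_b(accuracy_ranking(R[:, :, n:n + 1]), gold) for n in range(N)])
mean_tau, std_tau = taus.mean(), taus.std()
```

For the self-consistency target, `gold` would instead be the same method's ranking on the full tensor; with mean accuracy as the method, the two targets coincide.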

Benchmark	Best vs. gold standard	$\tau_b$	Best self-consistency	$\tau_b$
AIME'24	$\mathrm{Bayes}_{\mathbf{R}_0}@1$	$0.779 \pm 0.034$	Rasch MML (LCB)	$0.804 \pm 0.051$
AIME'25	$\mathrm{Bayes}_{\mathbf{R}_0}@1$	$0.798 \pm 0.045$	Rasch MML (LCB)	$0.834 \pm 0.054$
HMMT'25	$\mathrm{Bayes}_{\mathcal{U}}@1$ (+20 tied)	$0.790 \pm 0.053$	Rasch MML (LCB)	$0.810 \pm 0.056$
BrUMO'25	$\mathrm{Bayes}_{\mathbf{R}_0}@1$	$0.858 \pm 0.028$	$\mathrm{Bayes}_{\mathbf{R}_0}@1$	$0.858 \pm 0.028$
Combined	$\mathrm{Bayes}_{\mathcal{U}}@1$ (+20 tied)	$0.865 \pm 0.049$	Nanson (avg ties)	$0.892 \pm 0.050$

Two things stand out. First, high self-consistency does not imply closeness to the accuracy order: Nanson's rule is the most repeatable method on the combined benchmark ($\tau_b=0.892$) yet reaches only $0.807$ on gold-standard agreement, against $0.865$ for the best method there. A method can converge cleanly to its own answer while that answer drifts away from accuracy. Second, the low-budget winner is often the Bayesian estimator with a greedy prior, which turns out to be a double-edged tool.

These conclusions are not an artifact of the particular $20$ models. Re-running the $N=1$ analysis on $1000$ bootstrapped model pools of size $5$, $10$, and $15$ keeps the same winners; larger pools mainly shrink the between-subset spread. On AIME'24, the across-subset standard deviation of the top method falls from $0.209$ at five models to $0.057$ at fifteen. A bigger pool does not change the recommendation; it makes it more certain.

A greedy prior buys stability, but can move the ranking

The most useful and most dangerous knob in the low-budget regime is the empirical prior. Alongside the $80$ stochastic trials, we collect one greedy decode per question, $\mathbf{R}_0$, and fold it into the posterior as pseudo-counts. That gives $\mathrm{Bayes}_{\mathbf{R}_0}@N$, which is just $\mathrm{Bayes}@N$ shrunk toward the greedy ordering.

Shrinkage does what shrinkage always does: it trades variance for bias. At $N=1$ the greedy prior reduces the standard deviation of $\tau_b$ by $16$–$52\%$ depending on the benchmark. The advantage fades quickly as $N$ grows, because a single greedy run only contributes $O(1)$ pseudo-counts per question.
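A minimal sketch of the pseudo-count idea, assuming a $\mathrm{Beta}(1,1)$ prior per model–question pair and per-model greedy outcomes stored as an `(L, M, D)` array. Both the layout and the `bayes_mean` helper are assumptions of this example, not scorio's implementation.

```python
import numpy as np

def bayes_mean(R, R0=None):
    """Posterior-mean accuracy per model under a Beta(1, 1) prior.

    R:  (L, M, N) binary outcomes.
    R0: optional (L, M, D) binary prior outcomes, added as pseudo-counts
        (an assumed layout for this sketch).
    """
    k = R.sum(axis=2).astype(float)   # observed successes per cell
    n = float(R.shape[2])             # observed trials per cell
    if R0 is not None:
        k = k + R0.sum(axis=2)        # greedy successes count as extra successes
        n = n + R0.shape[2]           # and as extra trials
    return ((1.0 + k) / (2.0 + n)).mean(axis=1)

rng = np.random.default_rng(0)
L, M, N = 5, 30, 80
true_p = np.linspace(0.3, 0.7, L)     # hypothetical strengths for the simulation
R = (rng.random((L, M, N)) < true_p[:, None, None]).astype(int)
R0 = (rng.random((L, M, 1)) < true_p[:, None, None]).astype(int)  # stand-in for greedy decodes

one_trial = R[:, :, :1]
uniform_1 = bayes_mean(one_trial)      # uniform prior, N=1
greedy_1 = bayes_mean(one_trial, R0)   # greedy pseudo-counts, N=1

# Share of the posterior denominator contributed by the greedy run
prior_share = {n: 1 / (2 + n + 1) for n in (1, 80)}
```

With $D=1$ the greedy outcome carries as much weight as the single sampled trial at $N=1$ (a share of $1/4$ of the denominator), and almost none at $N=80$ (a share of $1/83$), which is the fading advantage described above.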

Gold-standard agreement of uniform vs. greedy-prior Bayes rankings as the trial budget grows, per benchmark

The catch is the mean. Cutting variance is only good if you were aiming at the right target. On three benchmarks the greedy prior nudges the mean $\tau_b$ up; on HMMT'25, the hardest one, it pushes the ranking away from the accuracy order. The sign of the shift tracks a single diagnostic: how well greedy decoding and stochastic sampling already agree.

Benchmark	Difficulty	$\tau_{\text{G-S}}$	$\Delta\tau$ (greedy − uniform)	Std. reduction
AIME'24	0.620	0.739	+0.020	42%
AIME'25	0.533	0.660	+0.008	17%
HMMT'25	0.333	0.635	−0.022	16%
BrUMO'25	0.588	0.768	+0.049	52%

$\tau_{\text{G-S}}$ is the greedy–sampling rank alignment: Kendall's $\tau_b$ between the ranking induced by greedy decoding and by stochastic sampling at $N=80$. Higher alignment goes with a more positive $\Delta\tau$. The prior helps most on BrUMO'25 (most aligned, $+0.049$) and hurts on HMMT'25 (least aligned, $-0.022$).

# One greedy decode per question, shared across models: shape (M, D)
R0 = np.load("greedy.npy")               # shape (30, 1)
 
ranking_uniform = rank.bayes(R)          # Bayes_U@N
ranking_greedy  = rank.bayes(R, R0=R0)   # Bayes_R0@N, greedy empirical prior
 
# Conservative, uncertainty-aware ranking via a posterior lower bound
ranking_lcb = rank.bayes(R, R0=R0, quantile=0.05)

The mechanism is intuitive once you see the scatter of greedy vs. sampled ranks below. Greedy decoding under-explores on hard instances, where stochastic sampling can still stumble onto a correct chain. When the two policies rank models the same way, the prior is free stabilization; when they diverge, it quietly biases the leaderboard toward greedy behavior. That is why the prior is not free information: wherever the two policies disagree it changes the target, so check the alignment on a small pilot before using it.
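The pilot check amounts to one Kendall's $\tau_b$ between two accuracy vectors. A sketch on simulated data, assuming per-model greedy outcomes in a hypothetical `(L, M)` array `G`:

```python
import numpy as np
from scipy.stats import kendalltau

def greedy_sampling_alignment(R, G):
    """Kendall's tau_b between greedy and sampled accuracy orderings.

    R: (L, M, N) sampled outcomes.
    G: (L, M) per-model greedy outcomes (a hypothetical layout for this sketch).
    """
    sampled_acc = R.mean(axis=(1, 2))
    greedy_acc = G.mean(axis=1)
    # tau_b on accuracies equals tau_b on the rankings they induce
    tau, _ = kendalltau(greedy_acc, sampled_acc)
    return tau

rng = np.random.default_rng(0)
L, M, N = 8, 30, 16                  # a small pilot rather than the full budget
true_p = np.linspace(0.2, 0.9, L)    # hypothetical strengths for the simulation
R = (rng.random((L, M, N)) < true_p[:, None, None]).astype(int)
G = (rng.random((L, M)) < true_p[:, None]).astype(int)

tau_gs = greedy_sampling_alignment(R, G)
```

The table above gives a sense of scale for real data: alignment ranged from $0.635$ to $0.768$ across the four benchmarks, and the prior only hurt at the low end. I would not read a hard threshold into four data points.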

Model ranks under greedy decoding vs. stochastic sampling, per benchmark

Beyond binary: categorical ranking has the same trap

The $\mathrm{Bayes}@N$ estimator is not limited to right/wrong. Map each completion to one of $C+1$ ordered categories, using signals like boxed-vs-unboxed answers, confidence (bits per token), token efficiency, or an external verifier, and attach a utility weight vector $\mathbf{w}=(w_0,\dots,w_C)$. Bayesian estimation then runs on a Dirichlet–multinomial model instead of Beta–binomial, and `rank.bayes(R_cat, w=w)` ranks by the weighted posterior mean.
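To show what a weighted posterior mean looks like in the categorical case, here is a NumPy sketch assuming a uniform Dirichlet prior (one pseudo-count per category). The `categorical_bayes_mean` helper and the utilities in `w` are illustrative assumptions, not scorio's code.

```python
import numpy as np

def categorical_bayes_mean(R_cat, w):
    """Weighted posterior mean per model under a uniform Dirichlet prior.

    R_cat: (L, M, N) integer categories in {0, ..., C}.
    w:     (C + 1,) utility weights, one per category.
    """
    w = np.asarray(w, dtype=float)
    n_cat = w.shape[0]
    # counts[l, m, c]: trials of model l on question m that landed in category c
    counts = (R_cat[..., None] == np.arange(n_cat)).sum(axis=2)
    posterior = (1.0 + counts) / (n_cat + R_cat.shape[2])   # Dirichlet posterior mean
    return (posterior @ w).mean(axis=1)                      # expected utility, averaged over questions

rng = np.random.default_rng(0)
R_cat = rng.integers(0, 3, size=(4, 10, 8))   # C + 1 = 3 categories
w = [0.0, 0.5, 1.0]                            # hypothetical utilities
scores = categorical_bayes_mean(R_cat, w)
```

With two categories and `w = [0, 1]` this reduces to the Beta–binomial posterior mean $(1+k_{lm})/(N+2)$ averaged over questions, so the binary case is a special case of the same formula.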

These auxiliary signals reproduce the self-consistency trap in a sharper form. Across categorical schemes at $N=1$ on the combined benchmark, the signal-rich schemes are the most self-consistent (Verifier-only reaches $\tau_{\text{Self}}=0.897$) yet the least faithful to the accuracy gold standard ($\tau_{\text{GS}}=0.824$). The trade-off is a clean negative correlation: the more a scheme leans on auxiliary signals, the more stable and the more biased it becomes.

Gold-standard agreement vs. self-consistency across categorical schemes

The practical rule that falls out is a reporting discipline: state the category mapping and the utility weights, and never read higher self-consistency as higher accuracy fidelity. All eight representative schemes also correlate more strongly with the greedy-prior reference than with the uniform one. This is the same mechanism as the empirical prior: verifier and format signals partly encode greedy-decoding behavior.

What to actually report

The paper is not a case for one exotic leaderboard. It is a case for a small amount of discipline once evaluation becomes a repeated-sampling problem:

Name the target. Accuracy agreement, self-consistency, and task-utility rankings are different objects. Decide which one you want before declaring a winner.
Report stability, not just a point ranking. At low budget, a leaderboard needs $\tau_b$ , an uncertainty estimate, and convergence as $N$ grows.
Use $\mathrm{Bayes}_{\mathcal{U}}@N$ as the default. It is transparent, order-equivalent to accuracy, and uncertainty-aware. Reach for a greedy prior only after checking greedy–sampling alignment on a pilot sample, and audit any auxiliary signal the same way.

Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$–$0.95$), and $19$–$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$–$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.

This is the ranking half of the Bayes@N story, and I am still working on it, extending the analysis past binary correctness to partial credit and rubric-based scoring. If you want to compare notes or you are running your own repeated-trial evaluations, my email is in the footer.