Ranking Reasoning LLMs under Test-Time Scaling

    TL;DR: Test-time scaling samples a reasoning model many times per question, so a leaderboard is no longer one score per model but a stack of 0/1 outcomes, one per model–question–trial. We turned that stack into a response tensor and asked which of the many ways to rank models you should trust. Across 20 models and four Olympiad-style math benchmarks at $N=80$ trials, 72 ranking methods, spanning $\mathrm{Bayes}@N$, Bradley–Terry, Elo, IRT, voting rules, and PageRank, mostly agree: mean Kendall's $\tau_b$ with the accuracy-based gold standard is 0.93–0.95, and 19–34 of them reproduce the exact same order. The choice of method only starts to matter when the budget collapses to a single trial. There, a greedy-decode prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) cuts ranking variance by 16–52%, but biases the order when greedy and sampled decoding disagree. All of it ships in scorio.

    Full-trial rankings of many methods against the gold standard on an easy and a hard benchmark

    The question left over from Bayes@N

    The previous paper settled a scoring question: replace Pass@$k$ with a posterior estimate, $\mathrm{Bayes}@N$, and you get stable numbers and honest uncertainty from far fewer samples. But a score is not a leaderboard. The moment we had the full dataset in hand (20 reasoning models, four benchmarks, 80 stochastic trials for every model–question pair), the open question changed shape. It was no longer "how do I score one model?" It was "how do I order twenty of them, and does it even matter which method I use?"

    That question has an embarrassing number of answers. Statistics has been ranking things for two centuries: Bradley–Terry and Elo for head-to-head games, Borda and Copeland for votes, PageRank and HodgeRank for graphs, item response theory for tests. Each was designed for a different world, and each will happily produce a leaderboard from the same data. I expected them to disagree, and to spend the paper arguing about which one is right.

    They mostly don't disagree. When you feed all 80 trials to 72 different ranking methods, the orderings pile up on top of each other. On the easier benchmark, several methods land on exactly the same order (the left panel of the figure above). The interesting behavior is not at full budget. It shows up when you can only afford one trial per question, which is exactly the regime a lab actually lives in.

    Use it in practice

    scorio installs with pip install scorio. It scores an outcome matrix (the Bayes@N side) and, new here, ranks models directly from a response tensor. Every ranking method takes the same tensor $\mathbf{R}$ of shape (L, M, N) and returns a 1-indexed ranking where lower is better.

    import numpy as np
    from scorio import rank
     
    # Response tensor: L=20 models, M=30 questions, N=80 trials, entries in {0, 1}
    R = np.load("responses.npy")            # shape (20, 30, 80)
     
    # Accuracy-based ranking (the gold-standard target)
    ranking = rank.avg(R)
     
    # Bayesian posterior-mean ranking (order-equivalent to avg@N under a uniform prior)
    ranking, scores = rank.bayes(R, return_scores=True)

    Swapping the method is a one-liner, which is what makes comparing them cheap:

    | Function | Family | What it computes |
    |---|---|---|
    | rank.avg(R) | pointwise | Mean-accuracy ranking (the gold-standard target) |
    | rank.bayes(R, R0=…, w=…, quantile=…) | Bayesian metric | Posterior-mean or lower-bound ranking; optional greedy prior and categorical weights |
    | rank.bradley_terry(R) | paired-comparison | Latent-strength ranking from pairwise win counts |
    | rank.elo(R) / rank.trueskill(R) | rating | Sequential / Bayesian rating over the induced match stream |
    | rank.borda(R) / rank.copeland(R) | voting | Treat each question as a voter and aggregate |
    | rank.pagerank(R) / rank.hodge_rank(R) | graph / spectral | Rank from the pairwise comparison graph |
    | rank.rasch_mml_credible(R) | IRT | Latent-ability estimate with a conservative posterior bound |

    from scorio import rank, utils
     
    gold = rank.bayes(R)     # Bayes_U@80, the accuracy-based reference
    methods = {
        "avg":           rank.avg(R),
        "bradley_terry": rank.bradley_terry(R),
        "borda":         rank.borda(R),
        "pagerank":      rank.pagerank(R),
        "rasch_mml":     rank.rasch_mml_credible(R),
    }
    for name, r in methods.items():
        tau = utils.compare_rankings(r, gold)   # rank agreement (Kendall's tau_b)
        print(f"{name:14s} tau_b = {tau:.3f}")

    A benchmark is a tensor, and every ranking method is a projection of it

    Under test-time scaling, a benchmark stops being a table of one score per model. Every model–question pair is attempted $N$ times, so the raw evidence is a three-dimensional grid: model, question, trial, each cell a 1 if that attempt was correct and a 0 otherwise. Call it the response tensor. A single-run benchmark is just the $N=1$ slice of it.

    Once you see the data this way, the zoo of ranking methods becomes less intimidating. They are not competing theories of the world; they are different ways of flattening the same tensor. Average accuracy and IRT read it pointwise, as a per-question solve rate. Bradley–Terry, Elo, and the voting rules read it pairwise, as counts of which model beat which. Plackett–Luce reads it setwise, as the set of models that solved each question–trial. What a method keeps or throws away in that flattening step is the whole story of why two leaderboards can differ.

    Let $\mathcal{L}=\{1,\dots,L\}$ index models and $\mathcal{Q}=\{1,\dots,M\}$ index questions, with $N$ i.i.d. trials each. We observe binary outcomes

    $$R_{lmn}\in\{0,1\}, \qquad \mathbf{R}\in\{0,1\}^{L\times M\times N},$$

    where $R_{lmn}=1$ if model $l$ solves question $m$ on trial $n$. The natural pointwise summary is the per-question solve rate and its mean,

    $$\widehat{p}_{lm} := \frac{1}{N}\sum_{n=1}^N R_{lmn}, \qquad \widehat{p}_{l} := \frac{1}{M}\sum_{m=1}^M \widehat{p}_{lm}.$$
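
    To make the two summaries concrete, here is a minimal NumPy sketch that computes per-question solve rates and mean accuracies from a toy tensor and turns them into a 1-indexed ranking. The helper names and the tie convention (tied models share the smallest rank) are illustrative assumptions, not scorio's API.

```python
import numpy as np

def solve_rates(R):
    """Per-question solve rate p_hat[l, m]: mean over the trial axis."""
    return R.mean(axis=2)

def mean_accuracy(R):
    """Per-model mean accuracy p_hat[l]: mean of the solve rates over questions."""
    return solve_rates(R).mean(axis=1)

def rank_by_score(scores):
    """1-indexed ranking, lower is better; tied scores share the smallest rank."""
    scores = np.asarray(scores, dtype=float)
    # Rank of model l = 1 + number of models with a strictly higher score.
    return 1 + (scores[None, :] > scores[:, None]).sum(axis=1)

# Toy tensor: L=3 models, M=2 questions, N=4 trials
R = np.array([
    [[1, 1, 1, 0], [1, 0, 0, 0]],   # model 0: solve rates 0.75, 0.25
    [[1, 1, 1, 1], [1, 1, 0, 0]],   # model 1: solve rates 1.00, 0.50
    [[0, 1, 0, 0], [1, 1, 1, 0]],   # model 2: solve rates 0.25, 0.75
])

p_lm = solve_rates(R)        # shape (3, 2)
p_l = mean_accuracy(R)       # [0.5, 0.75, 0.5]
ranking = rank_by_score(p_l)
print(ranking)               # [2 1 2]
```

    Models 0 and 2 reach the same mean accuracy through very different per-question profiles, which is exactly the information a pointwise mean discards.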

    Because there is no universal ground truth for ranking methods, we score them against two targets. The first is an accuracy-based gold standard: the full-budget ordering $\mathrm{Bayes}_{\mathcal{U}}@80$, the Bayesian posterior mean with a uniform prior over all 80 trials, which is order-equivalent to $\mathrm{avg}@80$ and allows ties. The second is a self-consistency target: the method's own full-trial ordering, method@80, which asks whether a method computed from one trial already agrees with the same method computed from eighty. Agreement is measured with Kendall's $\tau_b$ (tie-aware), so $\tau_b=1$ means an identical order.
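
    For reference, Kendall's $\tau_b$ can be computed directly from its definition, $(C-D)/\sqrt{(n_0-n_x)(n_0-n_y)}$, where $C$ and $D$ count concordant and discordant pairs, $n_0$ is the total number of pairs, and $n_x$, $n_y$ count the pairs tied in each ranking. The pure-Python sketch below illustrates the statistic; the function name is hypothetical and this is not scorio's implementation.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Tie-aware Kendall's tau_b between two equal-length rank vectors."""
    if len(x) != len(y):
        raise ValueError("rank vectors must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    # If either ranking is constant, tau_b is undefined.
    return (concordant - discordant) / denom if denom > 0 else float("nan")

gold = [1, 2, 3, 4]
print(kendall_tau_b(gold, [1, 2, 3, 4]))   # identical order: 1.0
print(kendall_tau_b(gold, [2, 1, 3, 4]))   # one adjacent swap: (5 - 1) / 6
print(kendall_tau_b(gold, [1, 1, 3, 4]))   # one tie: 5 / sqrt(5 * 6)
```

    The tie correction in the denominator is what lets a ranking with tied models still be compared fairly against a strict one.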

    The three representations

    Every method we study consumes $\mathbf{R}$ but operates on a projection of it.

    Pointwise (model–question). Methods work on the matrix $\widehat{\mathbf{P}}=[\widehat{p}_{lm}]\in[0,1]^{L\times M}$ or its row means. Mean accuracy, inverse-difficulty weighting, and IRT-style models live here; when $N>1$ the trial axis is a stack of repeated Bernoulli observations with sufficient statistic $k_{lm}:=\sum_n R_{lmn}$, giving a binomial-response model. Evaluation metrics such as Pass@$k$ and $\mathrm{Bayes}@N$ additionally use the per-question trial multiset before averaging over questions.

    Pairwise (win / tie). For an ordered pair $(i,j)$ define

    $$W_{ij} := \sum_{m,n} \mathbf{1}\{R_{imn}=1, R_{jmn}=0\}, \qquad T_{ij} := \sum_{m,n} \mathbf{1}\{R_{imn}=R_{jmn}\},$$

    so that $W_{ij}+W_{ji}+T_{ij}=MN$ for every $i\neq j$. In our fully-observed setting the comparison graph is complete, unlike Chatbot Arena, where the graph is sparse and evolving. Bradley–Terry and its tie extensions, Borda and Copeland, and graph/spectral methods (PageRank, Rank Centrality, HodgeRank, Nash averaging) all consume $(W_{ij},T_{ij})$, typically via the tied-split win rate $\widehat{P}_{i\succ j}=(W_{ij}+\tfrac12 T_{ij})/(W_{ij}+W_{ji}+T_{ij})$. Elo and TrueSkill instead replay the underlying stream of question–trial "matches."
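
    The win and tie counts reduce to a few matrix products once the question and trial axes are flattened. The NumPy sketch below is illustrative, with hypothetical helper names rather than scorio's API, and checks the identity $W_{ij}+W_{ji}+T_{ij}=MN$ on a random tensor.

```python
import numpy as np

def pairwise_counts(R):
    """Return (W, T) where W[i, j] counts question-trials that model i solves
    and model j fails, and T[i, j] counts question-trials with equal outcomes."""
    flat = R.reshape(R.shape[0], -1).astype(np.int64)   # shape (L, M*N)
    wins = flat @ (1 - flat).T                          # i right, j wrong
    ties = flat @ flat.T + (1 - flat) @ (1 - flat).T    # both right + both wrong
    return wins, ties

def tied_split_win_rate(W, T):
    """Win rate of i over j, with each tie counted as half a win for both sides."""
    return (W + 0.5 * T) / (W + W.T + T)

rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(4, 5, 6))   # L=4 models, M=5 questions, N=6 trials

W, T = pairwise_counts(R)
P = tied_split_win_rate(W, T)

M, N = R.shape[1], R.shape[2]
print(np.all(W + W.T + T == M * N))      # every pair is compared M*N times
print(np.allclose(P + P.T, 1.0))         # win rates of i over j and j over i sum to 1
```

    The diagonal of `P` is 0.5 by construction and carries no information; methods that consume these matrices only use the off-diagonal entries.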

    Setwise (winner sets). For each question–trial $(m,n)$ the winner set $U_{mn}:=\{l:R_{lmn}=1\}$ is ranked, as a tied group, above its complement. Plackett–Luce and Davidson–Luce operate on the collection $\{(U_{mn}, \mathcal{L}\setminus U_{mn})\}$, discarding the all-solved and none-solved events that carry no ranking information.
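
    As a small illustration of the setwise view, the sketch below extracts the (winners, losers) pair for each question–trial and drops the events where every model or no model solved the question. The function name is hypothetical; fitting a Plackett–Luce or Davidson–Luce model on these events is a separate step not shown here.

```python
import numpy as np

def winner_sets(R):
    """List of (winners, losers) frozensets, one per informative question-trial."""
    L, M, N = R.shape
    events = []
    for m in range(M):
        for n in range(N):
            solved = R[:, m, n].astype(bool)
            n_solved = int(solved.sum())
            if n_solved == 0 or n_solved == L:
                continue  # all-solved or none-solved: no ranking information
            winners = frozenset(np.flatnonzero(solved).tolist())
            losers = frozenset(np.flatnonzero(~solved).tolist())
            events.append((winners, losers))
    return events

# Toy tensor: L=3 models, M=2 questions, N=2 trials
R = np.array([
    [[1, 1], [0, 1]],   # model 0
    [[1, 0], [0, 1]],   # model 1
    [[1, 0], [0, 0]],   # model 2
])

events = winner_sets(R)
for winners, losers in events:
    print(sorted(winners), ">", sorted(losers))
# [0] > [1, 2]
# [0, 1] > [2]
```

    Two of the four question–trials are discarded here, which hints at why setwise methods can see far less data than the raw tensor size suggests on very easy or very hard benchmarks.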

    A consequence worth stating plainly: even as $M$ or $N$ grows, these methods need not converge to a single limiting order. Probabilistic paired-comparison models can emphasize different aspects of performance than an expected-accuracy metric, which is why "compute more trials" does not by itself make the choice of ranking method moot.

    With 80 trials, the ranking method barely matters

    This is the reassuring half of the result, and the one I did not expect. When every method gets the full $N=80$, they agree with the accuracy-based gold standard, and largely with each other. The mean Kendall's $\tau_b$ between $\mathrm{Bayes}_{\mathcal{U}}@80$ and the other 71 methods is 0.93–0.95 per benchmark, the median is 0.95–0.99, and a large block of methods reproduces the exact same ordering.

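    | Benchmark | Mean $\tau_b$ | Median | Min | #($\tau_b=1$) | #($\tau_b\ge 0.95$) |
    |---|---|---|---|---|---|
    | AIME'24 | 0.941 | 0.989 | 0.682 | 20 | 40 |
    | AIME'25 | 0.934 | 0.947 | 0.771 | 19 | 29 |
    | HMMT'25 | 0.950 | 0.989 | 0.758 | 34 | 44 |
    | BrUMO'25 | 0.954 | 0.968 | 0.789 | 26 | 49 |
    | Combined | 0.962 | 0.989 | 0.748 | 22 | 53 |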

    Statistics are over the other 71 methods, all computed from the full 80 trials. The stragglers are a handful of voting rules (minimax and Nanson variants) and difficulty-weighted baselines.

    The takeaway is a default: if you can afford a large trial budget, pick the simple, interpretable option. $\mathrm{Bayes}_{\mathcal{U}}@N$ is exactly average accuracy in ranking terms, plus uncertainty for free. The exotic machinery neither helps nor hurts once the data is plentiful.

    The choice only bites at one trial

    Cut the budget to a single trial per question and the methods separate. We subsample one of the 80 trials, recompute every ranking, repeat over all 80 single-trial draws, and report the mean $\tau_b$ $\pm$ its standard deviation. (Pass@$k$ needs at least two trials, so 69 methods remain at $N=1$.) Now the best method depends on which target you name: agreement with the accuracy gold standard, or self-consistency with a method's own full-budget order.

    | Benchmark | Best vs. gold standard | $\tau_b$ | Best self-consistency | $\tau_b$ |
    |---|---|---|---|---|
    | AIME'24 | $\mathrm{Bayes}_{\mathbf{R}_0}@1$ | $0.779 \pm 0.034$ | Rasch MML (LCB) | $0.804 \pm 0.051$ |
    | AIME'25 | $\mathrm{Bayes}_{\mathbf{R}_0}@1$ | $0.798 \pm 0.045$ | Rasch MML (LCB) | $0.834 \pm 0.054$ |
    | HMMT'25 | $\mathrm{Bayes}_{\mathcal{U}}@1$ (+20 tied) | $0.790 \pm 0.053$ | Rasch MML (LCB) | $0.810 \pm 0.056$ |
    | BrUMO'25 | $\mathrm{Bayes}_{\mathbf{R}_0}@1$ | $0.858 \pm 0.028$ | $\mathrm{Bayes}_{\mathbf{R}_0}@1$ | $0.858 \pm 0.028$ |
    | Combined | $\mathrm{Bayes}_{\mathcal{U}}@1$ (+20 tied) | $0.865 \pm 0.049$ | Nanson (avg ties) | $0.892 \pm 0.050$ |

    Two things stand out. First, high self-consistency does not imply closeness to the accuracy order: Nanson's rule is the most repeatable method on the combined benchmark ($\tau_b=0.892$) yet trails badly on gold-standard agreement (0.807). A method can converge cleanly to its own answer while that answer drifts away from accuracy. Second, the low-budget winner is often the Bayesian estimator with a greedy prior, which turns out to be a double-edged tool.

    These conclusions are not an artifact of the particular 20 models. Re-running the $N=1$ analysis on 1000 bootstrapped model pools of size 5, 10, and 15 keeps the same winners; larger pools mainly shrink the between-subset spread. On AIME'24, the across-subset standard deviation of the top method falls from 0.209 at five models to 0.057 at fifteen. A bigger pool does not change the recommendation; it makes it more certain.
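
    The single-trial protocol used in this section (take one trial, re-rank, compare with the full-budget order, repeat over every trial) can be sketched in a few lines of NumPy. This is an illustration on synthetic data using mean accuracy as the only ranking method; the helper names, the tie convention, and the use of the population standard deviation are assumptions, not the paper's exact pipeline.

```python
import numpy as np

def rank_by_accuracy(R):
    """1-indexed ranking by mean accuracy; tied models share the smallest rank."""
    acc = R.mean(axis=(1, 2))
    return 1 + (acc[None, :] > acc[:, None]).sum(axis=1)

def kendall_tau_b(x, y):
    """Tie-aware Kendall's tau_b between two rank vectors (NaN if one is constant)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    upper = np.triu_indices(len(x), k=1)               # each unordered pair once
    dx = np.sign(x[:, None] - x[None, :])[upper]
    dy = np.sign(y[:, None] - y[None, :])[upper]
    denom = np.sqrt(np.count_nonzero(dx) * np.count_nonzero(dy))
    return float(np.sum(dx * dy) / denom) if denom > 0 else float("nan")

def single_trial_agreement(R):
    """Mean and std of tau_b between each one-trial ranking and the full-budget one."""
    gold = rank_by_accuracy(R)
    taus = [kendall_tau_b(rank_by_accuracy(R[:, :, n:n + 1]), gold)
            for n in range(R.shape[2])]
    return float(np.nanmean(taus)), float(np.nanstd(taus))

# Synthetic stand-in for real data: 6 models with well-separated solve probabilities
rng = np.random.default_rng(0)
ability = np.linspace(0.2, 0.8, 6)
R = (rng.random((6, 30, 80)) < ability[:, None, None]).astype(int)

mean_tau, std_tau = single_trial_agreement(R)
print(f"tau_b at N=1 vs. full budget: {mean_tau:.3f} ± {std_tau:.3f}")
```

    Swapping `rank_by_accuracy` for another ranking function in the inner call, while keeping `gold` fixed, gives the gold-standard column; using the same function for both gives the self-consistency column.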

    A greedy prior buys stability, but can move the ranking

    The most useful and most dangerous knob in the low-budget regime is the empirical prior. Alongside the 80 stochastic trials, we collect one greedy decode per question, $\mathbf{R}_0$, and fold it into the posterior as pseudo-counts. That gives $\mathrm{Bayes}_{\mathbf{R}_0}@N$, which is just $\mathrm{Bayes}@N$ shrunk toward the greedy ordering.

    Shrinkage does what shrinkage always does: it trades variance for bias. At $N=1$ the greedy prior reduces the standard deviation of $\tau_b$ by 16–52% depending on the benchmark. The advantage fades quickly as $N$ grows, because a single greedy run only contributes $O(1)$ pseudo-counts per question.
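
    A rough way to see both effects is a Beta–binomial posterior mean in which the greedy outcomes are simply added as extra observations. The sketch below assumes a Beta(1, 1) prior per model–question pair and a per-model greedy tensor of shape (L, M, D); both the function names and that layout are illustrative assumptions and may differ from what scorio does internally.

```python
import numpy as np

def bayes_posterior_mean(R, R0=None):
    """Per-model posterior-mean score under a Beta(1, 1) prior, optionally
    augmented with greedy outcomes R0 (shape (L, M, D)) as pseudo-counts."""
    successes = R.sum(axis=2).astype(float)     # k_lm from the sampled trials
    trials = float(R.shape[2])
    if R0 is not None:
        successes += R0.sum(axis=2)             # greedy successes as pseudo-counts
        trials += R0.shape[2]
    return ((successes + 1.0) / (trials + 2.0)).mean(axis=1)

def greedy_weight(N, D=1):
    """Share of the posterior's effective sample size that comes from D greedy decodes."""
    return D / (N + D + 2)

# Two models, two questions, one sampled trial each: tied on accuracy
R = np.array([[[1], [0]],
              [[0], [1]]])
# Greedy decoding favours model 0 on both questions
R0 = np.array([[[1], [1]],
               [[0], [0]]])

uniform = bayes_posterior_mean(R)        # [0.5, 0.5]
greedy = bayes_posterior_mean(R, R0)     # [0.625, 0.375]
print(uniform, greedy)
print(greedy_weight(1), greedy_weight(80))   # 0.25 at N=1, 1/83 at N=80
```

    With a uniform prior the score is an increasing affine function of mean accuracy, so the order cannot change; with the greedy pseudo-counts the tie is broken toward the greedy-preferred model, and the size of that pull shrinks from a quarter of the evidence at one trial to about one percent at eighty.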

    Gold-standard agreement of uniform vs. greedy-prior Bayes rankings as the trial budget grows, per benchmark

    The catch is the mean. Cutting variance is only good if you were aiming at the right target. On three benchmarks the greedy prior nudges the mean $\tau_b$ up; on HMMT'25, the hardest one, it pushes the ranking away from the accuracy order. The sign of the shift tracks a single diagnostic: how well greedy decoding and stochastic sampling already agree.

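    | Benchmark | Difficulty | $\tau_{\text{G-S}}$ | $\Delta\tau$ (greedy − uniform) | Std. reduction |
    |---|---|---|---|---|
    | AIME'24 | 0.620 | 0.739 | +0.020 | 42% |
    | AIME'25 | 0.533 | 0.660 | +0.008 | 17% |
    | HMMT'25 | 0.333 | 0.635 | −0.022 | 16% |
    | BrUMO'25 | 0.588 | 0.768 | +0.049 | 52% |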

    $\tau_{\text{G-S}}$ is the greedy–sampling rank alignment: Kendall's $\tau_b$ between the ranking induced by greedy decoding and the ranking induced by stochastic sampling at $N=80$. Higher alignment goes with a more positive $\Delta\tau$. The prior helps most on BrUMO'25 (most aligned, $+0.049$) and hurts on HMMT'25 (least aligned, $-0.022$).

    # One greedy decode per question, shared across models: shape (M, D)
    R0 = np.load("greedy.npy")               # shape (30, 1)
     
    ranking_uniform = rank.bayes(R)          # Bayes_U@N
    ranking_greedy  = rank.bayes(R, R0=R0)   # Bayes_R0@N, greedy empirical prior
     
    # Conservative, uncertainty-aware ranking via a posterior lower bound
    ranking_lcb = rank.bayes(R, R0=R0, quantile=0.05)

    The mechanism is intuitive once you see the scatter of greedy vs. sampled ranks below. Greedy decoding under-explores on hard instances, where stochastic sampling can still stumble onto a correct chain. When the two policies rank models the same way, the prior is free stabilization; when they diverge, it quietly biases the leaderboard toward greedy behavior. That is why the prior is not free information; it changes the target unless you have checked the alignment on a small pilot.

    Model ranks under greedy decoding vs. stochastic sampling, per benchmark

    Beyond binary: categorical ranking has the same trap

    The $\mathrm{Bayes}@N$ estimator is not limited to right/wrong. Map each completion to one of $C+1$ ordered categories, using signals like boxed-vs-unboxed answers, confidence (bits per token), token efficiency, or an external verifier, and attach a utility weight vector $\mathbf{w}=(w_0,\dots,w_C)$. Bayesian estimation then runs on a Dirichlet–multinomial model instead of Beta–binomial, and rank.bayes(R_cat, w=w) ranks by the weighted posterior mean.
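
    To illustrate what a weighted posterior mean looks like in the categorical case, here is a minimal sketch assuming a symmetric Dirichlet(1, …, 1) prior per model–question pair. The function name, the three-category example, and the weights are hypothetical; this is not scorio's implementation of rank.bayes.

```python
import numpy as np

def categorical_bayes_score(R_cat, w):
    """Weighted posterior-mean score per model.

    R_cat: integer array of shape (L, M, N) with entries in {0, ..., C}.
    w:     utility weights of length C + 1.
    """
    w = np.asarray(w, dtype=float)
    n_cat = len(w)
    N = R_cat.shape[2]
    # counts[l, m, c] = number of trials of model l on question m in category c
    counts = np.stack([(R_cat == c).sum(axis=2) for c in range(n_cat)], axis=-1)
    posterior = (counts + 1.0) / (N + n_cat)    # Dirichlet posterior mean
    return (posterior @ w).mean(axis=1)         # expected utility, averaged over questions

# Hypothetical scheme: 0 = wrong, 1 = correct but unboxed, 2 = correct and boxed
w = [0.0, 0.5, 1.0]
R_cat = np.array([
    [[2, 2, 1, 0]],   # model 0: counts (1, 1, 2)
    [[1, 1, 1, 1]],   # model 1: counts (0, 4, 0)
])

scores = categorical_bayes_score(R_cat, w)
print(scores)   # [4/7, 1/2]
```

    With two categories and weights (0, 1) the score reduces to the Beta–binomial posterior mean $(k_{lm}+1)/(N+2)$ averaged over questions, so the binary case is a special case of the same formula.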

    These auxiliary signals reproduce the self-consistency trap in a sharper form. Across categorical schemes at $N=1$ on the combined benchmark, the signal-rich schemes are the most self-consistent (Verifier-only reaches $\tau_{\text{Self}}=0.897$) yet the least faithful to the accuracy gold standard ($\tau_{\text{GS}}=0.824$). The trade-off is a clean negative correlation: the more a scheme leans on auxiliary signals, the more stable and the more biased it becomes.

    Gold-standard agreement vs. self-consistency across categorical schemes

    The practical rule that falls out is a reporting discipline: state the category mapping and the utility weights, and never read higher self-consistency as higher accuracy fidelity. All eight representative schemes also correlate more strongly with the greedy-prior reference than the uniform one, the same mechanism as the empirical prior: verifier and format signals partly encode greedy-decoding behavior.

    What to actually report

    The paper is not a case for one exotic leaderboard. It is a case for a small amount of discipline once evaluation becomes a repeated-sampling problem:

    • Name the target. Accuracy agreement, self-consistency, and task-utility rankings are different objects. Decide which one you want before declaring a winner.
    • Report stability, not just a point ranking. At low budget, a leaderboard needs $\tau_b$, an uncertainty estimate, and convergence as $N$ grows.
    • Use $\mathrm{Bayes}_{\mathcal{U}}@N$ as the default. It is transparent, order-equivalent to accuracy, and uncertainty-aware. Reach for a greedy prior only after checking greedy–sampling alignment on a pilot sample, and audit any auxiliary signal the same way.

    Abstract

    Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$–$0.95$), and 19–34 methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.

    This is the ranking half of the Bayes@N story, and I am still working on it, extending the analysis past binary correctness to partial credit and rubric-based scoring. If you want to compare notes or you are running your own repeated-trial evaluations, my email is in the footer.

    Citation

    @article{hariri2026ranking,
      title={Ranking reasoning LLMs under test-time scaling},
      author={Hariri, Mohsen and Hinczewski, Michael and Ma, Jing and Chaudhary, Vipin},
      journal={arXiv preprint arXiv:2603.10960},
      year={2026}
    }

    Acknowledgments

    This research was supported in part by NSF awards 2117439, 2112606, and 2320952.

    Contact

    For questions or correspondence, you can find my email in the footer.