Ranking Reasoning LLMs under Test-Time Scaling

Test-Time Scaling & Ranking

The primitive object is a response tensor.
Each model-question pair can be sampled repeatedly, so rankings change as the trial budget grows.
Methods that look similar at full budget can behave very differently when only one or two trials are available.

Collect repeated trials per question
Apply one ranking family to the tensor
Use $\mathrm{Bayes}_{\mathcal{U}}@80$ as the gold standard
Measure agreement and convergence as N grows
Prefer rules that stay stable at low budget

Distinction: high-budget consensus tells us what methods eventually agree on; low-budget stability tells us what can be trusted in practice.

from scorio import rank

rank.avg(R)
rank.bayes(R, R0=None, quantile=None)
rank.pass_at_k(R, k=3)
rank.bradley_terry(R)
rank.pagerank(R)
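The protocol above (collect trials, rank, compare against a full-budget reference as N grows) can be sketched without the library. This is a minimal illustration on synthetic data, not the authors' pipeline: it uses mean correctness as a stand-in for the gold ranking rather than $\mathrm{Bayes}_{\mathcal{U}}@80$, and the helpers `kendall_tau_b`, `mean_accuracy`, and `subsample` are hypothetical names written for this example.

```python
import random
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two score lists (returns 0.0 when undefined)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: drops out of every term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def mean_accuracy(R):
    """Per-model mean correctness over all questions and trials."""
    return [
        sum(sum(trials) for trials in model) / sum(len(trials) for trials in model)
        for model in R
    ]


def subsample(R, n, rng):
    """Keep n randomly chosen trials for every (model, question) cell."""
    return [[rng.sample(trials, n) for trials in model] for model in R]


# Synthetic binary tensor as nested lists: 6 models, 30 questions, 80 trials.
rng = random.Random(0)
true_accuracy = [0.2 + 0.1 * m for m in range(6)]
R = [[[int(rng.random() < p) for _ in range(80)] for _ in range(30)]
     for p in true_accuracy]

gold = mean_accuracy(R)  # full-budget reference scores (stand-in for the gold standard)
agreement = {}
for n in (1, 2, 8, 80):
    taus = [kendall_tau_b(mean_accuracy(subsample(R, n, rng)), gold) for _ in range(20)]
    agreement[n] = sum(taus) / len(taus)
    print(f"N={n:2d}  mean tau_b vs full budget = {agreement[n]:.3f}")
```

Swapping `mean_accuracy` for any other scoring rule gives the same stress test for that rule; the spread of `taus` at small `n` is the quantity that distinguishes stable rules from unstable ones.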

Best Low-Budget Methods

$\mathrm{Bayes}_{R0}@1$ wins when greedy decoding is aligned.

Gold target

Benchmark	Closest to $\mathrm{Bayes}_{\mathcal{U}}@80$	$\tau_b$
AIME'24	$\mathrm{Bayes}_{R0}@1$	0.779 ± 0.034
AIME'25	$\mathrm{Bayes}_{R0}@1$	0.798 ± 0.045
HMMT'25	$\mathrm{Bayes}_{\mathcal{U}}@1$	0.790 ± 0.053
BrUMO'25	$\mathrm{Bayes}_{R0}@1$	0.858 ± 0.028
Combined	$\mathrm{Bayes}_{\mathcal{U}}@1$	0.865 ± 0.049

Repeatability

Benchmark	Best self-consistency	$\tau_b$
AIME'24	Rasch MML LCB	0.804 ± 0.051
AIME'25	Rasch MML LCB	0.834 ± 0.054
HMMT'25	Rasch MML LCB	0.810 ± 0.056
BrUMO'25	$\mathrm{Bayes}_{R0}@1$	0.858 ± 0.028
Combined	Nanson avg ties	0.892 ± 0.050

Gold agreement: $\mathrm{Bayes}_{R0}@1$ leads on aligned benchmarks; $\mathrm{Bayes}_{\mathcal{U}}@1$ handles drift.

Self-consistency: repeatable winners can differ from correctness-based winners.
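For reference, the agreement statistic $\tau_b$ reported in both tables is Kendall's tie-corrected rank correlation. The standard definition is reproduced here as a reminder; the symbols $n_c$, $n_d$, $n_0$, $n_1$, $n_2$ are introduced for this note and do not appear elsewhere on the page.

```latex
\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}},
\qquad n_0 = \binom{L}{2}
```

Here $n_c$ and $n_d$ count concordant and discordant model pairs, $n_1$ and $n_2$ count the pairs tied in the first and second ranking, and $L$ is the number of models. A value of 1 means the two rankings order every untied pair identically.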

Trial Budget Shapes Reliability

Gold-standard agreement curves as the number of stochastic trials increases

The best methods reach Kendall tau near 0.86 on the combined benchmark at a single trial.
$\mathrm{Bayes}_{R0}@1$ wins on AIME'24, AIME'25, and BrUMO'25, but not on HMMT'25.
At N = 1, the greedy prior reduces the standard deviation of agreement by 16% to 52%.
The mean shift is positive on AIME'24, AIME'25, and BrUMO'25, but negative on HMMT'25 and Combined.

High-Budget Ranking

Bump charts showing how ranking methods agree with Bayes U at full budget

Across the four benchmarks, mean Kendall tau versus $\mathrm{Bayes}_{\mathcal{U}}@80$ is 0.93 to 0.95.
Depending on the benchmark, 19 to 34 methods recover exactly the same full-trial ordering.
Divergence concentrates on harder benchmarks and a small set of voting or difficulty-weighted rules.

Alignment Explains When Prior Helps

Rank-alignment diagnostics comparing greedy and stochastic sampling rankings

Greedy-sampling agreement is strongest on BrUMO and weakest on HMMT.
BrUMO is the easiest benchmark and shows the largest positive prior effect.
When greedy decoding ceases to be a faithful proxy, shrinkage becomes bias.

Prior helps when greedy and sampled ranks align

Prior hurts when greedy decoding drifts
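The shrinkage-versus-bias tradeoff can be seen in a one-question toy model. The sketch below assumes a Beta(1, 1) prior and treats a model's greedy outcome as a single pseudo-observation; that is an illustrative assumption, not a statement of how `rank.bayes` uses `R0`, and `posterior_mean` and `simulate` are hypothetical helpers.

```python
import random
from statistics import mean, pstdev


def posterior_mean(successes, trials, prior_successes=0, prior_trials=0):
    """Beta(1, 1) posterior mean, with optional pseudo-observations from a prior."""
    return (1 + successes + prior_successes) / (2 + trials + prior_trials)


def simulate(p_true, greedy, draws=10_000, seed=0):
    """Compare single-trial (N = 1) estimates with and without a greedy pseudo-observation."""
    rng = random.Random(seed)
    uniform, informed = [], []
    for _ in range(draws):
        s = int(rng.random() < p_true)  # one stochastic trial
        uniform.append(posterior_mean(s, 1))
        informed.append(posterior_mean(s, 1, prior_successes=greedy, prior_trials=1))
    return (mean(uniform), pstdev(uniform)), (mean(informed), pstdev(informed))


p_true = 0.9
aligned = simulate(p_true, greedy=1)  # greedy answer agrees with the typical sample
drifted = simulate(p_true, greedy=0)  # greedy answer disagrees with it

for label, ((u_mean, u_sd), (i_mean, i_sd)) in (("aligned", aligned), ("drifted", drifted)):
    print(f"{label}: uniform mean={u_mean:.3f} sd={u_sd:.3f} | "
          f"with prior mean={i_mean:.3f} sd={i_sd:.3f}")
```

In both cases the pseudo-observation narrows the spread of the estimate. When the greedy outcome matches the sampled behavior the estimate also moves toward the true rate; when it does not, the same shrinkage pulls the estimate away from it.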

Categorical Schemes Trade Off Fidelity & Self-Consistency

Tradeoff between gold-standard agreement and self-consistency for categorical schemes

Verifier-only ranking is more self-consistent, but it drifts away from the correctness-based gold standard.
Conservative ranking stays closer to correctness while giving up some self-consistency.
Categorical Bayes ranking is powerful, but the rubric has to be reported and justified.

Choose by target: fidelity or repeatability

Always report: mapping and utility weights

Takeaway: use $\mathrm{Bayes}@N$ for ranking that is:

more stable
uncertainty-aware
category-aware
prior-informed

Acknowledgment: This research was supported in part by NSF awards 2117439 and 2320952.

Setup

Scorio treats repeated-trial evaluation as a tensor $R \in \{0, 1, \dots, C\}^{L \times M \times N}$, where L is the number of models, M is the number of benchmark questions, N is the number of stochastic trials per question, and C + 1 is the number of outcome categories (C = 1 for binary correctness).

The same input powers pointwise estimators such as avg and bayes, paired-comparison methods such as bradley_terry and elo, graph methods such as pagerank, and voting rules such as nanson.

The practical question is not just which method ranks well at N = 80, but which one remains close to the full-budget ranking when the budget is tiny.

Reference ranking: use $\mathrm{Bayes}_{\mathcal{U}}@80$ as the correctness-based gold standard.
Low-budget stress test: recompute rankings after subsampling one trial per question.
Priors: priors and rich signals can stabilize rankings, but they can also move the target.

APIs

Response Tensor: R.shape == (L, M, N)

Low-Budget Prior: R0.shape == (M, D)

Categorical Weights: w.shape == (C + 1,)

Output: (rankings[, scores])

Function	Family	Use
`rank.avg(R)`	Pointwise	Mean correctness across repeated trials.
`rank.bayes(R, R0=None)`	Bayesian	Posterior ranking with optional empirical prior and categorical weights.
`rank.pass_at_k(R, k=3)`	Metric-based	At-least-one-success ranking for a fixed draw budget.
`rank.bradley_terry(R)`	Pairwise	Latent-strength ranking from decisive wins and losses.
`rank.pagerank(R)`	Graph	Ranking from a pairwise win-probability graph.

Rankings are 1-indexed, and return_scores=True exposes the underlying score vector.
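The "at-least-one-success" score in the table is commonly computed with the unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), for c correct answers among n trials. The sketch below applies that estimator per question and averages per model; whether `rank.pass_at_k` uses exactly this form is an assumption, and `pass_at_k` and `model_scores` are hypothetical helpers.

```python
from math import comb
from statistics import mean


def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k draws is correct), given c correct in n trials."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def model_scores(R, k):
    """Average the per-question pass@k estimate for each model."""
    return [mean(pass_at_k(len(trials), sum(trials), k) for trials in model) for model in R]


# Two models, two questions, four binary trials each.
R = [
    [[1, 0, 0, 0], [0, 0, 0, 0]],
    [[1, 1, 0, 0], [1, 0, 0, 0]],
]
scores = model_scores(R, k=2)
print(scores)
```

Because the estimator saturates at 1 once a question has more than n - k correct trials, it rewards occasional success more than consistency, which is one reason it can order models differently from `rank.avg`.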

Ranking families

import numpy as np
from scorio import rank
 
# L models, M questions, N stochastic trials
R = np.random.randint(0, 2, size=(20, 30, 8))
 
avg_ranks, avg_scores = rank.avg(R, return_scores=True)
bayes_ranks = rank.bayes(R)             # Bayes_U@N
bt_ranks = rank.bradley_terry(R)
graph_ranks = rank.pagerank(R)

All methods read the same tensor; only the ranking rule changes.
Use return_scores=True when you need the score vector behind the ranking.
Seventy-two such rules are studied under the same repeated-trial protocol.
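To make "same tensor, different rule" concrete, here is a rough sketch of how a paired-comparison method might read the tensor. It assumes that trial n of one model is compared with trial n of another on the same question and that only decisive outcomes count; that pairing is an assumption for illustration, not necessarily what `rank.bradley_terry` does. Strengths are fitted with the standard MM (Zermelo) iteration.

```python
def decisive_wins(R):
    """wins[i][j] = number of (question, trial) cells where model i is correct and model j is not."""
    L = len(R)
    wins = [[0] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            if i != j:
                for q_i, q_j in zip(R[i], R[j]):
                    wins[i][j] += sum(a > b for a, b in zip(q_i, q_j))
    return wins


def bradley_terry(wins, iters=200):
    """MM iterations for Bradley-Terry strengths, normalized to sum to 1."""
    L = len(wins)
    p = [1.0 / L] * L
    for _ in range(iters):
        updated = []
        for i in range(L):
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(L)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            # A model with no decisive comparisons keeps its current strength.
            updated.append(sum(wins[i]) / denom if denom else p[i])
        total = sum(updated)
        p = [v / total for v in updated]
    return p


# Three models, two questions, four binary trials each.
R = [
    [[1, 1, 1, 0], [1, 1, 0, 1]],
    [[1, 0, 1, 1], [0, 1, 1, 0]],
    [[0, 0, 0, 1], [0, 1, 0, 0]],
]
wins = decisive_wins(R)
strengths = bradley_terry(wins)
print(wins)
print([round(s, 3) for s in strengths])
```

Ties (both correct or both wrong) carry no information here, so two models with similar accuracy but few decisive cells end up with poorly determined strengths, especially at small N.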

Empirical priors

import numpy as np
from scorio import rank
 
R = np.random.randint(0, 2, size=(20, 30, 1))
 
# Shared greedy-decoding prior across models
R0 = np.random.randint(0, 2, size=(30, 1))
 
uniform_ranks = rank.bayes(R) # Bayes_U@1
greedy_ranks = rank.bayes(R, R0=R0) # Bayes_R0@1
conservative = rank.bayes(R, R0=R0, quantile=0.05)

Greedy or pilot outcomes can stabilize low-budget rankings when they are aligned.
Use quantile=0.05 when you want a conservative lower-bound ranking.
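A conservative ranking replaces the posterior mean with a low quantile of the posterior, so a model is penalized for uncertainty. The sketch below assumes independent Beta(1, 1) posteriors per question and a normal approximation for the averaged score; the exact computation behind `quantile=0.05` in scorio may differ, and `beta_posterior_summary` and `lower_bound` are hypothetical helpers.

```python
from math import sqrt
from statistics import NormalDist


def beta_posterior_summary(model):
    """Posterior mean and standard deviation of one model's average accuracy."""
    m = len(model)
    mean_sum = var_sum = 0.0
    for trials in model:
        a = 1 + sum(trials)                 # 1 + successes
        b = 1 + len(trials) - sum(trials)   # 1 + failures
        mean_sum += a / (a + b)
        var_sum += a * b / ((a + b) ** 2 * (a + b + 1))
    return mean_sum / m, sqrt(var_sum) / m


def lower_bound(model, quantile=0.05):
    """Normal-approximation quantile of the posterior over average accuracy."""
    mu, sigma = beta_posterior_summary(model)
    return mu + NormalDist().inv_cdf(quantile) * sigma


# Two models with the same posterior mean but different uncertainty.
polarized = [[1, 1, 1, 1], [0, 0, 0, 0]]  # always right on one question, always wrong on the other
mixed = [[1, 1, 0, 0], [1, 1, 0, 0]]      # coin-flip behavior on both questions

for name, model in (("polarized", polarized), ("mixed", mixed)):
    mu, sigma = beta_posterior_summary(model)
    print(f"{name}: mean={mu:.3f} sd={sigma:.3f} lcb={lower_bound(model):.3f}")
```

Both models tie on the posterior mean, but the lower bound separates them because per-question posteriors near 0.5 are wider than posteriors near 0 or 1.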

Categorical ranking

import numpy as np
from scorio import rank
 
# 0 = invalid, 1 = wrong, 2 = partial, 3 = correct
R_cat = np.random.randint(0, 4, size=(11, 120, 5))
w = np.array([0.0, 0.0, 0.5, 1.0])
 
scheme_ranks, scheme_scores = rank.bayes(
    R_cat,
    w=w,
    return_scores=True,
)
 
scheme_lcb = rank.bayes(R_cat, w=w, quantile=0.05)

Different labeling schemas lead to different rankings.
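A small worked case shows why the weight vector has to be reported. The sketch assumes a uniform Dirichlet(1, ..., 1) prior over the categories of each question and scores a model by its posterior expected utility; this is an illustrative assumption rather than scorio's documented formula, and `expected_utility` is a hypothetical helper.

```python
def expected_utility(R_cat, w):
    """Per-model average of the Dirichlet(1, ..., 1) posterior expected utility."""
    num_categories = len(w)
    scores = []
    for model in R_cat:
        total = 0.0
        for trials in model:
            n = len(trials)
            for c in range(num_categories):
                posterior_prob = (1 + trials.count(c)) / (num_categories + n)
                total += w[c] * posterior_prob
        scores.append(total / len(model))
    return scores


# 0 = invalid, 1 = wrong, 2 = partial, 3 = correct; one question, five trials per model.
R_cat = [
    [[2, 2, 2, 2, 3]],  # mostly partial credit
    [[3, 3, 1, 1, 1]],  # more fully correct answers, but also more wrong ones
]

lenient = expected_utility(R_cat, w=[0.0, 0.0, 0.5, 1.0])  # partial answers earn half credit
strict = expected_utility(R_cat, w=[0.0, 0.0, 0.0, 1.0])   # only fully correct answers count

print("lenient:", [round(s, 3) for s in lenient])
print("strict: ", [round(s, 3) for s in strict])
```

The first model ranks higher under the lenient weights and lower under the strict ones, with no change in the underlying responses.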
