Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

Test-Time Scaling & Ranking

  • The primitive object is a response tensor.
  • Each model-question pair can be sampled repeatedly, so rankings change as the trial budget grows.
  • Methods that look similar at full budget can behave very differently when only one or two trials are available.
  1. Collect repeated trials per question
  2. Apply one ranking family to the tensor
  3. Use $\mathrm{Bayes}_{\mathcal{U}}@80$ as the gold standard
  4. Measure agreement and convergence as N grows
  5. Prefer rules that stay stable at low budget

Distinction: high-budget consensus shows what methods eventually agree on; low-budget stability shows what can be trusted in practice.
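The five steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the tensor is simulated from hypothetical per-model success rates, mean accuracy over all 80 trials stands in for the gold standard, and the helper names (`simulate_tensor`, `mean_accuracy`, `kendall_tau_b`) are invented for this example rather than taken from `scorio`.

```python
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two score lists, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to no term
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


def simulate_tensor(abilities, n_questions, n_trials, seed=0):
    """R[l][m][n] = 1 if model l solved question m on trial n (simulated)."""
    rng = random.Random(seed)
    return [
        [[int(rng.random() < p) for _ in range(n_trials)] for _ in range(n_questions)]
        for p in abilities
    ]


def mean_accuracy(R, n):
    """Score each model by mean correctness over its first n trials."""
    return [sum(sum(q[:n]) for q in model) / (len(model) * n) for model in R]


# Hypothetical success rates for five models; not data from the poster.
abilities = [0.2, 0.35, 0.5, 0.65, 0.8]
R = simulate_tensor(abilities, n_questions=30, n_trials=80)

gold = mean_accuracy(R, 80)  # stand-in for the full-budget gold standard
agreement = {n: kendall_tau_b(mean_accuracy(R, n), gold) for n in (1, 2, 4, 8, 16, 80)}

for n, tau in agreement.items():
    print(f"N={n:2d}  tau_b vs. full budget = {tau:.3f}")
```

Repeating the loop over several seeds or trial subsets would give the spread of agreement at each budget, which is the quantity a low-budget stability comparison needs.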

from scorio import rank

# R is the response tensor of repeated trials per model-question pair
rank.avg(R)
rank.bayes(R, R0=None, quantile=None)
rank.pass_at_k(R, k=3)
rank.bradley_terry(R)
rank.pagerank(R)

Best Low-Budget Methods

$\mathrm{Bayes}_{R0}@1$ wins when greedy decoding is aligned.

Gold target

Closest to $\mathrm{Bayes}_{\mathcal{U}}@80$

Benchmark  Method                             $\tau_b$
AIME'24    $\mathrm{Bayes}_{R0}@1$            0.779 ± 0.034
AIME'25    $\mathrm{Bayes}_{R0}@1$            0.798 ± 0.045
HMMT'25    $\mathrm{Bayes}_{\mathcal{U}}@1$   0.790 ± 0.053
BrUMO'25   $\mathrm{Bayes}_{R0}@1$            0.858 ± 0.028
Combined   $\mathrm{Bayes}_{\mathcal{U}}@1$   0.865 ± 0.049

Repeatability

Best self-consistency

Benchmark  Method                    $\tau_b$
AIME'24    Rasch MML LCB             0.804 ± 0.051
AIME'25    Rasch MML LCB             0.834 ± 0.054
HMMT'25    Rasch MML LCB             0.810 ± 0.056
BrUMO'25   $\mathrm{Bayes}_{R0}@1$   0.858 ± 0.028
Combined   Nanson avg ties           0.892 ± 0.050

Gold agreement: $\mathrm{Bayes}_{R0}@1$ leads on aligned benchmarks; $\mathrm{Bayes}_{\mathcal{U}}@1$ handles drift.

Self-consistency: repeatable winners can differ from correctness-based winners.

Trial Budget Shapes Reliability

Gold-standard agreement curves as the number of stochastic trials increases
  • The best methods reach Kendall tau near 0.86 on the combined benchmark at a single trial.
  • $\mathrm{Bayes}_{R0}@1$ wins on AIME'24, AIME'25, and BrUMO'25, but not on HMMT'25.
  • At N = 1, the greedy prior reduces the standard deviation of agreement by 16% to 52%.
  • The mean shift is positive on AIME'24, AIME'25, and BrUMO'25, but negative on HMMT'25 and Combined.
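To make the role of the greedy prior concrete, the sketch below scores one model from a single sampled trial per question, with and without a greedy outcome. It assumes binary correctness, a Beta(1, 1) prior per question, and that each greedy result enters as one extra pseudo-observation; the poster does not spell out the estimator, so this is an illustrative reading rather than the `scorio` implementation, and `bayes_score` is a hypothetical helper name.

```python
def bayes_score(trials, greedy=None):
    """Posterior-mean accuracy for one model with binary outcomes.

    trials: per-question lists of 0/1 sampled outcomes.
    greedy: optional per-question 0/1 greedy outcomes, each counted as one
            extra pseudo-observation (an assumption of this sketch).
    """
    total = 0.0
    for m, outcomes in enumerate(trials):
        successes, n = sum(outcomes), len(outcomes)
        if greedy is not None:
            successes += greedy[m]
            n += 1
        total += (1 + successes) / (2 + n)  # Beta(1, 1) posterior mean
    return total / len(trials)


# Toy data: four questions, one sampled trial each (N = 1).
one_trial = [[1], [0], [1], [0]]
greedy = [1, 0, 0, 0]

uniform = bayes_score(one_trial)             # per-question values are 1/3 or 2/3
with_prior = bayes_score(one_trial, greedy)  # per-question values are 1/4, 2/4, or 3/4

print(f"uniform prior: {uniform:.4f}")
print(f"greedy prior:  {with_prior:.4f}")
```

With one trial, the uniform-prior score can take only two values per question, while the greedy pseudo-observation adds a third level. That extra information is one plausible route by which a prior could steady single-trial rankings when greedy and sampled behaviour agree.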

High-Budget Ranking

Bump charts showing how ranking methods agree with $\mathrm{Bayes}_{\mathcal{U}}$ at full budget
  • Across the four benchmarks, mean Kendall tau versus $\mathrm{Bayes}_{\mathcal{U}}@80$ is 0.93 to 0.95.
  • Depending on the benchmark, 19 to 34 methods recover exactly the same full-trial ordering.
  • Divergence concentrates on harder benchmarks and a small set of voting or difficulty-weighted rules.

Alignment Explains When Prior Helps

Rank-alignment diagnostics comparing greedy and stochastic sampling rankings
  • Greedy-sampling agreement is strongest on BrUMO and weakest on HMMT.
  • BrUMO is the easiest benchmark and shows the largest positive prior effect.
  • When greedy decoding ceases to be a faithful proxy, shrinkage becomes bias.
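A short expectation calculation illustrates how shrinkage can turn into bias. It assumes, without confirmation from the poster, a binary-outcome estimator with a Beta(1, 1) prior in which one greedy outcome $g \in \{0, 1\}$ per question is counted as a pseudo-observation; $p$ denotes the model's sampled success probability on that question and $N$ the number of sampled trials.

```latex
% Expected per-question score with the greedy pseudo-observation
\mathbb{E}[\hat{\mu}] = \frac{1 + g + N p}{N + 3}

% Expected gap between two models A and B on the same question
\mathbb{E}[\hat{\mu}_A - \hat{\mu}_B]
  = \frac{(g_A - g_B) + N\,(p_A - p_B)}{N + 3}

% Same gap without the greedy term (uniform prior only)
\mathbb{E}[\hat{\mu}_A - \hat{\mu}_B]
  = \frac{N\,(p_A - p_B)}{N + 2}
```

At $N = 1$ the greedy difference carries the same weight as the sampled difference. If $g_A - g_B = -1$ while $p_A > p_B$, the numerator $-1 + (p_A - p_B)$ is at most zero, so the expected ordering on that question is tied or reversed. As $N$ grows, the sampled term dominates and the effect fades.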

Prior helps when greedy and sampled ranks align

Prior hurts when greedy decoding drifts

Categorical Schemes Trade Off Fidelity & Self-Consistency

Tradeoff between gold-standard agreement and self-consistency for categorical schemes
  • Verifier-only ranking is more self-consistent, but it drifts away from the correctness-based gold standard.
  • Conservative ranking stays closer to correctness while giving up some self-consistency.
  • Categorical Bayes ranking is powerful, but the rubric has to be reported and justified.

Choose by target: fidelity or repeatability

Always report mapping and utility weights
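The reason the mapping and utility weights need reporting can be seen in a small sketch: the same category counts can rank two models in opposite orders under different weights. The categories, counts, and weights below are hypothetical, and the estimator (posterior-mean utility under a uniform Dirichlet prior) is an assumed form rather than a confirmed description of the poster's categorical scheme.

```python
def categorical_bayes_score(counts, weights):
    """Posterior-mean utility under a Dirichlet(1, ..., 1) prior.

    counts[j]:  number of trials that landed in category j.
    weights[j]: utility assigned to category j (the rubric).
    """
    total = sum(counts) + len(counts)  # observed trials plus one pseudo-count per category
    return sum(w * (1 + c) / total for w, c in zip(weights, counts))


# Hypothetical rubric categories for 8 trials on a single question.
CATEGORIES = ("correct", "wrong_but_well_formed", "malformed")
model_a = [5, 1, 2]
model_b = [4, 4, 0]

correctness_only = [1.0, 0.0, 0.0]  # only exact correctness earns credit
partial_credit = [1.0, 0.5, 0.0]    # well-formed wrong answers earn half credit

a_strict = categorical_bayes_score(model_a, correctness_only)
b_strict = categorical_bayes_score(model_b, correctness_only)
a_partial = categorical_bayes_score(model_a, partial_credit)
b_partial = categorical_bayes_score(model_b, partial_credit)

print("correctness only:", "A" if a_strict > b_strict else "B", "ranks first")
print("partial credit:  ", "A" if a_partial > b_partial else "B", "ranks first")
```

Because the ordering flips with the weights, a reported categorical ranking is only interpretable alongside the rubric that produced it.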

Takeaway: use $\mathrm{Bayes}@N$ for ranking that is:

  • more stable
  • uncertainty-aware
  • category-aware
  • prior-informed

Acknowledgment: This research was supported in part by NSF awards 2117439 and 2320952.