Case Western Reserve University
National Science Foundation

Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

Why ranking under test-time scaling is different

  • The primitive object is a response tensor, not a single accuracy number.
  • Each model-question pair can be sampled repeatedly, so rankings change as the trial budget grows.
  • Methods that look similar at full budget can behave very differently when only one or two trials are available.
  1. Collect repeated trials per question
  2. Apply one ranking family to the response tensor
  3. Use BayesU@80 as the shared gold standard
  4. Measure agreement and convergence as N changes
  5. Prefer rules that stay stable at low budget
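The pipeline above can be sketched end to end. Everything below is illustrative: the response tensor is synthetic, and a simple Beta(1,1) posterior mean stands in for the BayesU rule, whose exact definition the poster does not spell out.

```python
import numpy as np

# Synthetic response tensor: correctness of each (model, question, trial).
# Shape (M models, Q questions, N trials); entries are 0/1.
rng = np.random.default_rng(0)
skill = np.array([0.3, 0.5, 0.7])            # hypothetical per-model accuracy
tensor = rng.binomial(1, skill[:, None, None], size=(3, 20, 80))

def rank_models(tensor, n_trials):
    """Rank models by Beta(1,1) posterior-mean accuracy over the first
    n_trials (rank 1 = best). A stand-in for BayesU@N, not its exact rule."""
    sub = tensor[:, :, :n_trials]
    successes = sub.sum(axis=(1, 2))
    total = sub.shape[1] * sub.shape[2]
    post_mean = (successes + 1) / (total + 2)  # uniform prior
    return np.argsort(np.argsort(-post_mean)) + 1

gold = rank_models(tensor, 80)   # full-budget "gold standard" ordering
low = rank_models(tensor, 1)     # what a single-trial budget would report
print(gold, low)
```

With 80 trials the posterior means separate cleanly and the ordering matches the latent skills; at N = 1 the same rule runs on twenty responses per model, which is exactly the regime where the poster's low-budget comparisons matter.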

Main distinction: high-budget consensus tells you what methods eventually agree on; low-budget stability tells you what you can trust in practice.

At N = 80, most reasonable methods agree

Bump charts showing how 72 ranking methods agree with BayesU@80 at full budget on BrUMO and HMMT
Agreement between full-trial rankings and BayesU@80 on an easier benchmark (BrUMO) and a harder benchmark (HMMT).
  • Across the four benchmarks, the mean Kendall τb versus BayesU@80 is 0.93 to 0.95.
  • Depending on the benchmark, 19 to 34 methods recover exactly the same full-trial ordering.
  • Divergence concentrates on harder benchmarks and a small set of voting or difficulty-weighted rules.

Most of the practical risk is at N = 1

Agreement with the BayesU gold standard as the number of stochastic trials increases
Agreement improves as the trial budget grows, with the steepest gains in the first few trials.
  • The best methods reach τb ≈ 0.86 on the combined benchmark at a single trial.
  • BayesG@1 wins on AIME'24, AIME'25, and BrUMO'25, but not on HMMT'25 or the pooled benchmark.
  • High self-consistency does not imply closeness to the correctness-based gold standard.
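A convergence curve of this kind can be reproduced on synthetic data. The tensor, model count, and accuracy-based ranking rule below are all illustrative; Kendall τ is computed directly over pairs (with strict orderings, τa and τb coincide).

```python
import numpy as np

def kendall_tau(a, b):
    """Kendall tau over two rankings via pairwise concordance.
    For strict orderings (no ties), this equals tau-b."""
    n, num = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            num += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return num / (n * (n - 1) / 2)

# Synthetic correctness tensor for six hypothetical models.
rng = np.random.default_rng(0)
skill = np.linspace(0.25, 0.75, 6)
tensor = rng.binomial(1, skill[:, None, None], size=(6, 30, 80))

def ranking(n):
    """Rank models by mean accuracy over the first n trials (1 = best)."""
    acc = tensor[:, :, :n].mean(axis=(1, 2))
    return np.argsort(np.argsort(-acc)) + 1

gold = ranking(80)                 # full-budget reference ordering
for n in (1, 2, 4, 8, 80):
    print(n, round(kendall_tau(ranking(n), gold), 3))
```

The printed curve typically shows the poster's qualitative pattern: the steepest gains in agreement come from the first few extra trials, after which τ saturates toward 1.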

Recommendation
BayesU@N is the safe default; add a greedy prior only after checking alignment.

Greedy priors reduce variance, but can also inject bias

Difference in gold-standard agreement between greedy and uniform priors across benchmarks
Greedy priors help on some easier or aligned benchmarks and hurt on the hardest or pooled settings.
  • At N = 1, the greedy prior reduces the standard deviation of agreement by 16% to 52%.
  • The mean shift is positive on AIME'24, AIME'25, and BrUMO'25, but negative on HMMT'25 and Combined.
  • The advantage decays quickly as new stochastic evidence accumulates.
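The variance-bias tradeoff of a greedy prior can be made concrete with Beta shrinkage. This is a sketch of the idea behind BayesG@N, not the poster's exact rule: `strength` pseudo-trials of greedy evidence are blended with the stochastic counts, so the prior dominates at N = 1 and fades as real trials accumulate.

```python
def shrunk_accuracy(successes, trials, greedy_acc, strength=4.0):
    """Posterior-mean accuracy with a Beta prior centred on greedy-decoding
    accuracy. `strength` (pseudo-trial count) is an illustrative choice:
    larger values mean lower variance at small N, but more bias whenever
    greedy decoding misranks the model."""
    alpha = strength * greedy_acc + successes
    beta = strength * (1 - greedy_acc) + (trials - successes)
    return alpha / (alpha + beta)

# At N = 1 the estimate sits close to the greedy accuracy...
print(round(shrunk_accuracy(successes=1, trials=1, greedy_acc=0.2), 3))   # -> 0.36
# ...but the prior's pull decays as stochastic evidence accumulates.
print(round(shrunk_accuracy(successes=40, trials=40, greedy_acc=0.2), 3))  # -> 0.927
```

One stochastic success against a greedy accuracy of 0.2 is pulled well below the raw estimate of 1.0; forty successes nearly erase the prior. This mirrors the poster's finding that the greedy advantage decays quickly with budget.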

Alignment explains when the prior helps

Rank-alignment diagnostics comparing greedy-decoding rankings with stochastic-sampling rankings
Agreement between greedy-decoding and stochastic-sampling rankings is strongest on BrUMO and weakest on HMMT.
  • BrUMO is the easiest benchmark and shows the largest positive prior effect.
  • HMMT is harder, less aligned, and is the setting where BayesG@N hurts agreement with BayesU@80.
  • Difficulty and alignment move together: when greedy decoding ceases to be a faithful proxy, shrinkage becomes bias.
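The "check alignment first" recommendation can be operationalised as a gate: enable the greedy prior only when the greedy and stochastic rankings agree above some threshold. The threshold value and helper name below are illustrative, not from the poster.

```python
import numpy as np

def kendall_tau(a, b):
    """Kendall tau over two strict rankings via pairwise concordance."""
    n, num = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            num += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return num / (n * (n - 1) / 2)

def greedy_prior_is_safe(greedy_rank, stochastic_rank, threshold=0.7):
    """Gate the greedy prior on greedy-stochastic alignment.
    The 0.7 cutoff is an illustrative choice, not the poster's."""
    return kendall_tau(greedy_rank, stochastic_rank) >= threshold

# Aligned setting (BrUMO-like): greedy is a faithful proxy, prior helps.
print(greedy_prior_is_safe([1, 2, 3, 4, 5], [1, 2, 3, 5, 4]))   # True
# Misaligned setting (HMMT-like): shrinkage toward greedy becomes bias.
print(greedy_prior_is_safe([1, 2, 3, 4], [4, 3, 1, 2]))          # False
```

The gate captures the poster's diagnosis in one line: when greedy decoding ceases to be a faithful proxy for stochastic behaviour, the prior should be switched off rather than tuned.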

Categorical schemes trade off fidelity and self-consistency

Scatter plot showing the tradeoff between agreement with the gold standard and self-consistency for categorical schemes
Schemes in the upper-left are stable but drift away from correctness; schemes in the lower-right stay closer to the gold standard.
  • On the combined benchmark at N = 1, conservative ranking reaches τGS = 0.856 and τSelf = 0.861.
  • Verifier-only ranking is more self-consistent (τSelf = 0.897) but drops to τGS = 0.824 against the gold standard.
  • Richer signals are useful when they reflect the real target, not when they are treated as a free upgrade to correctness-only ranking.

Takeaway
Categorical Bayes ranking is powerful, but the rubric has to be reported and justified.