Ranking Reasoning LLMs under Test-Time Scaling
ACL 2026 presentation on ranking reasoning LLMs under test-time scaling: dense repeated-trial evaluation, Bayes@N as a practical default, low-budget priors, categorical ranking, and the Scorio toolkit.
8 items tagged with "Test-Time Scaling"
Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.
Ranking reasoning LLMs under repeated sampling, comparing 72 ranking methods across four Olympiad-style math benchmarks and packaging them in Scorio.
ICLR 2026 presentation on "Don't Pass@k": a Bayesian evaluation framework (Bayes@N) with Dirichlet posteriors, credible intervals, a non-overlap decision rule, categorical rubric scoring, and the Scorio toolkit.
Proposed a Bayesian framework that estimates models' success probabilities with quantified uncertainty, yielding more reliable rankings and enabling categorical evaluation of LLMs.
SCIPE Workshop on LLMs - Day 3
A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.