Ranking Reasoning LLMs under Test-Time Scaling
Ranking reasoning LLMs under repeated sampling, comparing 72 ranking methods across four Olympiad-style math benchmarks and packaging them in Scorio.
Proposed a Bayesian framework that estimates models' success probabilities with quantified uncertainty, yielding more reliable rankings and enabling categorical evaluation of LLMs.
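A minimal sketch of the core idea, assuming a Beta-Binomial model (the prior choice and function names here are illustrative, not the framework's actual implementation): each model's success probability gets a posterior with quantified uncertainty, and models are ranked by posterior mean rather than a raw pass rate.

```python
import numpy as np
from scipy import stats

def posterior_success(successes: int, trials: int, a: float = 1.0, b: float = 1.0):
    """Beta(a, b) prior + Binomial likelihood -> Beta posterior over success probability."""
    post = stats.beta(a + successes, b + trials - successes)
    return post.mean(), post.std()

# Toy repeated-sampling results: (correct answers, total sampled attempts) per model.
results = {"model_A": (41, 64), "model_B": (37, 64), "model_C": (44, 64)}

for name, (s, n) in results.items():
    mean, sd = posterior_success(s, n)
    print(f"{name}: p ~= {mean:.3f} +/- {sd:.3f}")

# Rank by posterior mean; the posterior std indicates how much to trust each rank.
ranking = sorted(results, key=lambda m: -posterior_success(*results[m])[0])
print("ranking:", ranking)
```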
Explore how Item Response Theory (IRT) and other psychometric models can simulate and analyze LLM evaluation datasets. Learn how difficulty, discrimination, and guessing parameters reveal model reasoning patterns, with interactive examples across multiple reading levels.
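For concreteness, the three parameters named above come from the standard three-parameter logistic (3PL) IRT model; this is the textbook formula, not code from the post, and the item parameter values are made up:

```python
import numpy as np

def p_correct_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic IRT model.
    theta: latent ability, a: discrimination, b: difficulty, c: guessing floor."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
# An easy, weakly discriminating item vs. a hard, sharp item with a guessing floor.
easy = p_correct_3pl(abilities, a=0.7, b=-1.5)
hard = p_correct_3pl(abilities, a=2.0, b=1.0, c=0.25)
for th, pe, ph in zip(abilities, easy, hard):
    print(f"theta={th:+.1f}  easy={pe:.2f}  hard={ph:.2f}")
```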
Explore how simulating LLM responses to evaluation datasets with stochastic sampling resembles flipping biased coins, revealing variability, bias, and the importance of multiple trials for reliable benchmarking.
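A quick simulation of the coin-flip analogy (the success rate and benchmark size are hypothetical): each problem is a biased coin with the model's true per-problem success rate, and the spread of the estimated pass rate shrinks as the number of samples per problem grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.35        # hypothetical true per-problem success rate
n_problems = 30      # hypothetical benchmark size

for k in (1, 4, 16, 64):  # samples drawn per problem
    # 1000 simulated evaluation runs: flip k biased coins per problem.
    estimates = rng.binomial(k, true_p, size=(1000, n_problems)).mean(axis=1) / k
    print(f"k={k:3d}: mean={estimates.mean():.3f}, std={estimates.std():.3f}")
```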
A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.
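A sketch of the kind of comparison this enables, again assuming Beta posteriors (the interface is illustrative, not the paper's code): report a credible interval instead of a single unstable Pass@k point estimate, and compare two models by the posterior probability that one outperforms the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def beta_posterior(successes, trials, a=1.0, b=1.0):
    return stats.beta(a + successes, b + trials - successes)

post_a = beta_posterior(41, 64)  # hypothetical model A results
post_b = beta_posterior(37, 64)  # hypothetical model B results

# 95% credible interval for model A's success probability.
lo, hi = post_a.ppf([0.025, 0.975])
print(f"model A: 95% credible interval = [{lo:.3f}, {hi:.3f}]")

# P(A beats B) via posterior sampling -> a statistically grounded comparison.
draws_a = post_a.rvs(100_000, random_state=rng)
draws_b = post_b.rvs(100_000, random_state=rng)
print(f"P(p_A > p_B) ~= {(draws_a > draws_b).mean():.3f}")
```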