Ranking Reasoning LLMs under Test-Time Scaling
ACL 2026 presentation on ranking reasoning LLMs under test-time scaling: dense repeated-trial evaluation, Bayes@N as a practical default, low-budget priors, categorical ranking, and the Scorio toolkit.
Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.
Python environments: how to create and reproduce them, and when to use pip, conda, micromamba, uv, pipx, lockfiles, and containers.
Ranking reasoning LLMs under repeated sampling: a comparison of 72 ranking methods across four Olympiad-style math benchmarks, with the methods packaged in Scorio.
ICLR 2026 presentation on Don't Pass@k: a Bayesian evaluation framework (Bayes@N) with Dirichlet posteriors, credible intervals, a non-overlap decision rule, categorical rubric scoring, and the Scorio toolkit.
Proposed a Bayesian framework that estimates models' success probabilities with quantified uncertainty, yielding more reliable rankings and enabling categorical evaluation of LLMs.
Explore how Item Response Theory (IRT) and other psychometric models can simulate and analyze LLM evaluation datasets. Learn how difficulty, discrimination, and guessing parameters reveal model reasoning patterns, with interactive examples across multiple reading levels.
Explore how simulating LLM responses to evaluation datasets with stochastic sampling is like flipping biased coins—revealing variability, bias, and the importance of multiple trials for reliable benchmarking.
A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.