Scorio

Tag: Scorio

5 items tagged with "Scorio"

Ranking Reasoning LLMs under Test-Time Scaling

July 6, 2026

ACL 2026 presentation on ranking reasoning LLMs under test-time scaling: dense repeated-trial evaluation, Bayes@N as a practical default, low-budget priors, categorical ranking, and the Scorio toolkit.

Paper

Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

April 6, 2026

We compare 72 ranking methods (Bayes@N, Bradley-Terry, Elo, IRT, voting, graph/spectral) for reasoning LLMs under test-time scaling, across 20 models and four Olympiad-style math benchmarks. At full budget the methods mostly agree (Kendall's tau_b 0.93-0.95); with a single trial, a greedy prior cuts variance by 16-52% but can bias the ranking. The methods are packaged in the Scorio toolkit.

Poster

Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

acl-2026

April 6, 2026

Compares 72 ranking methods for reasoning LLMs under repeated sampling across four Olympiad-style math benchmarks, and packages them in Scorio.

Slide

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

January 25, 2026

ICLR 2026 presentation on Don't Pass@k: a Bayesian evaluation framework (Bayes@N) with Dirichlet posteriors, credible intervals, a non-overlap decision rule, categorical rubric scoring, and the Scorio toolkit.

Poster

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

iclr-2026

January 25, 2026

Proposes a Bayesian framework that estimates models' success probabilities with quantified uncertainty, yielding more reliable rankings and enabling categorical evaluation of LLMs.