Test-Time Scaling Under Budget
M.Sc. Thesis in Computer Science
4 items tagged with "Reasoning"
A principled Bayesian framework that replaces Pass@k with posterior estimates, credible intervals, and stable rankings for LLM evaluation
Explore how Item Response Theory (IRT) and other psychometric models can simulate and analyze LLM evaluation datasets. Learn how difficulty, discrimination, and guessing parameters reveal model reasoning patterns, with interactive examples across multiple reading levels.
A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.