Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with robust posterior estimates and credible intervals. This method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.
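As a rough illustration of the posterior idea (a minimal sketch under assumed details, not the paper's exact estimator): with a conjugate Beta-Binomial model and a uniform Beta(1, 1) prior, k observed successes out of n sampled generations give a Beta(1 + k, 1 + n - k) posterior over the success rate, whose mean and credible interval replace the unstable point estimate k/n. The function name, prior choice, and interval level below are illustrative assumptions.

```python
from scipy import stats

def posterior_success_rate(k: int, n: int, alpha: float = 1.0, beta: float = 1.0):
    """Posterior mean and 95% equal-tailed credible interval for the
    success rate, assuming a Beta(alpha, beta) prior (uniform by default)
    and k successes observed in n independent samples."""
    post = stats.beta(alpha + k, beta + (n - k))  # conjugate Beta posterior
    lo, hi = post.ppf(0.025), post.ppf(0.975)    # 95% credible interval
    return post.mean(), (lo, hi)

# Contrast with the raw point estimate k/n, which carries no uncertainty
# and is noisy at small n.
mean, (lo, hi) = posterior_success_rate(k=3, n=10)
print(f"posterior mean = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```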