Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
ICLR 2026 • Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
Proposed a Bayesian framework that estimates models' success probabilities with quantified uncertainty, yielding more reliable rankings and enabling categorical evaluation of LLMs.
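
As a rough illustration of what estimating a success probability with quantified uncertainty can look like, the sketch below uses a textbook Beta-Binomial model with a uniform prior. This is an assumption made for illustration and is not necessarily the estimator used in the paper; the function name `beta_posterior` and its parameters are hypothetical.

```python
from math import sqrt


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of a success probability.

    Assumes a Beta(prior_a, prior_b) prior and independent pass/fail trials
    (illustrative model only; not taken from the paper).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # Conjugate update: the posterior is Beta(a, b).
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(variance)


# Same observed pass rate, different amounts of evidence.
small_mean, small_std = beta_posterior(7, 10)
large_mean, large_std = beta_posterior(70, 100)
print(f"10 trials:  mean={small_mean:.3f}, std={small_std:.3f}")
print(f"100 trials: mean={large_mean:.3f}, std={large_std:.3f}")
```

Under this model, two systems with the same observed pass rate can carry very different uncertainty, which is the kind of information a point estimate alone does not expose when ranking models.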
