Test-Time Scaling

8 items tagged with "Test-Time Scaling"

Ranking Reasoning LLMs under Test-Time Scaling

July 6, 2026

ACL 2026 presentation on ranking reasoning LLMs under test-time scaling: dense repeated-trial evaluation, Bayes@N as a practical default, low-budget priors, categorical ranking, and the Scorio toolkit.

Slide

Serving Reasoning LLMs Efficiently and Reliably [No Anime]

July 6, 2026

Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.

Slide

Serving Reasoning LLMs Efficiently and Reliably

July 6, 2026

Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.

Poster

Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

ACL 2026 Main

April 6, 2026

Ranking reasoning LLMs under repeated sampling, comparing 72 ranking methods across four Olympiad-style math benchmarks and packaging them in Scorio.

Slide

Don't Pass@k: A Bayesian Framework for LLM Evaluation

January 25, 2026

ICLR 2026 presentation on Don't Pass@k: a Bayesian evaluation framework (Bayes@N) with Dirichlet posteriors, credible intervals, a non-overlap decision rule, categorical rubric scoring, and the Scorio toolkit.

Poster

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

ICLR 2026

January 25, 2026

A Bayesian framework that estimates models' success probabilities with quantified uncertainty, yielding more reliable rankings and enabling categorical evaluation of LLMs.

Slide

LLM Research Directions

January 18, 2026

SCIPE Workshop on LLMs - Day 3

Paper

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

October 21, 2025

A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.
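As a rough illustration of what "posterior estimates and credible intervals" can mean in the simplest case, the sketch below treats each trial as pass/fail and places a Beta prior on the success probability. This is an assumption-laden simplification, not the paper's method or the Scorio API: the function name `beta_posterior_summary`, the uniform Beta(1, 1) prior, and the Monte Carlo interval are all hypothetical choices made for this example. The categorical (graded) setting described above would need a Dirichlet posterior instead.

```python
import random


def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0,
                           level=0.95, draws=20000, seed=0):
    """Posterior mean and equal-tailed credible interval for a pass rate.

    Hypothetical helper: binary outcomes with a Beta(prior_a, prior_b) prior.
    The interval is approximated by sampling from the Beta posterior.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    # Conjugate update: Beta prior + binomial data gives a Beta posterior.
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)

    # Approximate the posterior quantiles from sorted Monte Carlo draws.
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * (draws - 1))]
    upper = samples[int((1.0 - tail) * (draws - 1))]
    return mean, lower, upper


if __name__ == "__main__":
    # Same observed pass rate (70%), different numbers of trials.
    for successes, trials in [(7, 10), (70, 100)]:
        mean, lower, upper = beta_posterior_summary(successes, trials)
        print(f"{successes}/{trials}: mean={mean:.3f}, "
              f"95% interval=({lower:.3f}, {upper:.3f})")
```

With the same observed pass rate, the interval narrows as the number of trials grows. That is the sense in which an interval-based comparison can separate two models that a single point estimate cannot.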