Ranking Reasoning LLMs under Test-Time Scaling Accepted to ACL 2026 Main

Hi, I'm Mohsen!
and I love math ❤️.
News
Quantize What Counts: More for Keys, Less for Values Accepted to ACL 2026 Findings
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation Accepted to ICLR 2026
Julia & Python pkgs for the Bayesian framework are out!
vLLM × DFloat11: run your model with 30% less memory!
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float Accepted to NeurIPS 2025
Recent Papers
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
Quantize What Counts: More For Keys, Less For Values
Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava
Recent Posters
Ranking Reasoning LLMs under Test-Time Scaling
ACL 2026 Main • Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary
Studied how to rank reasoning LLMs under repeated sampling, comparing 72 ranking methods across four Olympiad-style math benchmarks and packaging the methods in Scorio.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
ICLR 2026 • Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
Proposed a Bayesian framework that estimates models' success probabilities with quantified uncertainty, yielding more reliable rankings and enabling categorical evaluation of LLMs.
Recent Posts
Entropy of bfloat16 During Training: How Optimizers Shape Weight Distributions
Training, Information Theory, Optimizers
Entropy of bfloat16: 8 Bits Are Doing 2.6 Bits of Work
LLMs, Information Theory, Efficiency
Simulating LLM Evaluation Datasets Using Psychometric Models
Simulation, LLMs, Reasoning
Recent Slides
Virtual Agentic Lab!
AI Agents, LLMs, Science
10-slide paper summary of Swanson et al. (doi:10.1038/s41586-025-09442-9)
LLM Research Directions
LLMs, Reasoning Models, Test-time scaling
SCIPE Workshop on LLMs - Day 3
Tool Use (Function Calling) & RAG
LLMs, Tools, Function Calling
SCIPE Workshop on LLMs - Day 2