Research

Medical Image Spatial Grounding with Semantic Sampling

MIS-Ground is a controlled factorial benchmark that isolates the language-side brittleness behind 3D medical spatial grounding in vision-language models. MIS-SemSam, a training-free semantic-sampling decode rule, lifts Qwen3-VL-32B by 13.06% to 66.5% overall, the best open-weights result and above Gemini 3 Flash.

Multimodal AI Vision-Language Models Medical Imaging Spatial Grounding Benchmarking Evaluation

Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

April 6, 2026

We compare 72 ranking methods (Bayes@N, Bradley-Terry, Elo, IRT, voting, graph/spectral) across 20 models and four Olympiad-style math benchmarks. At full budget the methods mostly agree (Kendall's tau_b 0.93-0.95); with a single trial, a greedy prior cuts variance by 16-52% but can bias the ranking. The methods are packaged in the Scorio toolkit.

Statistics Bayesian LLM Ranking Test-Time Scaling Benchmarking Scorio
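
The agreement figure quoted in the summary above is Kendall's tau_b, a rank correlation that corrects for ties. As a rough illustration only, the sketch below computes tau_b between two score lists in plain Python. The function name `kendall_tau_b` and the example scores are hypothetical and are not taken from the paper or from the Scorio toolkit.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau_b between two equally long score lists (O(n^2) pair scan)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied in both lists: excluded from tau_b
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1       # both lists order this pair the same way
        else:
            discordant += 1       # the lists disagree on this pair
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")

# Made-up scores for four models under two hypothetical ranking methods.
method_a = [0.81, 0.74, 0.66, 0.52]
method_b = [0.79, 0.75, 0.50, 0.61]
print(kendall_tau_b(method_a, method_b))
```

A value near 1 means the two methods order the models almost identically; a value near -1 means the orderings are reversed.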

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

October 21, 2025

A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.

Statistics Bayesian LLMs Inference Reasoning Simulation Test-Time Scaling
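
To make the idea of a posterior estimate with a credible interval concrete, here is a minimal Beta-Bernoulli sketch. It assumes binary pass/fail outcomes and a uniform Beta(1, 1) prior; it is not the paper's estimator, which the summary says also handles graded outcomes. The names `beta_posterior` and `credible_interval` and the counts used are hypothetical.

```python
import random
from math import sqrt

def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of the Beta posterior over a success rate."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

def credible_interval(successes, trials, level=0.95, draws=20000, seed=0,
                      prior_a=1.0, prior_b=1.0):
    """Equal-tailed credible interval estimated from Monte Carlo posterior draws."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    a = prior_a + successes
    b = prior_b + (trials - successes)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[lo_idx], samples[hi_idx]

# Hypothetical example: 7 correct answers out of 10 sampled attempts.
mean, std = beta_posterior(7, 10)
low, high = credible_interval(7, 10)
print(f"posterior mean {mean:.3f}, std {std:.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

Two models can then be compared by checking how much their intervals overlap, rather than by comparing two point estimates.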

Quantize What Counts: More for Keys, Less for Values

Mohsen Hariri, Alan Luo, Weicong Chen, Tianyi Zhang, Qifan Wang, Xiaotian Han, Vipin Chaudhary

October 20, 2025

Key-favored KV-cache quantization for LLMs: theory shows keys have larger norms and should get more bits; empirics show 4b-K/2b-V preserves up to 98.3% accuracy while cutting memory.

Compression LLMs Efficiency Inference Quantization KV Cache
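
The 4b-K/2b-V allocation in the summary above can be illustrated with a generic uniform min-max quantizer. This is a sketch under stated assumptions, not the paper's quantization scheme: the helper names `fake_quantize` and `max_abs_error` are hypothetical, the toy vector stands in for real key/value tensors, and per-group scales and zero-points are ignored in the memory estimate.

```python
def fake_quantize(vec, bits):
    """Quantize to 2**bits uniform levels over [min, max], then dequantize."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)          # constant vector: nothing to quantize
    levels = (1 << bits) - 1      # number of steps between the extreme levels
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in vec]

def max_abs_error(vec, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    return max(abs(a - b) for a, b in zip(vec, fake_quantize(vec, bits)))

# Toy data standing in for one row of a key or value cache.
toy = [i / 10 - 1.5 for i in range(31)]
print("4-bit error:", max_abs_error(toy, 4))
print("2-bit error:", max_abs_error(toy, 2))

# Average bits per cached element with 4-bit keys and 2-bit values,
# relative to a 16-bit baseline (quantization metadata not counted).
avg_bits = (4 + 2) / 2
print("fraction of 16-bit cache size:", avg_bits / 16)
```

With the same total budget, giving the extra bits to whichever tensor is more sensitive to rounding error is the design choice the summary describes.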

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava

October 19, 2025

DFloat11 compresses LLMs to 70% of their original size while maintaining bit-for-bit identical outputs. It is a lossless compression framework with efficient GPU inference that enables running Llama 3.1 405B on a single node.

Compression Efficiency LLMs GPU Lossless
