Research

Medical Image Spatial Grounding with Semantic Sampling

Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura, Mingrui Yang, Xiaojuan Li, Vipin Chaudhary Equal contribution

MIS-Ground is a controlled factorial benchmark that isolates the language-side brittleness behind 3D medical spatial grounding in vision-language models. MIS-SemSam, a training-free semantic-sampling decoding rule, lifts Qwen3-VL-32B by 13.06% to 66.5% overall, the best open-weights result and above Gemini 3 Flash.

Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

We compare 72 ranking methods (Bayes@N, Bradley-Terry, Elo, IRT, voting, graph/spectral) across 20 models and four Olympiad-style math benchmarks. At full budget the methods mostly agree (Kendall's tau_b 0.93-0.95); at one trial a greedy prior cuts variance by 16-52% but can bias the ranking. The methods are packaged in the Scorio toolkit.
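For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Kendall's tau_b between two score lists over the same set of models. The function name `kendall_tau_b` and the example scores are illustrative assumptions; this is not the Scorio API and the numbers are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau_b between two equal-length score lists (hypothetical helper).

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where C and D count
    concordant and discordant pairs, Tx counts pairs tied only in x, and
    Ty counts pairs tied only in y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    # Undefined when one list is constant; report NaN rather than guess.
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up scores for four models under two ranking methods.
method_a = [0.9, 0.8, 0.7, 0.6]
method_b = [0.9, 0.7, 0.8, 0.6]  # models 2 and 3 swap places
tau = kendall_tau_b(method_a, method_b)
```

In this toy case five of the six model pairs are ordered the same way by both methods and one pair is flipped, so tau_b is (5 - 1) / 6. A value of 1 means the two methods order every pair identically, and -1 means they order every pair oppositely.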