Serving Reasoning LLMs Efficiently and Reliably [No Anime]
Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.
6 items tagged with "Quantization"
Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.
Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.
ACL 2026 presentation on Quantize What Counts: More for Keys, Less for Values, explaining key-value norm disparity, key-prioritized quantization, and practical KV-cache compression guidance.
A geometry-driven mixed-precision KV-cache quantization poster showing that keys carry more information than values, so key-favored bit allocation preserves accuracy while reducing memory.
M.Sc. Thesis in Computer Science
Key-favored KV-cache quantization for LLMs: theory shows keys have larger norms and should get more bits; empirics show 4b-K/2b-V preserves up to 98.3% accuracy while cutting memory.