Serving Reasoning LLMs Efficiently and Reliably [No Anime]
Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.
9 items tagged with "Inference"
ACL 2026 presentation on Quantize What Counts: More for Keys, Less for Values, explaining key-value norm disparity, key-prioritized quantization, and practical KV-cache compression guidance.
A geometry-driven mixed-precision KV-cache quantization poster showing that keys carry more information than values, so key-favored bit allocation preserves accuracy while reducing memory.
NeurIPS 2025 presentation on Dynamic-Length Float (DFloat11/DF11): a lossless format that Huffman-codes BFloat16 exponents so each weight averages ~11 bits, cutting model size ~30% with bit-for-bit identical outputs and a GPU kernel that makes compressed inference fast.
NeurIPS 2025 poster on DFloat11: a lossless compression framework that shrinks LLMs and diffusion transformers to ~70% of their size with bit-for-bit identical outputs, plus a GPU kernel that decompresses on the fly.
Explore how simulating LLM responses to evaluation datasets with stochastic sampling is like flipping biased coins—revealing variability, bias, and the importance of multiple trials for reliable benchmarking.
A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.
Key-favored KV-cache quantization for LLMs: theory shows keys have larger norms and should get more bits; empirics show 4-bit keys with 2-bit values (4b-K/2b-V) preserve up to 98.3% accuracy while cutting memory.
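
To make the key-favored allocation above concrete, here is a minimal Python sketch. It is not the authors' method: the uniform min-max quantizer, the helper names (`quantize`, `mse`, `kv_bits_ratio`), and the synthetic Gaussian data (keys drawn with a larger spread than values) are all assumptions chosen for illustration. It shows that tensors with larger magnitudes lose more to low-bit quantization, and how much raw storage a 4b-K/2b-V split needs relative to a 16-bit cache, ignoring scale and zero-point metadata.

```python
import random


def quantize(xs, bits):
    """Uniform min-max quantization to `bits` bits, followed by dequantization."""
    levels = (1 << bits) - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((x - lo) / scale) * scale for x in xs]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kv_bits_ratio(key_bits, value_bits, baseline_bits=16):
    """Fraction of baseline KV-cache storage used (quantization metadata ignored)."""
    return (key_bits + value_bits) / (2 * baseline_bits)


rng = random.Random(0)
# Assumption for illustration only: keys have a larger spread than values.
keys = [rng.gauss(0.0, 4.0) for _ in range(1024)]
values = [rng.gauss(0.0, 1.0) for _ in range(1024)]

err_k4 = mse(keys, quantize(keys, 4))      # keys at 4 bits
err_k2 = mse(keys, quantize(keys, 2))      # keys at 2 bits
err_v2 = mse(values, quantize(values, 2))  # values at 2 bits

print("key error, 4-bit:  ", err_k4)
print("key error, 2-bit:  ", err_k2)
print("value error, 2-bit:", err_v2)
print("storage vs 16-bit: ", kv_bits_ratio(4, 2))  # 0.1875
```

On this synthetic data the 2-bit error is much larger for keys than for values, which is the intuition behind spending the extra bits on keys. Real caches are usually quantized per channel or per token group, so the numbers here are only indicative.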