Serving Reasoning LLMs Efficiently and Reliably [No Anime]
Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.
9 items tagged with "Efficiency"
ACL 2026 presentation on Quantize What Counts: More for Keys, Less for Values, explaining key-value norm disparity, key-prioritized quantization, and practical KV-cache compression guidance.
A geometry-driven mixed-precision KV-cache quantization poster showing that keys carry more information than values, so key-favored bit allocation preserves accuracy while reducing memory.
NeurIPS 2025 presentation on Dynamic-Length Float (DFloat11/DF11): a lossless format that Huffman-codes BFloat16 exponents, bringing each weight down to ~11 bits and cutting model size by ~30% with bit-for-bit identical outputs, plus a GPU kernel that makes compressed inference fast.
NeurIPS 2025 poster on DFloat11: a lossless compression framework that shrinks LLMs and diffusion transformers to ~70% of their size with bit-for-bit identical outputs, plus a GPU kernel that decompresses on the fly.
BFloat16 stores the exponent in 8 bits, but in trained neural networks those 8 bits carry only about 2.6 bits of actual information, regardless of the initialization and training recipe.
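The information content of the exponent field can be estimated as the Shannon entropy of its empirical distribution. The following is a minimal sketch, not the measurement procedure behind the figure above: it uses synthetic Gaussian weights (an assumption, with a hypothetical standard deviation of 0.02) in place of trained model weights, and reads the exponent from the float32 encoding, which shares its 8-bit exponent field with BFloat16 (rounding to BFloat16 is ignored).

```python
import math
import random
import struct
from collections import Counter


def exponent_field(x: float) -> int:
    """Return the 8-bit exponent field of x encoded as float32.

    BFloat16 keeps the same sign and exponent bits as float32 and
    shortens only the mantissa, so the field is the same (ignoring rounding).
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits >> 23) & 0xFF


def exponent_entropy_bits(weights) -> float:
    """Shannon entropy, in bits, of the exponent field over all weights."""
    counts = Counter(exponent_field(w) for w in weights)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Assumption: synthetic Gaussian weights stand in for a trained model.
random.seed(0)
synthetic_weights = [random.gauss(0.0, 0.02) for _ in range(200_000)]

entropy = exponent_entropy_bits(synthetic_weights)
print(f"exponent entropy: {entropy:.2f} bits out of 8 stored")
```

If the sign bit and the 7 mantissa bits are stored as-is, an exponent costing about 2.6 bits gives 1 + 7 + 2.6 ≈ 10.6 bits per weight, which is consistent with the ~11 bits quoted for DFloat11.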
Key-favored KV-cache quantization for LLMs: theory shows that keys have larger norms than values and should therefore get more bits; empirically, 4-bit keys with 2-bit values (4b-K/2b-V) preserve up to 98.3% accuracy while cutting memory.
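The intuition behind key-favored bit allocation can be illustrated with a toy experiment. This is a minimal sketch under stated assumptions, not the method from the work above: it uses plain min-max uniform quantization, and the key and value vectors are synthetic Gaussians whose standard deviations (3.0 and 1.0) are hypothetical values chosen only so that keys have the larger norm.

```python
import random


def quantize_dequantize(vec, bits):
    """Uniform min-max quantization to `bits` bits, followed by dequantization."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((x - lo) / scale) * scale for x in vec]


def squared_error(vec, bits):
    """Total squared reconstruction error after a quantization round trip."""
    deq = quantize_dequantize(vec, bits)
    return sum((a - b) ** 2 for a, b in zip(vec, deq))


# Assumption: synthetic vectors in which keys have larger norms than values.
random.seed(0)
dim = 128
key = [random.gauss(0.0, 3.0) for _ in range(dim)]
value = [random.gauss(0.0, 1.0) for _ in range(dim)]

# Both allocations spend the same budget: 6 bits per key/value element pair.
key_favored = squared_error(key, 4) + squared_error(value, 2)
value_favored = squared_error(key, 2) + squared_error(value, 4)

print(f"4b-K/2b-V total squared error: {key_favored:.2f}")
print(f"2b-K/4b-V total squared error: {value_favored:.2f}")
```

Because the quantization step grows with the range of the vector, the larger-norm tensor dominates the error at low bit widths, so in this toy setting spending the extra bits on keys gives the lower total error for the same memory budget.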
DFloat11 compresses LLMs to about 70% of their original size while maintaining bit-for-bit identical outputs. It is a lossless compression framework with efficient GPU inference, which makes it possible to run Llama 3.1 405B on a single node.