📦 Compression
4 items in category "Compression"
📝 Post
Entropy of bfloat16: 8 Bits Are Doing 2.6 Bits of Work
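The claim in the title can be checked directly: treat each byte of a bfloat16 weight tensor as a symbol and measure its Shannon entropy. A minimal sketch, assuming Gaussian-like weights simulated from float32 (real checkpoints are what give the post's ~2.6-bit figure):

```python
import numpy as np

def shannon_entropy_bits(symbols: np.ndarray) -> float:
    """Shannon entropy in bits per symbol of a discrete symbol array."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Simulate bfloat16 weights: bfloat16 is the top 16 bits of float32.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)  # bfloat16 bit patterns

# High byte = sign + 7 exponent bits; low byte = 1 exponent + 7 mantissa bits.
high_byte = (bf16 >> 8).astype(np.uint8)
low_byte = (bf16 & 0xFF).astype(np.uint8)

# The high byte lands well below 8 bits, because trained-weight exponents
# cluster in a narrow range; the mantissa-heavy low byte is near-uniform.
print(f"high byte: {shannon_entropy_bits(high_byte):.2f} bits")
print(f"low byte:  {shannon_entropy_bits(low_byte):.2f} bits")
```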
📄 Paper
Quantize What Counts: More For Keys, Less For Values ☝️🔑👇🔢
Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
Key-favored KV-cache quantization for LLMs: theory shows keys have larger norms than values and should receive more bits; empirically, 4-bit keys with 2-bit values preserve up to 98.3% of accuracy while cutting cache memory. A sketch of the bit split follows.
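A hedged sketch of that asymmetric allocation, using plain per-token round-to-nearest quantization (illustrative only, not the paper's exact scheme; shapes and names are made up):

```python
import torch

def quantize_dequant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Per-token asymmetric round-to-nearest quantize/dequantize (illustrative)."""
    qmax = 2**bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax)
    return q * scale + lo

# Keys get more bits than values, per the paper's allocation.
keys = torch.randn(8, 128, 64)  # toy (heads, seq, head_dim) shapes
vals = torch.randn(8, 128, 64)
k_hat = quantize_dequant(keys, bits=4)  # 4-bit keys
v_hat = quantize_dequant(vals, bits=2)  # 2-bit values
print("key error:", (keys - k_hat).abs().mean().item())
print("val error:", (vals - v_hat).abs().mean().item())
```

At equal total budget (3 bits average), spending the extra bits on keys rather than values is the paper's point: key error propagates through the attention logits, so it counts for more.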
📄 Paper
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava
DFloat11 compresses LLMs to 70% of their original size while producing bit-for-bit identical outputs: a lossless compression framework with efficient GPU inference that enables running Llama 3.1 405B on a single node.
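The losslessness rests on the same entropy gap flagged in the bfloat16 post above: the sign-and-exponent byte carries far fewer than 8 bits of information, so it can be entropy-coded with no loss. A rough sketch of that idea using a plain Huffman code over simulated weights (not DFloat11's actual format or GPU kernels):

```python
import heapq
from collections import Counter
import numpy as np

def huffman_lengths(data: list[int]) -> dict[int, int]:
    """Return {symbol: code length} for a Huffman code over `data`."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (subtree weight, tiebreak id, {symbol: depth}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in (d1 | d2).items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Simulate bfloat16 weights (top 16 bits of float32 Gaussians).
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=500_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)

# Entropy-code only the skewed high byte; keep the near-uniform low byte raw.
high = ((bf16 >> 8) & 0xFF).tolist()
lengths = huffman_lengths(high)
coded_bits = sum(lengths[s] for s in high) + 8 * len(high)
print(f"compressed size: {coded_bits / (16 * len(high)):.0%} of bfloat16")
```

Coding the low-entropy byte and storing the rest raw lands near 11 bits per weight, roughly 70% of 16, which is consistent with the name DFloat11 and the paper's size figure.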