📦 Compression
4 items in category "Compression"
📝 Post
Entropy of bfloat16: 8 Bits Are Doing 2.6 Bits of Work
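The claim in the title can be checked directly: treat each byte of a bfloat16 weight tensor as a symbol and measure its Shannon entropy. A minimal sketch, assuming Gaussian-like weights simulated from float32 (real checkpoints are what give the post's ~2.6-bit figure):

```python
import numpy as np

def shannon_entropy_bits(symbols: np.ndarray) -> float:
    """Shannon entropy in bits per symbol of a discrete symbol array."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Simulate bfloat16 weights: bfloat16 is the top 16 bits of float32.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)  # bfloat16 bit patterns

# High byte = sign + 7 exponent bits; low byte = 1 exponent + 7 mantissa bits.
high_byte = (bf16 >> 8).astype(np.uint8)
low_byte = (bf16 & 0xFF).astype(np.uint8)

# The high byte lands well below 8 bits, because trained-weight exponents
# cluster in a narrow range; the mantissa-heavy low byte is near-uniform.
print(f"high byte: {shannon_entropy_bits(high_byte):.2f} bits")
print(f"low byte:  {shannon_entropy_bits(low_byte):.2f} bits")
```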
📄 Paper
Quantize What Counts: More For Keys, Less For Values ☝️🔑👇🔢
Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
Key-favored KV-cache quantization for LLMs: theory shows keys have larger norms than values and should receive more bits; empirically, 4-bit keys with 2-bit values preserve up to 98.3% of accuracy while cutting cache memory. A sketch of the bit split follows.
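A hedged sketch of that asymmetric allocation, using plain per-token round-to-nearest quantization (illustrative only, not the paper's exact scheme; shapes and names are made up):

```python
import torch

def quantize_dequant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Per-token asymmetric round-to-nearest quantize/dequantize (illustrative)."""
    qmax = 2**bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax)
    return q * scale + lo

# Keys get more bits than values, per the paper's allocation.
keys = torch.randn(8, 128, 64)  # toy (heads, seq, head_dim) shapes
vals = torch.randn(8, 128, 64)
k_hat = quantize_dequant(keys, bits=4)  # 4-bit keys
v_hat = quantize_dequant(vals, bits=2)  # 2-bit values
print("key error:", (keys - k_hat).abs().mean().item())
print("val error:", (vals - v_hat).abs().mean().item())
```

At equal total budget (3 bits average), spending the extra bits on keys rather than values is the paper's point: key error propagates through the attention logits, so it counts for more.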
📄 Paper
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava
DFloat11 compresses LLMs to 70% of their original size while producing bit-for-bit identical outputs: a lossless compression framework with efficient GPU inference that enables running Llama 3.1 405B on a single node.
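The losslessness rests on the same entropy gap flagged in the bfloat16 post above: the sign-and-exponent byte carries far fewer than 8 bits of information, so it can be entropy-coded with no loss. A rough sketch of that idea using a plain Huffman code over simulated weights (not DFloat11's actual format or GPU kernels):

```python
import heapq
from collections import Counter
import numpy as np

def huffman_lengths(data: list[int]) -> dict[int, int]:
    """Return {symbol: code length} for a Huffman code over `data`."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (subtree weight, tiebreak id, {symbol: depth}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in (d1 | d2).items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Simulate bfloat16 weights (top 16 bits of float32 Gaussians).
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=500_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)

# Entropy-code only the skewed high byte; keep the near-uniform low byte raw.
high = ((bf16 >> 8) & 0xFF).tolist()
lengths = huffman_lengths(high)
coded_bits = sum(lengths[s] for s in high) + 8 * len(high)
print(f"compressed size: {coded_bits / (16 * len(high)):.0%} of bfloat16")
```

Coding the low-entropy byte and storing the rest raw lands near 11 bits per weight, roughly 70% of 16, which is consistent with the name DFloat11 and the paper's size figure.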