📄 Paper
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava
DFloat11 is a lossless compression framework with efficient GPU inference that shrinks LLMs to 70% of their original size while producing bit-for-bit identical outputs, enabling Llama 3.1 405B to run on a single node.
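The "lossless" claim means the decompressed weights match the original BFloat16 weights bit for bit, so inference results are unchanged. Below is a minimal sketch of that property, using zlib as a stand-in codec rather than the paper's dynamic-length float encoding; the helper names `compress_bf16` / `decompress_bf16` are illustrative, not DFloat11's API.

```python
# Minimal sketch, not the paper's implementation: round-trip BFloat16 weights
# through a generic lossless codec (zlib stands in for DFloat11's
# dynamic-length float encoding) and check that every bit is preserved,
# which is what "100% accuracy" / bit-for-bit identical outputs rely on.
import zlib
import torch

def compress_bf16(weights: torch.Tensor) -> bytes:
    """Losslessly compress a BFloat16 tensor to bytes."""
    raw = weights.contiguous().view(torch.uint8).numpy().tobytes()
    return zlib.compress(raw, level=9)

def decompress_bf16(blob: bytes, shape: tuple) -> torch.Tensor:
    """Recover the exact original BFloat16 tensor from the compressed bytes."""
    raw = zlib.decompress(blob)
    flat = torch.frombuffer(bytearray(raw), dtype=torch.uint8)
    return flat.view(torch.bfloat16).reshape(shape)

if __name__ == "__main__":
    w = torch.randn(1024, 4096, dtype=torch.bfloat16)  # stand-in weight matrix
    blob = compress_bf16(w)
    w_restored = decompress_bf16(blob, tuple(w.shape))

    # Bit-for-bit equality: identical weights guarantee identical model outputs.
    assert torch.equal(w.view(torch.uint8), w_restored.view(torch.uint8))

    # The ratio below depends on the stand-in codec; the paper reports ~70%
    # of the original size with the DFloat11 format and GPU-side decoding.
    print(f"compressed size: {len(blob) / (w.numel() * 2):.0%} of original")
```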