GPU

Tag: GPU

3 items tagged with "GPU"

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

December 2, 2025

NeurIPS 2025 presentation on Dynamic-Length Float (DFloat11/DF11): a lossless format that Huffman-codes the exponent bits of BFloat16 weights, bringing each weight down to ~11 bits and cutting model size ~30% with bit-for-bit identical outputs, paired with a GPU kernel that makes compressed inference fast.

Poster

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava

NeurIPS 2025

December 2, 2025

NeurIPS 2025 poster on DFloat11: a lossless compression framework that shrinks LLMs and diffusion transformers to ~70% of their original size with bit-for-bit identical outputs, plus a GPU kernel that decompresses weights on the fly.

Paper

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava

October 19, 2025

DFloat11 is a lossless compression framework that shrinks LLMs to ~70% of their original size while keeping outputs bit-for-bit identical. Its efficient GPU inference makes it possible to run Llama 3.1 405B on a single node.