70% Size, 100% Accuracy

Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)

Tianyi Zhang Mohsen Hariri Shaochen Zhong Vipin Chaudhary Yang Sui Xia Hu Anshumali Shrivastava

Scan for the interactive poster

Lossy compression is a gamble

  • Quantization loses accuracy: 8-bit SmoothQuant drops DeepSeek-R1-Distill-Qwen-1.5B reasoning by 9.09%.
  • Behavior shifts at equal accuracy: 6.37% of GSM8K answers flip under W8A16 GPTQ.
  • Prior lossless coders (ZipNN, NeuZip) shrink checkpoints but barely speed up GPU inference.

DFloat11 is lossless: outputs are bit-for-bit identical, so there is nothing to stress-test per deployment.

BFloat16's exponent is almost empty

BFloat16 bit layout: 1 sign, 8 exponent, 7 mantissa bits
Shannon entropy of sign, exponent, and mantissa bits across LLMs
Across LLMs the 8-bit exponent holds only ~2.6 bits of Shannon entropy; sign and mantissa are near-full.

2.6 / 8exponent bits used

~40 / 256exponent values seen

DFloat11: code the exponent, keep the rest

DFloat11 format: Huffman-coded exponents, fixed sign and mantissa, decoded by a Huffman tree
Huffman-code the exponent to dynamic length; the sign and mantissa stay fixed. 16 bits → ~11 bits, decoded with a Huffman tree.

(1)sign×2exp127×(1.mantissa)(-1)^{\text{sign}} \times 2^{\,\text{exp}-127} \times (1.\text{mantissa})

Decoded on the GPU, on the fly

  • Compact hierarchical LUTs fit in SRAM for table-based decoding.
  • Two-phase kernel with gap & block-output arrays places every thread.
  • Block-level batched decompression hides latency.

Kernel internals, math & ablations → Detail

30% smaller, bit-for-bit identical

ModelBF16 → DF11Ratio
Llama 3.1 8B16.1 → 10.9 GB67.8%
Llama 3.3 70B141 → 95.4 GB67.6%
Llama 3.1 405B812 → 551 GB67.9%
Qwen 3 14B29.5 → 20.1 GB68.2%
FLUX.1 dev23.8 → 16.3 GB68.6%
SD 3.5 Large16.3 → 11.3 GB69.5%

MMLU, TruthfulQA, WikiText & C4: identical to BF16 (~11-bit average width).

2.3–46.2× faster than CPU offloading

Throughput and latency of DF11 on one GPU versus BF16 with CPU offloading across batch sizes
Token throughput (left) and latency (right): DF11 on one GPU vs BF16 with CPU offloading, across batch sizes.

46.2×peak vs CPU offload

2.3×worst case, still ahead

5.7–14.9× longer generation

GPU memory versus decoded tokens for BF16 and DF11 models
Freed memory feeds the KV cache — DF11 decodes 5.7–14.9× more tokens before running out of memory.

14.9×more tokens before OOM

28%less memory (diffusion)

Takeaways

  • 30% smaller, for free
  • Bit-for-bit lossless
  • 405B on one 8×80GB node
  • LLMs & diffusion models