Entropy of bfloat16: 8 Bits Are Doing 2.6 Bits of Work
BFloat16 uses 8 bits to store exponents, but those 8 bits carry only about 2.6 bits of actual information in trained neural networks, regardless of the initialization and training recipe.
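As a rough illustration of how such a measurement can be made (my own sketch, assuming PyTorch; the post's exact methodology may differ), one can histogram the 8-bit exponent field of a weight tensor and compute its Shannon entropy. Swap the random tensor for real trained weights to reproduce the kind of number the post reports.

```python
import torch

def exponent_entropy_bits(weights: torch.Tensor) -> float:
    """Shannon entropy (in bits) of the 8-bit bfloat16 exponent field."""
    # bfloat16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits.
    raw = weights.to(torch.bfloat16).contiguous().view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = (raw >> 7) & 0xFF
    counts = torch.bincount(exponents.flatten(), minlength=256).double()
    p = counts[counts > 0] / counts.sum()
    return -(p * p.log2()).sum().item()

# Stand-in tensor; replace with a trained weight matrix for a real measurement.
print(exponent_entropy_bits(torch.randn(4096, 4096)))
```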
3 items tagged with "Efficiency"
BFloat16 uses 8 bits to store exponents, but those 8 bits carry only about 2.6 bits of actual information in trained neural networks, regardless of the initialization and training recipe.
Key-favored KV-cache quantization for LLMs: theory shows keys have larger norms and should get more bits; empirics show 4b-K/2b-V preserves up to 98.3% accuracy while cutting KV-cache memory (a rough quantization sketch follows after this list).
DFloat11 compresses LLMs to 70% of their original size while maintaining bit-for-bit identical outputs: a lossless compression framework with efficient GPU inference that enables running Llama 3.1 405B on a single node (a generic lossless-compression sketch follows below).
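To make the 4b-K / 2b-V idea concrete, here is a minimal round-to-nearest sketch of asymmetric uniform quantization with different bit widths for keys and values. This is my own illustration, not the paper's quantizer; the cache shapes and per-dimension scaling are assumptions, and the evaluated method likely uses finer-grained (per-channel or per-group) scales.

```python
import torch

def fake_quant(x: torch.Tensor, n_bits: int, dim: int = -1) -> torch.Tensor:
    """Asymmetric uniform quantization along `dim`, immediately dequantized."""
    qmax = 2 ** n_bits - 1
    lo = x.amin(dim=dim, keepdim=True)
    hi = x.amax(dim=dim, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax)
    return q * scale + lo

# Hypothetical cache shapes: (batch, heads, seq_len, head_dim).
k = torch.randn(1, 8, 512, 64)
v = torch.randn(1, 8, 512, 64)
k_hat = fake_quant(k, n_bits=4)   # keys keep more precision (4 bits)
v_hat = fake_quant(v, n_bits=2)   # values are squeezed harder (2 bits)
print((k - k_hat).abs().mean().item(), (v - v_hat).abs().mean().item())
```

The asymmetry follows the stated rationale: keys carry larger norms and dominate attention scores, so they get the extra bits while values absorb most of the memory savings.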
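The following sketch is not DFloat11's actual encoder; it is a generic zlib round trip over the bfloat16 exponent bytes, showing why a lossless gain exists at all: the exponent field has low entropy (as the first post measures), so it compresses well, while the sign and mantissa bits are left untouched in this sketch. The exact reconstruction check is the point of "bit-for-bit identical outputs".

```python
import zlib
import torch

# Stand-in for trained bfloat16 weights.
w = torch.randn(1 << 18).to(torch.bfloat16)

raw = w.view(torch.int16).to(torch.int32) & 0xFFFF   # raw 16-bit patterns
exponents = ((raw >> 7) & 0xFF).to(torch.uint8)      # one exponent byte per weight

payload = bytes(exponents.tolist())
packed = zlib.compress(payload, 9)
assert zlib.decompress(packed) == payload            # exact reconstruction -> lossless
print(f"{8 * len(packed) / len(payload):.2f} bits per exponent after compression")
```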