Test-Time Scaling Under Budget
M.Sc. Thesis in Computer Science
4 items tagged with "Compression"
During training, the entropy of the bfloat16 exponent bits evolves differently depending on the optimizer: Adam increases it, SGD decreases it, and AdamW consistently converges to the ~2.6 bits observed in trained LLMs.
BFloat16 allocates 8 bits to the exponent, but in trained neural networks those 8 bits carry only about 2.6 bits of actual information, regardless of initialization and training recipe.
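The claim is easy to check on any weight tensor: extract the 8-bit exponent field from each bfloat16 pattern and compute the Shannon entropy of its empirical distribution. A minimal sketch (the `exponent_entropy` helper and the Gaussian test tensor are illustrative, not from the thesis):

```python
import numpy as np

def exponent_entropy(weights: np.ndarray) -> float:
    """Shannon entropy (in bits) of the bfloat16 exponent field of `weights`."""
    # bfloat16 is the top 16 bits of float32: 1 sign, 8 exponent, 7 mantissa bits
    patterns = weights.astype(np.float32).view(np.uint32) >> 16
    exponents = ((patterns >> 7) & 0xFF).astype(np.int64)  # 8-bit exponent field
    counts = np.bincount(exponents, minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]                                           # drop unused exponent values
    return float(-(p * np.log2(p)).sum())

# Even at initialization, Gaussian weights concentrate their exponents into
# a narrow band, so the entropy is far below the nominal 8 bits.
w = np.random.default_rng(0).normal(0.0, 0.02, size=1_000_000)
print(f"{exponent_entropy(w):.2f} bits of 8")
```

The exact value depends on the weight distribution; the point is that weight magnitudes span only a few octaves, so most of the 256 possible exponent codes are never used.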
DFloat11 losslessly compresses LLMs to about 70% of their original size while producing bit-for-bit identical outputs. The framework pairs this compression with efficient GPU inference, enabling Llama 3.1 405B to run on a single node.
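The low exponent entropy is exactly what makes lossless compression pay off. A toy illustration, assuming nothing about DFloat11's internals: split each bfloat16 pattern into its high byte (sign plus top exponent bits, highly redundant) and low byte (mostly random mantissa bits), and entropy-code them separately. Here `zlib` stands in for the Huffman coding a real system would use, and the GPU-side decoding is omitted entirely:

```python
import zlib
import numpy as np

# Build bfloat16 bit patterns from a Gaussian weight tensor (illustrative data).
w = np.random.default_rng(0).normal(0.0, 0.02, size=1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)

hi = (bf16 >> 8).astype(np.uint8)    # sign + top 7 exponent bits: low entropy
lo = (bf16 & 0xFF).astype(np.uint8)  # last exponent bit + 7 mantissa bits: near-random

packed_hi = zlib.compress(hi.tobytes(), 9)
packed_lo = zlib.compress(lo.tobytes(), 9)
ratio = (len(packed_hi) + len(packed_lo)) / bf16.nbytes
print(f"compressed to {ratio:.0%} of original size")

# Decompression is bit-for-bit exact: reassemble the original patterns.
hi2 = np.frombuffer(zlib.decompress(packed_hi), np.uint8)
lo2 = np.frombuffer(zlib.decompress(packed_lo), np.uint8)
restored = (hi2.astype(np.uint16) << 8) | lo2
assert np.array_equal(restored, bf16)
```

The high byte compresses heavily while the mantissa byte barely compresses at all, which is why the overall ratio lands well below 1 without losing a single bit.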