Entropy of bfloat16 During Training: How Optimizers Shape Weight Distributions
During training, the entropy of the bfloat16 exponent bits evolves differently depending on the optimizer. Adam increases it, SGD decreases it, while AdamW consistently converges to the ~2.6 bits observed in trained LLMs.
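The entropy in question can be measured directly from the weight bit patterns. A minimal sketch, using NumPy and the fact that bfloat16 is the top 16 bits of an IEEE-754 float32 (so the 8-bit exponent field sits at bits 23-30 of the float32 pattern); the helper name `exponent_entropy` is illustrative, not from the original:

```python
import numpy as np

def exponent_entropy(weights: np.ndarray) -> float:
    """Shannon entropy (in bits) of the 8-bit bfloat16 exponent field.

    bfloat16 shares float32's exponent layout, so the exponent can be
    read from the float32 bit pattern without a bfloat16 dtype.
    """
    bits = weights.astype(np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF            # extract 8 exponent bits
    counts = np.bincount(exponents, minlength=256)
    p = counts[counts > 0] / counts.sum()      # empirical distribution
    return float(-(p * np.log2(p)).sum())

# Gaussian-initialized weights occupy only a few exponent binades,
# so the entropy is well below the 8-bit maximum.
w = np.random.default_rng(0).normal(0.0, 0.02, size=1_000_000)
print(f"exponent entropy: {exponent_entropy(w):.2f} bits")
```

Tracking this quantity across checkpoints is what reveals the optimizer-dependent trends described above.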