Quantize What Counts: More For Keys, Less For Values

  • ¹Case Western Reserve University
  • ²Rice University
  • ³Meta

TL;DR: Keys carry more information than values; consequently, key tensors call for a larger quantization bit-width, a smaller group size, and outlier mitigation (e.g., a Hadamard transformation).

Abstract

Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key–Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.

Theorem I (Key–Value Norm Disparity). Let $\mathrm{W}^K$ and $\mathrm{W}^V$ denote the key and value projection matrices in an attention block. Then

$$\mathbb{E}\left[\lVert \mathrm{W}^K \rVert_{F}\right] > \mathbb{E}\left[\lVert \mathrm{W}^V \rVert_{F}\right].$$
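A quick empirical check of this disparity is to compare the Frobenius norms of the key and value projection weights layer by layer. The sketch below is a minimal illustration assuming a Llama-family checkpoint loaded through Hugging Face Transformers; the model name and the model.model.layers / self_attn.k_proj / self_attn.v_proj paths follow that implementation and would need adapting for other architectures.

```python
# Minimal sketch (assumed model and module paths): per-layer Frobenius norms
# of the key vs. value projection weights of a Llama-style checkpoint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.float32
)

layers = model.model.layers
wins = 0
for i, layer in enumerate(layers):
    fk = torch.linalg.matrix_norm(layer.self_attn.k_proj.weight)  # Frobenius norm by default
    fv = torch.linalg.matrix_norm(layer.self_attn.v_proj.weight)
    wins += int(fk > fv)
    print(f"layer {i:2d}: ||W^K||_F = {fk:.2f}   ||W^V||_F = {fv:.2f}")

print(f"layers where the key norm dominates: {wins}/{len(layers)}")
```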

Theorem II (Key-Prioritized Quantization). Let $(b_K, b_V)$ denote the bit allocations for the key and value caches under a uniform scalar quantizer. For any pair with $b_K > b_V$, the expected inference accuracy is strictly higher than for the swapped allocation $(b_V, b_K)$, provided that

$$\mathbb{E}\left[\lVert K \rVert_{F}^{2}\right] > \mathbb{E}\left[\lVert V \rVert_{F}^{2}\right].$$
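Theorem II's comparison can be probed with a toy experiment: fake-quantize synthetic key and value tensors and measure the total error of a key-favored allocation against its swap. The sketch below uses a simple per-group min–max uniform quantizer and random stand-ins whose energies satisfy the condition above; both choices are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch: total quantization error of (b_K, b_V) = (4, 2) vs. the
# swapped (2, 4) under a per-group uniform asymmetric quantizer. K and V are
# synthetic stand-ins with E[||K||_F^2] > E[||V||_F^2].
import torch

def quantize_uniform(x: torch.Tensor, bits: int, group: int = 64) -> torch.Tensor:
    """Fake-quantize x group-wise: one scale and zero point per group."""
    xg = x.reshape(-1, group)
    lo = xg.min(dim=1, keepdim=True).values
    hi = xg.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / (2 ** bits - 1)
    q = torch.round((xg - lo) / scale)
    return (q * scale + lo).reshape(x.shape)

torch.manual_seed(0)
K = 3.0 * torch.randn(1024, 128)  # higher-energy stand-in for the key cache
V = 1.0 * torch.randn(1024, 128)  # lower-energy stand-in for the value cache

for b_k, b_v in [(4, 2), (2, 4)]:
    mse = (((quantize_uniform(K, b_k) - K) ** 2).mean()
           + ((quantize_uniform(V, b_v) - V) ** 2).mean())
    print(f"(b_K, b_V) = ({b_k}, {b_v}): total MSE = {mse:.4f}")
```

With these stand-ins, the key-favored allocation (4-bit keys, 2-bit values) yields a much lower total error than the swapped allocation, in line with the theorem.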

Singular value distributions over layers

Figure: Singular value spectra of key and value caches in Llama 3.3-70B on C4. The x-axis shows the singular value indices, and the y-axis shows their magnitudes. Shaded regions indicate the minimum–maximum range across attention heads within each layer, while dashed lines denote the mean at each index. Beyond the top singular value (i.e., the spectral norm), key activations consistently exhibit larger singular values than value activations across the spectrum, highlighting their greater representational capacity.
Figure: Spectral norm gap (K > V).
Figure: Frobenius norm gap (K > V).
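The measurement behind these plots can be reproduced from a single forward pass: collect the per-layer KV cache and compute the singular values of each layer's key and value matrices. The sketch below is illustrative and uses a smaller model; the key_cache / value_cache attributes of the returned cache object match recent Hugging Face Transformers versions and may differ in others.

```python
# Minimal sketch (assumed model and cache layout): singular values of the
# key vs. value cache collected from one forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    cache = model(**ids, use_cache=True).past_key_values

for layer, (k, v) in enumerate(zip(cache.key_cache, cache.value_cache)):
    # k, v have shape (batch, kv_heads, seq_len, head_dim); treat each head
    # as a (seq_len x head_dim) matrix and compare its top singular value.
    sk = torch.linalg.svdvals(k.flatten(0, 1).float())
    sv = torch.linalg.svdvals(v.flatten(0, 1).float())
    print(f"layer {layer:2d}: max sigma_1(K) = {sk[:, 0].max():.2f}, "
          f"max sigma_1(V) = {sv[:, 0].max():.2f}")
```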

Ablations

Quantization bit-width impact

Accuracy under different key/value bit allocations; K₄V₂ denotes 4-bit keys and 2-bit values.

Model              Shots   K₂V₂    K₂V₄    K₄V₂    K₄V₄
Llama 3.2-1B-it      1     0.033   0.035   0.338   0.357
Llama 3.2-1B-it      8     0.031   0.031   0.289   0.369
Llama 3.1-8B-it      1     0.511   0.547   0.752   0.754
Llama 3.1-8B-it      8     0.408   0.441   0.770   0.782
Phi 4-14B            1     0.759   0.783   0.913   0.923
Phi 4-14B            8     0.771   0.815   0.927   0.931
DeepSeek R1Q-14B     1     0.772   0.775   0.865   0.867
DeepSeek R1Q-14B     8     0.763   0.792   0.876   0.875
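As a quick sanity check on the retention figures quoted in the abstract, the ratio of the key-favored K₄V₂ setting to the uniform K₄V₄ baseline can be computed directly from the table above:

```python
# Accuracy retained by K4V2 relative to K4V4, per row of the table above.
rows = {
    "Llama 3.2-1B-it, 1-shot":  (0.338, 0.357),
    "Llama 3.2-1B-it, 8-shot":  (0.289, 0.369),
    "Llama 3.1-8B-it, 1-shot":  (0.752, 0.754),
    "Llama 3.1-8B-it, 8-shot":  (0.770, 0.782),
    "Phi 4-14B, 1-shot":        (0.913, 0.923),
    "Phi 4-14B, 8-shot":        (0.927, 0.931),
    "DeepSeek R1Q-14B, 1-shot": (0.865, 0.867),
    "DeepSeek R1Q-14B, 8-shot": (0.876, 0.875),
}
for setting, (k4v2, k4v4) in rows.items():
    print(f"{setting}: {100 * k4v2 / k4v4:.1f}% of K4V4 accuracy")
```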

Group size impact

Figure: Group size comparison.
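The trend can be illustrated with the same uniform quantizer used in the Theorem II sketch: each group shares a single scale and zero point, so smaller groups track the local dynamic range more tightly and the error of low-bit keys shrinks. A minimal sweep, reusing quantize_uniform and the synthetic key tensor K defined there:

```python
# Minimal sketch: effect of the quantization group size on 2-bit key error.
# Assumes quantize_uniform and K from the Theorem II sketch are in scope.
for group in (256, 128, 64, 32):
    mse = ((quantize_uniform(K, bits=2, group=group) - K) ** 2).mean()
    print(f"group size {group:3d}: 2-bit key MSE = {mse:.4f}")
```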

Rotation impact

Figure: Rotation impact heatmap.
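A minimal sketch of why rotation helps, under the assumption of a synthetic key cache with one injected outlier channel: multiplying along the head dimension by an orthogonal Hadamard matrix spreads the outlier's energy across channels before quantization, and the rotation is inverted after dequantization. It reuses quantize_uniform from the Theorem II sketch and scipy.linalg.hadamard (the head dimension must be a power of two).

```python
# Minimal sketch: 2-bit key quantization error with and without a Hadamard
# rotation along the head dimension. Assumes quantize_uniform (above) and scipy.
import torch
from scipy.linalg import hadamard

head_dim = 128
H = torch.tensor(hadamard(head_dim), dtype=torch.float32) / head_dim ** 0.5  # orthogonal

torch.manual_seed(0)
K_out = torch.randn(1024, head_dim)
K_out[:, 7] *= 20.0  # inject an outlier channel, mimicking real key caches

plain = ((quantize_uniform(K_out, 2) - K_out) ** 2).mean()
rotated = ((quantize_uniform(K_out @ H, 2) @ H.T - K_out) ** 2).mean()
print(f"2-bit key MSE without rotation:       {plain:.4f}")
print(f"2-bit key MSE with Hadamard rotation: {rotated:.4f}")
```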

Citation

@misc{hariri2025quantizecountskeysvalues,
      title={Quantize What Counts: More for Keys, Less for Values}, 
      author={Mohsen Hariri and Alan Luo and Weicong Chen and Shaochen Zhong and Tianyi Zhang and Qifan Wang and Xia Hu and Xiaotian Han and Vipin Chaudhary},
      year={2025},
      eprint={2502.15075},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15075}, 
}

Acknowledgments

This research was supported in part by NSF awards 2117439, 2112606, and 2320952.

Contact

For questions or correspondence, please don't hesitate to reach out to the authors.