
Quantize What Counts: More For Keys, Less For Values β˜οΈπŸ”‘πŸ‘‡πŸ”’

Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary

Key-favored KV-cache quantization for LLMs: theory shows keys have larger norms and should get more bits; empirics show 4b-K/2b-V preserves up to 98.3% accuracy while cutting memory.


TL;DR: Keys carry more information than values; consequently, key tensors require a larger quantization bit-width, smaller group sizes, and outlier mitigation (e.g., Hadamard transformation).

Abstract

Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key–Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.

Theorem I (Key–Value Norm Disparity). Let \mathrm{W}^K and \mathrm{W}^V denote the key and value projection matrices in an attention block. Then

\mathbb{E}\left[\lVert \mathrm{W}^K \rVert_{F}\right] > \mathbb{E}\left[\lVert \mathrm{W}^V \rVert_{F}\right].

Theorem II (Key-Prioritized Quantization). Let (b_K, b_V) denote the bit allocations for key and value caches under a uniform scalar quantizer. For any pair with b_K > b_V, the expected inference accuracy is strictly higher than for the swapped allocation (b_V, b_K), provided that

\mathbb{E}\left[\lVert K \rVert_{F}^{2}\right] > \mathbb{E}\left[\lVert V \rVert_{F}^{2}\right].
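Theorem II can be illustrated numerically. The sketch below (not from the paper's released code; synthetic tensors with the key tensor given a larger norm, per Theorem I) applies a uniform min–max scalar quantizer and compares the total squared error of the key-favored allocation K₄V₂ against the swapped K₂V₄:

```python
import numpy as np

def quantize_uniform(x, bits):
    """Uniform scalar quantizer: affine min-max quantization to `bits` bits."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
# Synthetic stand-ins: keys drawn with 3x the scale of values,
# mirroring E[||K||_F^2] > E[||V||_F^2] from Theorem I.
K = 3.0 * rng.standard_normal((64, 128))
V = 1.0 * rng.standard_normal((64, 128))

def total_error(b_K, b_V):
    """Total squared quantization error for a (b_K, b_V) bit allocation."""
    err_K = np.sum((K - quantize_uniform(K, b_K))**2)
    err_V = np.sum((V - quantize_uniform(V, b_V))**2)
    return err_K + err_V

err_key_favored = total_error(4, 2)  # K4V2
err_swapped = total_error(2, 4)      # K2V4
print(err_key_favored < err_swapped)  # key-favored allocation has lower error
```

Since the quantization step scales with each tensor's dynamic range, spending the coarse bit-width on the larger-norm key tensor is exactly the wrong trade; the swapped allocation's error is dominated by the 2-bit keys.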

Singular value distributions over layers

Singular value spectra of key and value caches in Llama 3.3-70B on C4. The x-axis shows the singular value indices, and the y-axis shows their magnitudes. Shaded regions indicate the minimum–maximum range across attention heads within each layer, while dashed lines denote the mean at each index. Beyond the top singular value (i.e., the spectral norm), key activations consistently exhibit larger singular values than value activations across the spectrum, highlighting their greater representational capacity. Companion panels show the per-layer spectral norm gap (K > V) and Frobenius norm gap (K > V).

Ablations

Quantization bit-width impact

KₓVᵧ denotes x-bit keys and y-bit values.

| Model | Shots | K₂V₂ | K₂V₄ | K₄V₂ | K₄V₄ |
|---|---|---|---|---|---|
| Llama 3.2-1B-it | 1 | 0.033 | 0.035 | 0.338 | 0.357 |
| | 8 | 0.031 | 0.031 | 0.289 | 0.369 |
| Llama 3.1-8B-it | 1 | 0.511 | 0.547 | 0.752 | 0.754 |
| | 8 | 0.408 | 0.441 | 0.770 | 0.782 |
| Phi 4-14B | 1 | 0.759 | 0.783 | 0.913 | 0.923 |
| | 8 | 0.771 | 0.815 | 0.927 | 0.931 |
| DeepSeek R1Q-14B | 1 | 0.772 | 0.775 | 0.865 | 0.867 |
| | 8 | 0.763 | 0.792 | 0.876 | 0.875 |

Group size impact
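The TL;DR notes that keys also benefit from smaller quantization group sizes. The sketch below (a toy numpy illustration, not the paper's implementation) shows why: with per-group min–max quantization, smaller groups track local dynamic range more tightly, so a tensor with heterogeneous channel scales incurs less error at group size 32 than at 128:

```python
import numpy as np

def group_quant_err(x, bits, group):
    """Total squared error of per-group min-max quantization with `group`-sized groups."""
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    xq = np.round((x - lo) / scale) * scale + lo
    return np.sum((x - xq)**2)

rng = np.random.default_rng(0)
# Synthetic key-like tensor: 32 channel blocks with widely varying scales.
K = rng.standard_normal(4096) * np.repeat(rng.uniform(0.5, 5.0, 32), 128)

print(group_quant_err(K, 4, 32) < group_quant_err(K, 4, 128))  # smaller groups -> lower error
```

The memory trade-off is that each group stores its own scale/zero-point, so group size tunes accuracy against metadata overhead.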

Rotation impact
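The Hadamard transformation mentioned in the TL;DR mitigates outliers by rotating activations so that a single extreme coordinate is spread across all coordinates, shrinking the dynamic range the quantizer must cover. A minimal numpy sketch (illustrative only; real KV-cache kernels apply the rotation per head and fold it into adjacent projections):

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_err(x, bits=4):
    """Squared error of uniform min-max quantization."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    xq = np.round((x - lo) / scale) * scale + lo
    return np.sum((x - xq)**2)

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
x[7] = 50.0  # inject one outlier channel

H = hadamard(128)
x_rot = H @ x  # rotation spreads the outlier's energy over all 128 coordinates

# H is orthonormal, so squared error in the rotated space equals the error
# after rotating back with H.T; we can compare errors directly.
print(quant_err(x_rot) < quant_err(x))  # rotated tensor quantizes more accurately
```

Because the rotation is orthogonal, it is exactly invertible and changes nothing mathematically at full precision; the benefit appears only once low-bit quantization enters the picture.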