Quantize What Counts

More for Keys, Less for Values

Mohsen Hariri, Alan Luo, Weicong Chen, Tianyi Zhang, Qifan Wang, Xiaotian Han, Vipin Chaudhary

Interactive poster

Keys Need More Bits: K4V2 > K2V4

Key and value spectral norms plus GSM8K accuracy for K2V4 and K4V2
Size  K2V4  K4V2
1B    0.06  0.34
8B    0.55  0.75
14B   0.78  0.91
70B   0.76  0.87

A common practice is to use the same quantization bit width for key and value caches.

98.3%: average $K_4V_4$ accuracy retained by $K_4V_2$ with HQQ

25%: KV-cache memory reduction versus $K_4V_4$
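The 25% figure follows from simple bit counting. The sketch below assumes keys and values hold the same number of elements and ignores per-group scale and zero-point metadata; the function name `memory_reduction` is illustrative and not taken from the authors' code.

```python
def memory_reduction(baseline, candidate):
    """Fractional payload saving of `candidate` versus `baseline`.

    Each argument is a (key_bits, value_bits) pair. Assumes equal element
    counts for keys and values and ignores quantization metadata.
    """
    return 1.0 - sum(candidate) / sum(baseline)


reduction_k4v2 = memory_reduction((4, 4), (4, 2))  # 6 bits instead of 8
reduction_k2v4 = memory_reduction((4, 4), (2, 4))  # same budget, swapped split

print(f"K4V2 vs K4V4: {reduction_k4v2:.0%}")
print(f"K2V4 vs K4V4: {reduction_k2v4:.0%}")
```

Both splits cost the same memory, so the comparison between K4V2 and K2V4 is purely about where the bits go.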

+48 pp: GSM8K gain when the extra bit goes to keys at ultra-low precision

From Norm Gap to Bit Rule

Measure: $\lVert W^K\rVert_F > \lVert W^V\rVert_F$

Bound: $\lVert A-\widehat{A}\rVert_F \propto 2^{-b}\lVert A\rVert_F$

Allocate: $b_K > b_V$
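The bound step can be checked numerically. The following is a minimal sketch using a plain min-max uniform quantizer on a synthetic Gaussian tensor; the quantizer and the data are assumptions for illustration and are not the HQQ setup used in the experiments.

```python
import math
import random


def quantize_dequantize(values, bits):
    """Min-max uniform quantization to 2**bits levels, then dequantization."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def frobenius_norm(values):
    return math.sqrt(sum(v * v for v in values))


def relative_error(values, bits):
    """||A - A_hat||_F / ||A||_F for a flattened tensor."""
    recon = quantize_dequantize(values, bits)
    diff = [v - r for v, r in zip(values, recon)]
    return frobenius_norm(diff) / frobenius_norm(values)


rng = random.Random(0)
tensor = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in for a cache

errors = {b: relative_error(tensor, b) for b in (2, 3, 4, 6, 8)}
for b, e in errors.items():
    print(f"{b}-bit relative error: {e:.4f}")
```

Each added bit roughly halves the relative error, so for a fixed bit-width the absolute error grows with the norm of the tensor being quantized.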

Theorem 1: Key-Value Norm Disparity

Key projection weights carry larger energy than value weights.

$\mathbb{E}\!\left[\lVert W^K \rVert_F\right] > \mathbb{E}\!\left[\lVert W^V \rVert_F\right]$.

Consequence: A one-time weight diagnostic predicts which cache is more fragile.
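A weight diagnostic of this kind could look like the sketch below. The weights here are synthetic random matrices with a deliberately larger key scale, not measurements from any model; `norm_gap_report` and `random_matrix` are hypothetical helpers. With a real checkpoint, the per-layer key and value projection matrices would take their place.

```python
import math
import random


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def random_matrix(rng, rows, cols, std):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def norm_gap_report(layers):
    """Return (||W_K||_F, ||W_V||_F) for each (W_K, W_V) pair."""
    return [(frobenius_norm(wk), frobenius_norm(wv)) for wk, wv in layers]


rng = random.Random(0)
# Synthetic layers: key std 0.04, value std 0.02 (assumed values, for illustration only).
layers = [
    (random_matrix(rng, 16, 32, 0.04), random_matrix(rng, 16, 32, 0.02))
    for _ in range(4)
]

report = norm_gap_report(layers)
key_heavy_layers = sum(1 for k_norm, v_norm in report if k_norm > v_norm)

for idx, (k_norm, v_norm) in enumerate(report):
    print(f"layer {idx}: ||W_K||_F={k_norm:.3f}  ||W_V||_F={v_norm:.3f}")
print(f"layers where keys dominate: {key_heavy_layers}/{len(report)}")
```

The check runs once over the weights and needs no calibration data, which is what makes it usable as an up-front signal for bit allocation.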

Theorem 2: Key-Prioritized Quantization

With a fixed bit budget, the larger-norm cache should receive extra precision.

$\mathbb{E}\!\left[\lVert K \rVert_F^2\right] > \mathbb{E}\!\left[\lVert V \rVert_F^2\right] \Rightarrow \operatorname{Acc}(b_K,b_V) > \operatorname{Acc}(b_V,b_K)$.

Consequence: For the same memory budget, $K_4V_2$ beats the swapped split.
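One way to see why the swap matters, offered as a heuristic sketch rather than the poster's proof: assume each cache's reconstruction error follows the bound $\lVert A-\widehat{A}\rVert_F \propto 2^{-b}\lVert A\rVert_F$ with a shared constant $c$ (an assumption), and compare the total error $\varepsilon$ (a symbol introduced here) for the two ways of splitting a 6-bit budget.

```latex
\begin{align*}
\varepsilon(b_K, b_V) &= c\left(2^{-b_K}\lVert K\rVert_F + 2^{-b_V}\lVert V\rVert_F\right) \\
\varepsilon(2,4) - \varepsilon(4,2) &= c\left(2^{-2} - 2^{-4}\right)\left(\lVert K\rVert_F - \lVert V\rVert_F\right) \\
&= \tfrac{3c}{16}\left(\lVert K\rVert_F - \lVert V\rVert_F\right) > 0
\quad \text{whenever } \lVert K\rVert_F > \lVert V\rVert_F .
\end{align*}
```

Total reconstruction error is only a proxy for task accuracy, so this motivates the ordering rather than establishing it.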

In practice: Keep keys sharp. Compress values first.

Norm Dynamics of KV Weights

Frobenius norms of key and value weights across Llama 3 models
  • Across models, $\lVert W^K \rVert_F$ exceeds $\lVert W^V \rVert_F$ in almost every layer.
  • The asymmetry follows the attention path: keys shape lookup geometry; values carry retrieved content.
  • Equal bit-widths under-protect the higher-energy signal.

Quantization Error (MSE)

MSE quantization error curves for key and value caches from 2-bit to 8-bit precision on Llama 3.3 70B

Matched bits: $\operatorname{MSE}(K_b)>\operatorname{MSE}(V_b)$

Error rule: spend precision on keys first

MSE is the squared Frobenius reconstruction error per cache entry: $\lVert M-\widehat{M}\rVert_F^2/\operatorname{nnz}(M)$. On Llama 3.3-70B with C4, key-cache error stays above value-cache error from 2 to 8 bits, so equal bit-widths leave the dominant distortion in $K$.

Geometry of Key-Value Caches: Singular Value Spectra

Singular value spectra for key and value activations across layers in Llama 3.3 70B

Keys hold broader high-magnitude spectra. The gap persists across layers in Llama 3.3-70B on C4.

Practical reading: protect the addressing channel first; shrink the payload channel after.

Orthogonal Synergy: Fix Key Distortion First

Heatmaps showing rotation strategy effects under mixed-precision quantization
$K_4V_2$ + key-only rotation closely tracks the $K_4V_4$ baseline; rotating both adds little beyond keys.
Bar charts comparing downstream accuracy under different key and value group sizes
With $K_4V_2$, $\mathrm{gs}_K=32$ is best overall; $\mathrm{gs}_V=64$ or $128$ preserves accuracy with less overhead.
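To illustrate why rotating keys helps before quantization, the sketch below applies an orthonormal Hadamard rotation to synthetic keys that carry one outlier channel. The Hadamard choice, the outlier magnitude, and all names are assumptions for illustration; the poster does not specify the rotation used. Rotating queries with the same matrix leaves the attention scores unchanged.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of an n x n orthonormal Hadamard matrix (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    norm = math.sqrt(n)
    return [[x / norm for x in row] for row in h]


def rotate(vectors, rot):
    """Right-multiply each row vector by the rotation matrix."""
    n = len(rot)
    return [[sum(vec[i] * rot[i][j] for i in range(n)) for j in range(n)] for vec in vectors]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


dim = 16
rng = random.Random(0)
keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(8)]
for k in keys:
    k[3] += 40.0  # synthetic outlier channel (assumed, for illustration)
queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]

rot = hadamard(dim)
keys_rot = rotate(keys, rot)
queries_rot = rotate(queries, rot)

max_before = max(abs(x) for k in keys for x in k)
max_after = max(abs(x) for k in keys_rot for x in k)
score_drift = max(
    abs(dot(q, k) - dot(qr, kr))
    for q, qr in zip(queries, queries_rot)
    for k, kr in zip(keys, keys_rot)
)

print(f"max |key entry| before rotation: {max_before:.2f}")
print(f"max |key entry| after rotation:  {max_after:.2f}")
print(f"largest change in q.k scores:    {score_drift:.2e}")
```

Spreading the outlier across channels shrinks the range a min-max quantizer has to cover, which is where the accuracy gain at low key bit-widths would come from.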

Base split ($K_4V_2$): Protect keys; compress values.

Outliers (key-only rotation): Redistribute key outliers; V-only adds little.

Groups ($\mathrm{gs}_K=32,\ \mathrm{gs}_V\ge 64$): Fine key scales; coarse values stay cheap.
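The group-size trade-off can be sketched as error versus metadata overhead. The example assumes a min-max uniform quantizer per contiguous group and 32 bits of metadata per group (a 16-bit scale plus a 16-bit zero-point); both are assumptions for illustration and may differ from the HQQ configuration behind the reported results.

```python
import random


def groupwise_mse(values, bits, group_size):
    """Min-max uniform quantization per contiguous group; returns MSE per element."""
    levels = 2 ** bits - 1
    sq_err = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        for v in group:
            recon = lo + round((v - lo) / scale) * scale
            sq_err += (v - recon) ** 2
    return sq_err / len(values)


def effective_bits(bits, group_size, metadata_bits=32):
    """Payload bits plus amortized per-group metadata (assumed 16-bit scale + 16-bit zero-point)."""
    return bits + metadata_bits / group_size


rng = random.Random(0)
cache = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in for cache entries

for gs in (32, 64, 128):
    mse = groupwise_mse(cache, 4, gs)
    print(f"gs={gs:<3d}  4-bit MSE={mse:.5f}  effective bits={effective_bits(4, gs):.2f}")
```

Under these assumptions, 4-bit keys at group size 32 cost 5 effective bits per element, while 2-bit values at group size 64 or 128 cost 2.5 or 2.25. Smaller groups buy lower error at a higher metadata cost, which is why the finer grouping is reserved for keys.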

Conclusion & Takeaways

1. Quantization error: MSE across bit-widths (C4 · MMLU · GSM8K)

2. Downstream accuracy: Mixed-precision task accuracy (GSM8K · CoQA · EQ-Parseable)

3. Integration study: Rotation and group-size effects (CoQA · GSM8K · EQ-Bench · LongBench)

  • Keys carry more information than values.
  • Keys are quantization-sensitive: assign more bits to KK and fewer bits to VV.
  • Key outliers dominate rotation gains: key-favored bit splits integrate with rotation and group-size choices.
Cache   Bits    Group size  Rotation
Keys    Higher  Smaller     Apply
Values  Lower   Larger      Skip

For non-uniform KV-cache quantization with vLLM and Hugging Face Transformers, see our GitHub repository.

Takeaway: protect the higher-information key path first, then compress values more aggressively.

Acknowledgment: This research was supported in part by NSF awards 2117439, 2112606, & 2320952.