Quantize What Counts: More for Keys, Less for Values

Keys Need More Bits: K4V2 > K2V4

Key and value spectral norms plus GSM8K accuracy for K2V4 and K4V2

Size	GSM8K acc. (K₂V₄)	GSM8K acc. (K₄V₂)
1B	0.06	0.34
8B	0.55	0.75
14B	0.78	0.91
70B	0.76	0.87

A common practice is to use the same quantization bit width for key and value caches.

98.3%: average $K_4V_4$ accuracy retained by $K_4V_2$ with HQQ

25%: KV-cache memory reduction versus $K_4V_4$

+48 pp: GSM8K gain when the extra bit goes to keys at ultra-low precision
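The 25% figure follows from counting payload bits: $K_4V_2$ stores 6 bits per key/value entry pair where $K_4V_4$ stores 8. The sketch below reproduces that arithmetic. The model shape is hypothetical and chosen only for illustration, and per-group scale and zero-point metadata is ignored, so a real deployment would save slightly less.

```python
def kv_cache_bits(num_tokens, num_layers, num_kv_heads, head_dim, bits_k, bits_v):
    """Payload bits for a KV cache, ignoring scale/zero-point metadata."""
    entries_per_cache = num_tokens * num_layers * num_kv_heads * head_dim
    # Keys and values have the same number of entries; only bit-widths differ.
    return entries_per_cache * (bits_k + bits_v)

# Hypothetical model shape (an assumption, not taken from the poster).
shape = dict(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)

k4v4 = kv_cache_bits(bits_k=4, bits_v=4, **shape)
k4v2 = kv_cache_bits(bits_k=4, bits_v=2, **shape)
k2v4 = kv_cache_bits(bits_k=2, bits_v=4, **shape)

reduction = 1 - k4v2 / k4v4
print(f"K4V2 saves {reduction:.0%} of the K4V4 payload")
print("K4V2 and K2V4 use the same memory:", k4v2 == k2v4)
```

Because $K_4V_2$ and $K_2V_4$ cost the same memory, any accuracy gap between them comes from where the bits go, not from how many there are.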

From Norm Gap to Bit Rule

Measure $\lVert W^K\rVert_F > \lVert W^V\rVert_F$

Bound $\lVert A-\widehat{A}\rVert_F \propto 2^{-b}\lVert A\rVert_F$

Allocate $b_K > b_V$

Theorem 1: Key-Value Norm Disparity

Key projection weights carry larger energy than value weights.

$\mathbb{E}\!\left[\lVert W^K \rVert_F\right] > \mathbb{E}\!\left[\lVert W^V \rVert_F\right]$.

Consequence: A one-time weight diagnostic predicts which cache is more fragile.
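A minimal sketch of such a diagnostic: compute the Frobenius norm of each projection matrix and flag the larger one. The matrices here are synthetic stand-ins with assumed scales, and `more_fragile_cache` is a hypothetical helper name; in practice the weights would be read from a model checkpoint, layer by layer.

```python
import math
import random

def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a list-of-rows matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def more_fragile_cache(w_k, w_v):
    """Return 'K' if the key projection has the larger Frobenius norm, else 'V'."""
    return "K" if frobenius_norm(w_k) > frobenius_norm(w_v) else "V"

rng = random.Random(0)
# Synthetic weights (assumed scales); real use would load W^K and W^V per layer.
w_k = [[rng.gauss(0.0, 0.05) for _ in range(64)] for _ in range(64)]
w_v = [[rng.gauss(0.0, 0.02) for _ in range(64)] for _ in range(64)]

print("||W_K||_F =", frobenius_norm(w_k))
print("||W_V||_F =", frobenius_norm(w_v))
print("Give extra bits to:", more_fragile_cache(w_k, w_v))
```

The check touches only weights, so it runs once per model and needs no calibration data.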

Theorem 2: Key-Prioritized Quantization

With a fixed bit budget, the larger-norm cache should receive extra precision.

$\mathbb{E}\!\left[\lVert K \rVert_F^2\right] > \mathbb{E}\!\left[\lVert V \rVert_F^2\right]$
$\Rightarrow \operatorname{Acc}(b_K,b_V) > \operatorname{Acc}(b_V,b_K)$.

Consequence: For the same memory budget, $K_4V_2$ beats the swapped split.
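One way to see why the swap matters is a simplified error proxy, not the theorem's proof. Assume the bound $\lVert A-\widehat{A}\rVert_F \propto 2^{-b}\lVert A\rVert_F$ holds with the same constant $c$ for both caches, and that total distortion is the sum of the two terms (both are assumptions made here for illustration). Comparing the two splits of a 6-bit budget:

```latex
E(b_K, b_V) = c\left(2^{-b_K}\lVert K\rVert_F + 2^{-b_V}\lVert V\rVert_F\right)

E(4,2) - E(2,4)
  = c\left(2^{-4} - 2^{-2}\right)\left(\lVert K\rVert_F - \lVert V\rVert_F\right)
  = -\tfrac{3c}{16}\left(\lVert K\rVert_F - \lVert V\rVert_F\right)
  < 0 \quad \text{whenever } \lVert K\rVert_F > \lVert V\rVert_F .
```

Under this proxy, the split that gives the larger-norm cache the extra bits always has lower total error, and the advantage grows with the norm gap.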

In practice: Keep keys sharp. Compress values first.

Norm Dynamics of KV Weights

Frobenius norms of key and value weights across Llama 3 models

Across models, $\lVert W^K \rVert_F$ exceeds $\lVert W^V \rVert_F$ in almost every layer.
The asymmetry follows the attention path: keys shape lookup geometry; values carry retrieved content.
Equal bit-widths under-protect the higher-energy signal.

Quantization Error (MSE)

MSE quantization error curves for key and value caches from 2-bit to 8-bit precision on Llama 3.3 70B

Matched bits: $\operatorname{MSE}(K_b)>\operatorname{MSE}(V_b)$

Error rule: spend precision on keys first

MSE is the Frobenius reconstruction error per cache entry: $\lVert M-\widehat{M}\rVert_F^2/\operatorname{nnz}(M)$. On Llama 3.3-70B with C4 inputs, key-cache error stays above value-cache error at every bit-width from 2 to 8, so equal bit-widths leave the dominant distortion in $K$.
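The metric can be sketched on synthetic data as follows. The quantizer is a plain min-max round-to-nearest scheme, assumed here for illustration and not necessarily the one behind the plotted curves. The two tensors are random stand-ins whose only difference is scale, and the code divides by the total entry count, which equals $\operatorname{nnz}(M)$ when no entry is exactly zero.

```python
import random

def quantize_dequantize(values, bits):
    """Asymmetric min-max round-to-nearest quantization over one flat list."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1                      # number of quantization steps
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((x - lo) / scale) * scale + lo for x in values]

def mse(values, bits):
    """Squared Frobenius reconstruction error divided by the entry count."""
    recon = quantize_dequantize(values, bits)
    return sum((x - y) ** 2 for x, y in zip(values, recon)) / len(values)

rng = random.Random(0)
# Synthetic caches: keys drawn at a larger scale than values (an assumption).
keys = [rng.gauss(0.0, 3.0) for _ in range(4096)]
values = [rng.gauss(0.0, 1.0) for _ in range(4096)]

for bits in range(2, 9):
    print(f"{bits}-bit  MSE(K)={mse(keys, bits):.6f}  MSE(V)={mse(values, bits):.6f}")
```

In this toy setup the larger-scale tensor has the higher error at every matched bit-width, which is the pattern the figure reports for real key and value caches.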

Geometry of Key-Value Caches: Singular Value Spectra

Singular value spectra for key and value activations across layers in Llama 3.3 70B

Keys hold broader high-magnitude spectra. The gap persists across layers in Llama 3.3-70B on C4.

Practical reading: protect the addressing channel first; shrink the payload channel after.

Orthogonal Synergy: Fix Key Distortion First

Heatmaps showing rotation strategy effects under mixed-precision quantization. $K_4V_2$ with key-only rotation closely tracks the $K_4V_4$ baseline; rotating both adds little beyond keys.

Bar charts comparing downstream accuracy under different key and value group sizes. With $K_4V_2$, $\mathrm{gs}_K=32$ is best overall; $\mathrm{gs}_V=64$ or $128$ preserves accuracy with less overhead.

Base split: $K_4V_2$. Protect keys; compress values.

Outliers: key-only rotation. Redistribute key outliers; V-only adds little.

Groups: $\mathrm{gs}_K=32,\ \mathrm{gs}_V\ge64$. Fine key scales; coarse values stay cheap.
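The group-size trade-off can be illustrated with a small sketch: each group gets its own scale and zero-point, so a smaller group confines an outlier's coarse step to fewer entries but stores more metadata per entry. The data is synthetic with one injected outlier, and the 16-bit scale and 16-bit zero-point are assumptions, not settings from the poster.

```python
import random

def quantize_group(group, bits):
    """Min-max round-to-nearest quantization with one scale per group."""
    lo, hi = min(group), max(group)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((x - lo) / scale) * scale + lo for x in group]

def groupwise_mse(values, bits, group_size):
    """Quantize consecutive groups independently and return the per-entry MSE."""
    recon = []
    for start in range(0, len(values), group_size):
        recon.extend(quantize_group(values[start:start + group_size], bits))
    return sum((x - y) ** 2 for x, y in zip(values, recon)) / len(values)

def metadata_bits_per_entry(group_size, scale_bits=16, zero_bits=16):
    """Overhead of storing one scale and one zero-point per group (assumed widths)."""
    return (scale_bits + zero_bits) / group_size

rng = random.Random(0)
row = [rng.gauss(0.0, 1.0) for _ in range(1024)]
row[100] = 40.0  # single injected outlier

for gs in (32, 64, 128):
    print(gs, groupwise_mse(row, bits=4, group_size=gs), metadata_bits_per_entry(gs))
```

Under these assumptions, $\mathrm{gs}=32$ costs one extra bit per entry and $\mathrm{gs}=128$ a quarter of a bit, which is why spending the finer groups on keys and leaving values coarse fits the bit split.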

Conclusion & Takeaways

1. Quantization error: MSE across bit-widths (C4 · MMLU · GSM8K)

2. Downstream accuracy: mixed-precision task accuracy (GSM8K · CoQA · EQ-Parseable)

3. Integration study: rotation and group-size effects (CoQA · GSM8K · EQ-Bench · LongBench)

Keys carry more information than values.
Keys are quantization-sensitive: assign more bits to $K$ and fewer bits to $V$ .
Key outliers dominate rotation gains: key-favored bit splits integrate with rotation and group-size choices.

Cache	Bits	Group size	Rotation
Keys	Higher	Smaller	Apply
Values	Lower	Larger	Skip
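To illustrate why rotation helps the key path, the sketch below applies an orthonormal Hadamard rotation to a synthetic key vector with one injected outlier, quantizes it to 4 bits, and rotates back. The poster does not name the rotation it uses; the Hadamard transform is an assumption here, as are the outlier magnitude and the min-max quantizer. Only reconstruction error is measured, not attention output.

```python
import math
import random

def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two.
    With 1/sqrt(n) scaling the transform is its own inverse."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]

def quantize(values, bits):
    """Min-max round-to-nearest quantization over the whole vector."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((x - lo) / scale) * scale + lo for x in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
key = [rng.gauss(0.0, 1.0) for _ in range(128)]
key[7] = 50.0  # injected outlier channel (assumed magnitude)

plain = quantize(key, 4)
rotated = hadamard_rotate(quantize(hadamard_rotate(key), 4))  # rotate, quantize, rotate back

print("max |key|          :", max(abs(x) for x in key))
print("max |rotated key|  :", max(abs(x) for x in hadamard_rotate(key)))
print("MSE without rotation:", mse(key, plain))
print("MSE with rotation   :", mse(key, rotated))
```

The rotation spreads the outlier's energy across all coordinates, which shrinks the quantization range and the step size. A tensor without dominant outliers gains much less from the same treatment, consistent with the table's choice to skip rotation for values.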

For non-uniform KV-cache quantization with vLLM and Hugging Face Transformers, see our GitHub repository.

Takeaway: protect the higher-information key path first, then compress values more aggressively.

Acknowledgment: This research was supported in part by NSF awards 2117439, 2112606, & 2320952.