Source code for *More for Keys, Less for Values: Adaptive KV Cache Quantization*.
KVQ can be installed via pip:

```bash
pip install kvq
```
Please note that the NVIDIA `nvcc` compiler is required to build the package. Before installing, ensure that the following dependencies are properly set up on your system:

- `build-essential` or `cmake`
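As a quick sanity check before installing, you can verify from Python that `nvcc` is on your `PATH` and that PyTorch can see a CUDA device (this snippet is illustrative only, not part of the package):

```python
import shutil

import torch

# Check that the nvcc compiler is visible (required to build kvq).
print("nvcc found:", shutil.which("nvcc") is not None)

# Check that PyTorch was built with CUDA support and a GPU is available.
print("CUDA available:", torch.cuda.is_available())
```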
```python
import torch
from kvq import KVQ, KVQCacheConfig

# Assumes 'model' has already been loaded (see the generation example below).
config = KVQCacheConfig(
    nbits_k=4,
    nbits_v=2,
    axis_key=0,
    axis_value=0,
    q_group_size=64,
    residual_length=128,
    compute_dtype=torch.bfloat16,
    backend="quanto",
    device=model.device,
)
kvq = KVQ(config)
```
Alternatively, the same configuration can be passed as a plain dictionary:

```python
kvq_dict = {
    "nbits_k": 4,
    "nbits_v": 2,
    "axis_key": 0,
    "axis_value": 0,
    "q_group_size": 64,
    "residual_length": 128,
    "compute_dtype": torch.float16,
    "backend": "quanto",
    "device": model.device,
}
kvq = KVQ(kvq_dict)
```
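To see why the asymmetric setting (more bits for keys, fewer for values) saves memory, here is a back-of-the-envelope comparison. The arithmetic below is illustrative only; the actual footprint also depends on per-group scale/zero-point overhead and the fp16 residual window:

```python
# Average bits per cached element with asymmetric K/V quantization,
# ignoring per-group metadata and the unquantized residual window.
nbits_k, nbits_v = 4, 2
avg_bits = (nbits_k + nbits_v) / 2  # 3.0 bits per element on average
fp16_bits = 16
print(f"approx. compression vs fp16 cache: {fp16_bits / avg_bits:.1f}x")  # ~5.3x
```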
```python
# Assume 'model' is a transformer-like model (e.g. Llama, Mistral, ...)
# that supports caching past key-value states.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    use_cache=True,
    past_key_values=kvq,
)
print(outputs)
```
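Since `generate` returns token IDs, you will typically decode them back to text with the model's tokenizer. A minimal sketch, assuming a Hugging Face-style `tokenizer` object:

```python
# Decode the generated token IDs to text (assumes a Hugging Face tokenizer).
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text[0])
```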
If you find our method useful, please cite our paper:

```bibtex
@article{hariri2025kvq,
  title={More for Keys, Less for Values: Adaptive KV Cache Quantization},
  author={Hariri, Mohsen and Nguyen, Lam and Chen, Sixu and Zhong, Shaochen and Wang, Qifan and Hu, Xia and Han, Xiaotian and Chaudhary, Vipin},
  journal={arXiv preprint arXiv:2502.15075},
  year={2025}
}
```
We welcome contributions from the research community to improve this work. If you have an idea or would like to report a bug, please open an issue or submit a pull request.
The code is released under the MIT License.