Inference

9 items tagged with "Inference"

Serving Reasoning LLMs Efficiently and Reliably [No Anime]

July 6, 2026

Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.

Slide

Serving Reasoning LLMs Efficiently and Reliably

July 6, 2026

Serving reasoning LLMs efficiently and reliably: lossless DFloat11 compression, KV-cache quantization, and Bayes@N evaluation and ranking under test-time scaling.

Slide

Quantize What Counts: More for Keys, Less for Values

June 12, 2026

ACL 2026 presentation on "Quantize What Counts: More for Keys, Less for Values," explaining key-value norm disparity, key-prioritized quantization, and practical KV-cache compression guidance.

Poster

Quantize What Counts: More for Keys, Less for Values

Mohsen Hariri, Alan Luo, Weicong Chen, Tianyi Zhang, Qifan Wang, Xiaotian Han, Vipin Chaudhary

June 4, 2026

A geometry-driven mixed-precision KV-cache quantization poster showing that keys carry more information than values, so key-favored bit allocation preserves accuracy while reducing memory.

Slide

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

December 2, 2025

NeurIPS 2025 presentation on Dynamic-Length Float (DFloat11/DF11): a lossless format that Huffman-codes the BFloat16 exponent bits, bringing each weight down to ~11 bits and cutting model size by ~30% with bit-for-bit identical outputs, plus a GPU kernel that keeps compressed inference fast.

Poster

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava

NeurIPS 2025

December 2, 2025

NeurIPS 2025 poster on DFloat11: a lossless compression framework that shrinks LLMs and diffusion transformers to ~70% of their original size with bit-for-bit identical outputs, plus a GPU kernel that decompresses weights on the fly.

Post

Simulating LLM Answers to Evaluation Datasets

October 22, 2025

Explores how simulating LLM responses to evaluation datasets with stochastic sampling resembles flipping biased coins, revealing variability, bias, and the need for multiple trials for reliable benchmarking.

Paper

Don’t Pass@𝑘: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

October 21, 2025

A Bayesian framework for evaluating large language models that replaces unstable Pass@k metrics with posterior estimates and credible intervals. The method improves sample efficiency, supports graded outcomes, and enables statistically sound model comparisons.

Paper

Quantize What Counts: More For Keys, Less For Values ☝️🔑👇🔢

Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary

October 20, 2025

Key-favored KV-cache quantization for LLMs: theory shows that keys have larger norms than values and should therefore get more bits; experiments show that 4-bit keys with 2-bit values (4b-K/2b-V) preserve up to 98.3% accuracy while cutting memory.