70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, Vipin Chaudhary, Yang Sui, Xia Hu, Anshumali Shrivastava

Full text of DFloat11: a lossless compression framework that shrinks LLMs to ~70% of their size with bit-for-bit identical outputs and a GPU kernel for efficient compressed inference.

Abstract

Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding, (ii) a two-phase GPU kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and others validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3× to 46.2× higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.7× to 14.9× longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8×80GB GPUs.

Introduction

Foundation models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have demonstrated remarkable capabilities across a wide range of Natural Language Processing (NLP) [1] and Computer Vision (CV) tasks [2]. However, their huge model sizes create substantial obstacles for efficient deployment, especially in memory-constrained environments. For example, a competitive recent LLM, Llama 3.1 405B [3], has 405 billion parameters in 16-bit Brain Float (BFloat16) format and requires about 810 GB of memory for full inference, exceeding the capacity of a typical high-end GPU server (e.g., DGX A100/H100 with 8×80GB GPUs). As a result, deploying this model requires multiple nodes, making it expensive and inaccessible. In this work, we present a solution that compresses any BFloat16 model to approximately 70% of its original size while preserving 100% of its accuracy on any task.

Model compression via quantization has limitations.

Quantization is a type of lossy compression method that lowers the precision of model weights by converting them into lower bit-width representations [4, 5, 6, 7]. Although it can significantly reduce memory usage and often improve inference speed, quantization is not a one-size-fits-all solution and presents several key limitations: (1) Accuracy degradation. By design, quantization introduces approximation errors. The degree of accuracy loss depends on multiple factors, including the base model, quantization method, evaluation benchmark, and target bit-width [8]. These interactions make it difficult to predict or quantify the impact comprehensively. Even mild quantization can noticeably degrade performance. For example, applying 8-bit SmoothQuant [9] to DeepSeek-R1-Distill-Qwen-1.5B [10] results in a 9.09% drop in average accuracy across reasoning tasks [11]. (2) Behavioral shifts. Even when overall accuracy metrics appear roughly unchanged, quantized models may behave differently from their full-precision counterparts. For instance, Dutta et al. [12] observe a phenomenon called flips, where quantized models produce answers that change from correct to incorrect and vice versa. This indicates that quantization can significantly alter model behavior, even when standard accuracy metrics show minimal change. For example, the W8A16 GPTQ-quantized Qwen2-1.5B [4, 13] exhibits only a 0.3% drop in GSM8K (8-shot) accuracy [14], yet 6.37% of its answers flip in correctness [12]. (3) Compliance and reliability concerns. In domains like finance or healthcare, quantized models may not satisfy regulatory or reliability standards, as their outputs may differ from those of the original models [15]. We refer readers to Appendix 9 for a more detailed discussion on quantization.

Existing lossless model compression does not support efficient GPU inference.

Unlike lossy compression, lossless compression reduces model size while preserving the full precision of the original weights. This ensures the model's output distribution remains identical to that of the uncompressed counterpart. However, most existing lossless methods focus on storage efficiency, such as compressing model checkpoints [16, 17], or target specialized hardware like FPGAs [18], rather than accelerating inference on general-purpose GPUs. While useful for tasks like checkpoint rollback during large-scale training [19] or reducing download time from model hubs [17], these methods offer little to no benefit for GPU-based inference.

Our proposal, Dynamic-Length Float (DFloat11), is a lossless compression framework optimized for efficient GPU inference.

We identify a key inefficiency in the commonly used BFloat16 format: its 8-bit exponent field carries only about 2.6 bits of actual information. This redundancy is consistent across a wide range of LLMs, as shown in Section 2.2. To exploit it, we apply Huffman coding [20] to the exponent bits of BFloat16 weights, while leaving the sign and mantissa bits uncompressed. The resulting exponents have dynamic-length encodings: frequent values are assigned shorter codes, while rarer ones use longer codes. However, standard Huffman decoding relies on sequential bit-by-bit tree traversal, which is inefficient on GPUs due to limited parallelism. Assigning one GPU thread per decompression task leads to severe hardware underutilization and high latency. To overcome this, we design a hardware-aware algorithm that enables efficient online decompression of dynamic-length floats on GPUs. Our solution includes three key components: (i) compact, hierarchical lookup tables (LUTs) that fit in GPU SRAM to support fast, table-based Huffman decoding, (ii) a two-phase GPU kernel that uses lightweight auxiliary variables to coordinate thread-level read and write operations, and (iii) batched decompression at the transformer-block level to maximize throughput.

We summarize our contributions as follows:

  1. We propose Dynamic-Length Float (DFloat11), a losslessly compressed floating-point format that reduces BFloat16 weights to approximately 11 bits. This yields around 30% model size reduction with bit-for-bit identical outputs.
  2. We develop optimized, hardware-aware algorithms for efficient GPU inference with DFloat11-compressed models by leveraging GPU memory and compute hierarchies.
  3. We evaluate DFloat11 across popular LLMs and diffusion transformers, including Llama 3, Qwen 3, Mistral 3, DeepSeek R1 Distilled, FLUX.1, and Stable Diffusion 3.5 [3, 21, 22, 10, 23, 24]. Our method consistently achieves around 30% size reduction without altering the original outputs. Notably, it enables running Llama-3.1-405B on a single node (8×80GB A100 GPUs), reducing hardware requirements by half without accuracy loss.

Method

In this section, we introduce our proposed floating-point format, Dynamic-Length Float (DFloat11), along with its custom decompression kernel designed for efficient GPU inference.

Preliminary

Brain Float (BFloat16)

Recent state-of-the-art LLMs predominantly employ the 16-bit Brain Float format (BFloat16 or BF16) for storing weights, due to its balance of numerical precision with memory efficiency. BF16 allocates its 16 bits as follows: 1 sign bit, 8 exponent bits, and 7 mantissa bits. The numerical value represented by a BF16 number is computed as:

$$(-1)^{\text{sign}} \times 2^{\text{exponent} - 127} \times (1.\text{mantissa}), \tag{1}$$

where mantissa is interpreted as a binary fractional value.
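As a minimal sketch of Equation (1), the following Python splits a 16-bit pattern into its three fields and evaluates the formula. The helper names are illustrative, the evaluation covers only normal numbers (exponent field neither 0 nor 255), and `float_to_bf16_bits` truncates rather than rounds, which is an assumption made for brevity.

```python
import struct

def bf16_fields(bits):
    """Split a 16-bit BF16 pattern into (sign, exponent, mantissa)."""
    sign = (bits >> 15) & 0x1       # 1 bit
    exponent = (bits >> 7) & 0xFF   # 8 bits
    mantissa = bits & 0x7F          # 7 bits
    return sign, exponent, mantissa

def bf16_value(bits):
    """Evaluate Equation (1); valid for normal numbers only."""
    sign, exponent, mantissa = bf16_fields(bits)
    # "1.mantissa" is 1 plus the 7-bit mantissa read as a binary fraction.
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 128)

def float_to_bf16_bits(x):
    """Keep the top 16 bits of the float32 encoding (truncation, not rounding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16
```

For instance, the pattern `0x3FC0` has sign 0, exponent 127, and mantissa 64, so Equation (1) gives 1.5.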

Entropy Coding

Entropy coding is a core technique in lossless data compression that leverages statistical redundancy to reduce data size. Several widely used methods fall under this category, including Huffman coding [20], arithmetic coding [25], and Asymmetric Numeral Systems (ANS) [26]. Among these, Huffman coding is one of the most widely adopted. It uses variable-length encoding to minimize the size of encoded data, assigning shorter binary codes to more frequent symbols and longer codes to less frequent ones. The codes are decoded using a prefix-free binary tree, known as a Huffman tree. Because no Huffman code is a prefix of any other, the encoded bitstream is uniquely decodable without delimiters. The tree is constructed from symbol frequencies and is provably optimal among prefix codes that encode one symbol at a time, for any given frequency distribution. However, decoding Huffman codes in a massively parallel manner is challenging due to its inherently sequential nature.
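The construction described above can be sketched in a few lines of Python. This is a generic textbook Huffman builder, not the authors' implementation; the function names and the tie-breaking rule are assumptions, and the sequential `decode` only illustrates why decoding is hard to parallelize.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Return {symbol: bitstring} for a {symbol: count} table with >= 2 symbols."""
    # Heap entries: (total count, unique tie-breaker, {symbol: partial code}).
    heap = [(count, i, {sym: ""})
            for i, (sym, count) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c0, _, left = heapq.heappop(heap)    # two least frequent subtrees
        c1, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c0 + c1, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, codes):
    return "".join(codes[s] for s in symbols)

def decode(bits, codes):
    """Sequential decoding: extend the current prefix until it matches a code."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:   # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return out
```

Each decoded symbol determines where the next code begins, which is the sequential dependency that the GPU kernel has to work around.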

GPU Computation and Memory Paradigm

GPUs are designed to perform computations in a massively parallel manner. A modern GPU runs thousands of threads concurrently, which are organized into blocks and executed on streaming multiprocessors (SMs). Each block has access to a small, fast, on-chip memory called shared memory (often referred to as SRAM), which provides much lower latency and higher bandwidth than the off-chip global memory, commonly known as high-bandwidth memory (HBM). The capacity of shared memory is limited, typically up to 100 KB per block. In this work, we leverage the fast access characteristics of SRAM to enable efficient on-the-fly decompression of compressed weights during inference.

Figure 1. (Left) The allocation of bits for the components of BFloat16. (Right three panels) The Shannon entropy of the components (sign, exponent, mantissa) of BFloat16 weights in various LLMs.

Motivation: BFloat16 Representation is Information Inefficient

To motivate the lossless compression of LLM weights, we analyze the compressibility of the BFloat16 weights of recent LLMs. Specifically, we use Shannon entropy to quantify the information content of BFloat16 components (sign, exponent, and mantissa) for all linear projection matrices of an LLM. The Shannon entropy H(⋅) is defined as:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \tag{2}$$

where $X$ is a discrete random variable with support $\mathcal{X}$, and $p: \mathcal{X} \to [0,1]$ denotes its probability mass function. We present the computed entropy values in Figure 1. As shown, the entropy of the sign and mantissa bits is close to their respective bit widths, indicating limited potential for compression. In contrast, the exponent exhibits significantly lower entropy, approximately 2.6 bits versus its allocated 8 bits, suggesting substantial opportunities for lossless compression.
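A minimal sketch of this measurement follows, assuming the weights are available as a list of raw 16-bit BF16 patterns. The helper names are illustrative, and the empirical frequencies stand in for the probability mass function in Equation (2).

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Empirical Shannon entropy in bits of a sequence of discrete values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def exponent_field(bf16_bits):
    """Extract the 8-bit exponent from a 16-bit BF16 pattern."""
    return (bf16_bits >> 7) & 0xFF

def exponent_entropy(bf16_patterns):
    """Entropy of the exponent field over a collection of BF16 patterns."""
    return shannon_entropy([exponent_field(b) for b in bf16_patterns])
```

A uniform distribution over all 256 exponent values would give 8 bits. A distribution concentrated on a few values gives far less, which is the situation reported here for LLM weights.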

To understand this discrepancy, we visualize the frequency distribution of all BFloat16 components in Figure 8 and the ranked frequency of exponent values in Figure 9, both in the Appendix. The sign and mantissa values are relatively uniform across their ranges, but the exponent distribution is highly imbalanced: only about 40 of the 256 possible 8-bit values are used, with the rest never appearing. Ranked frequencies also decay rapidly. These observations reveal the low entropy of the exponent and its potential for compression.
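As a rough back-of-the-envelope estimate, assuming the sign and mantissa are stored verbatim and only the exponent is entropy coded, the measured exponent entropy bounds the achievable bits per weight:

```latex
\underbrace{1}_{\text{sign}} + \underbrace{7}_{\text{mantissa}}
  + \underbrace{H(\text{exponent})}_{\approx 2.6}
  \approx 10.6 \ \text{bits per weight},
\qquad
\frac{10.6}{16} \approx 66\% .
```

A symbol-by-symbol Huffman code has an expected length between $H$ and $H + 1$ bits, so a practical encoding lands slightly above this bound, in line with the roughly 11 bits targeted by the format.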

Figure 2. Our proposed format Dynamic-Length Float for compressing BFloat16 weights of LLMs losslessly down to 11 bits. The exponents are compressed via Huffman coding, while the sign and mantissa bits remain uncompressed.

Dynamic-Length Float: Lossless LLM Compression for Efficient GPU Inference

To address the substantial information inefficiency in the BFloat16 representation of LLM weights, we propose a lossless compression framework that encodes floating-point parameters using entropy coding. Specifically, we build a Huffman tree based on the distribution of exponents in model weights. We then compress the exponents using Huffman coding, while preserving the original signs and mantissas. Exponents are encoded and tightly bit-packed into a byte array, EncodedExponent, while the sign and mantissa are left uncompressed and stored in a separate byte array PackedSignMantissa. Figure 2 illustrates Dynamic-Length Float (DFloat11 or DF11), our proposed format for compactly representing BFloat16 model parameters.
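The split into two streams can be sketched as follows. The exact layout of each `PackedSignMantissa` byte is not specified above, so placing the sign in the most significant bit and the mantissa in the low 7 bits is an assumption, as are the helper names.

```python
def split_bf16(bits):
    """Split a 16-bit BF16 pattern into (exponent, sign_mantissa byte)."""
    exponent = (bits >> 7) & 0xFF
    # Assumed layout: sign in bit 7, mantissa in bits 0..6.
    sign_mantissa = ((bits >> 8) & 0x80) | (bits & 0x7F)
    return exponent, sign_mantissa

def merge_bf16(exponent, sign_mantissa):
    """Inverse of split_bf16: rebuild the original 16-bit pattern."""
    return ((sign_mantissa & 0x80) << 8) | (exponent << 7) | (sign_mantissa & 0x7F)

def pack_bits(bitstring):
    """Pack a string of '0'/'1' characters into bytes, zero-padding the last byte."""
    padded = bitstring + "0" * (-len(bitstring) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
```

Under these assumptions, the exponents would be Huffman-coded and passed through `pack_bits` to form `EncodedExponent`, while the `sign_mantissa` bytes are concatenated unchanged. Because `merge_bf16` inverts `split_bf16` for every 16-bit pattern, the round trip is lossless.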

The Core Challenge: Efficient GPU Inference with Compressed Weights

While DFloat11 enables lossless compression of LLM weights, efficient GPU inference remains a key challenge. Entropy-coded weights use variable-length encoding and cannot be directly used in matrix multiplications. As a result, each weight matrix must be decompressed on-the-fly to its original BFloat16 format before matrix multiplication, then discarded immediately after use to conserve memory. However, traditional Huffman decoding is inherently sequential, requiring bit-by-bit tree traversal for each element, which is ill-suited for GPUs' parallel architecture. Naively assigning a single thread for decompression leads to poor utilization and high latency. Addressing this bottleneck is essential for practical compressed inference.

In the following paragraphs, we present our solution in detail: a set of hardware-aware algorithmic designs tailored for low-latency decoding of entropy-coded weights in a massively parallel manner. Our approach consists of three key components: (1) leveraging compact lookup tables that fit within GPU SRAM for efficient, lookup-based decoding, (2) introducing a two-phase kernel design to coordinate read/write operations for all threads using lightweight auxiliary variables, and (3) performing decompression at the transformer block level to minimize latency.

Figure 3. (Left) The Huffman tree is decomposed into a set of non-overlapping subtrees, each corresponding to a compact lookup table (LUT). These hierarchical LUTs reside in GPU SRAM to enable efficient Huffman decoding via array lookups. (Right) Each thread decodes n bytes of encoded exponents. The array Gaps stores the bit offset of the first element assigned to each thread, while the array Block Output Positions stores the index of the first element for each thread block.

Efficient Decoding with Hierarchical Lookup Tables

The traditional approach to decoding Huffman codes involves reading the encoded bitstream bit by bit and traversing the Huffman tree accordingly. However, this method is inefficient on GPUs due to frequent branching and limited parallelism. To enable efficient decoding on GPUs, we adopt a lookup-table-based approach [27].

Assume the maximum Huffman code length is $L$, and we construct a lookup table LUT of size $2^L$. At each index $i$, LUT stores the decoded exponent whose Huffman code matches the prefix of the binary representation of $i$. To decode the next exponent, we read the next $L$ bits from the encoded bitstream, interpret them as an index into LUT, and retrieve the corresponding value. To determine how many bits to advance in the stream, we use a secondary lookup table CodeLengths, which maps each exponent to the length of its Huffman code. A detailed example of this decoding process is provided in Section 17 of the Appendix.
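The monolithic table can be sketched in Python as below. The bitstream is modeled as a string of '0'/'1' characters for readability, and the function names are illustrative rather than taken from the released kernel.

```python
def build_lut(codes):
    """codes: {symbol: bitstring}, prefix-free. Returns (LUT, CodeLengths, L)."""
    L = max(len(c) for c in codes.values())
    lut = [None] * (1 << L)
    code_lengths = {}
    for sym, code in codes.items():
        pad = L - len(code)
        base = int(code, 2) << pad
        # Every L-bit index that starts with this code decodes to sym.
        for low in range(1 << pad):
            lut[base + low] = sym
        code_lengths[sym] = len(code)
    return lut, code_lengths, L

def lut_decode(bits, count, lut, code_lengths, L):
    """Decode `count` symbols by repeated L-bit lookups."""
    padded = bits + "0" * L          # allow a full L-bit read at the tail
    out, pos = [], 0
    for _ in range(count):
        sym = lut[int(padded[pos:pos + L], 2)]
        out.append(sym)
        pos += code_lengths[sym]     # advance by the true code length
    return out
```

Each symbol costs one table read and one length lookup, with no per-bit branching. The price is the $2^L$ table size.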

In practice, the value of $L$ can be large. For LLMs, $L$ typically ranges from 24 to 32, resulting in a LUT with up to $2^{32}$ entries, which cannot fit within GPU SRAM for fast lookups. To address this, we decompose the monolithic LUT into a hierarchy of compact lookup tables [27]. We first partition the Huffman tree into non-overlapping subtrees of height 8. Each subtree corresponds to a compact LUT that decodes 8 bits, requiring only $2^8 = 256$ entries.

Figure 3 shows an example of how a Huffman tree of height 4 can be decomposed into a hierarchy of compact LUTs, each with 4 entries. Because the LUTs are organized hierarchically, some entries must serve as references to other LUTs lower in the hierarchy. We take advantage of the sparsity in 8-bit exponent usage: although 256 values are available, typically only around 40 are used in LLMs (see Figure 9 in the Appendix). We repurpose unused values (specifically, the range 240 to 255) as pointers to other LUTs. These values correspond to extremely large magnitudes ($\pm 2^{113}$ to $\pm 2^{128}$) that do not occur in LLM weights, making them safe for use as internal markers.

We use k to denote the number of compact LUTs. In our experiments, we observe that k ranges from 4 to 8 for the Huffman trees built from BFloat16 exponent values. Combined with CodeLengths, these LUTs occupy at most (8+1)×256 bytes of memory, which easily fits within SRAM and allows for fast repeated lookups.
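The following Python sketches one possible construction of the hierarchy. How pointer values map to tables is not specified above, so the convention `pointer = 240 + table index` is an assumption, as are the function names. The subtree height is a parameter so that small examples stay readable; the paper's setting is 8.

```python
def build_hierarchical_luts(codes, height=8, ptr_base=240):
    """codes: {exponent: bitstring}, prefix-free. Returns a list of compact LUTs."""
    luts = [[0] * (1 << height)]

    def fill(table_id, items):
        deeper = {}
        for sym, bits in items:
            if len(bits) <= height:
                # Code ends inside this table: fill every matching index.
                pad = height - len(bits)
                base = int(bits, 2) << pad
                for low in range(1 << pad):
                    luts[table_id][base + low] = sym
            else:
                # Code continues: group by its next `height` bits.
                deeper.setdefault(bits[:height], []).append((sym, bits[height:]))
        for prefix, rest in deeper.items():
            luts.append([0] * (1 << height))
            child = len(luts) - 1
            assert ptr_base + child <= 255, "ran out of pointer values"
            luts[table_id][int(prefix, 2)] = ptr_base + child  # assumed encoding
            fill(child, rest)

    assert all(sym < ptr_base for sym in codes), "exponent collides with pointers"
    fill(0, list(codes.items()))
    return luts

def hierarchical_decode(bits, count, luts, code_lengths, height=8, ptr_base=240):
    """Decode `count` exponents, following pointers from table to table."""
    padded = bits + "0" * (height * len(luts))
    out, pos = [], 0
    for _ in range(count):
        table, offset = 0, pos
        while True:
            value = luts[table][int(padded[offset:offset + height], 2)]
            if value < ptr_base:
                break
            table, offset = value - ptr_base, offset + height
        out.append(value)
        pos += code_lengths[value]
    return out
```

With height 8 and at most 8 tables plus CodeLengths, the storage is (8+1)×256 = 2,304 bytes, matching the bound stated above.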

Two-Phase Kernel and Lightweight Auxiliary Variables

To leverage the parallel processing capabilities of GPUs, we assign each thread to a contiguous, non-overlapping block of encoded exponents consisting of n bytes (n=8 in our experiments). Each thread decodes elements whose Huffman codes begin within its assigned block. Since Huffman codes are variable-length, a thread may need to skip some bits at the start before decoding the first element. Similarly, the last element may span beyond the assigned byte range.

This approach introduces two key challenges: (1) The starting bit position for each thread is unclear due to the variable-length nature of Huffman codes. (2) Except for the first thread, the index of decoded elements is unknown, making it difficult to determine their correct output locations.

To address the first issue, we use a gap array [27] to specify the starting bit offset for each thread. The array Gaps has one entry per thread, where each entry indicates the offset of the first valid Huffman code relative to the thread's assigned starting byte. With a maximum code length of 32 bits, each offset lies in [0,31] and is stored using only 5 bits.

For the second issue, maintaining an output position for each thread is straightforward but memory-intensive. Each position requires a 32-bit integer, and with tens of thousands of threads per weight matrix, this leads to significant overhead, undermining DFloat11's compression benefits. To reduce this overhead, we store the output position only for the first element of each thread block rather than for every thread. Since each block typically contains hundreds to thousands of threads, this optimization reduces the overhead from one 32-bit integer per thread to one per block, making the memory cost negligible. Figure 3 illustrates how the gap and block-level output position arrays encode the metadata associated with the encoded exponents.

To support this design, we implement a two-phase kernel. In the first phase, each thread decodes its assigned block and counts the number of elements, without writing to the HBM. Afterward, threads within a block synchronize to compute per-thread output positions via a prefix sum over the element counts. We use the Blelloch algorithm [28] for this step. In the second phase, each thread re-decodes the same block, this time writing decoded values to a write buffer in SRAM at the calculated positions. To avoid redundant global memory access, the encoded exponents are loaded into SRAM before the first pass. Once all decoded exponents are written to SRAM, a single batch of coalesced writes is issued to HBM. Pseudocode for the two-phase kernel is provided in Algorithm 1 of the Appendix.
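The step between the two phases can be illustrated with a sequential Python rendering of the Blelloch exclusive scan. On the GPU the inner loops run in parallel across threads. The function names are illustrative, and this is a sketch of the scan only, not of the full kernel in Algorithm 1.

```python
def blelloch_exclusive_scan(counts):
    """Exclusive prefix sum using the up-sweep / down-sweep structure."""
    n = 1
    while n < len(counts):
        n *= 2
    a = list(counts) + [0] * (n - len(counts))   # pad to a power of two

    # Up-sweep (reduce): build partial sums in place.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a[:len(counts)]

def thread_output_positions(block_output_position, counts):
    """Output index of each thread's first element within one thread block."""
    return [block_output_position + s for s in blelloch_exclusive_scan(counts)]
```

Here `counts` holds the per-thread element counts from the first phase, and `block_output_position` is the stored index of the block's first element. The second phase then writes each thread's decoded values starting at its computed position.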

Transformer-Block-Level Decompression

We now have a complete recipe for decompressing entropy-coded exponents in a massively parallel manner. During inference, the LLM weights stored in DFloat11 format, along with auxiliary variables (the thread-level gap array and block-level output position array), reside entirely in GPU memory. When a weight matrix is needed for matrix multiplication, it is decompressed on-the-fly into the original BFloat16 format. Once the matrix multiplication is complete, the BFloat16 matrix is immediately discarded to conserve GPU memory.

In practice, decompressing a single weight matrix often underutilizes GPU resources due to its relatively small size. As the matrix size increases, decompression throughput improves. Figure 7 illustrates this trend, showing how DFloat11 decompression scales with matrix size. To capitalize on this, we propose batching the decompression of multiple matrices together to improve throughput and hide latency. Specifically, we decompress all DFloat11 weight matrices within a transformer block as a single batch. This batched decompression occurs right before the forward pass of the transformer block. We also compress the token embedding and language modeling head of LLMs. Since these matrices are large enough to saturate GPU resources, batching their decompression is unnecessary.

Table 1. DF11 statistics for various models. Model sizes are shown before and after compression.

| Model | Original Size | DF11 Compressed Size | Compression Ratio | Avg. Bit Width |
|---|---|---|---|---|
| Large Language Models | | | | |
| Llama 3.1 8B Instruct | 16.06 GB | 10.90 GB | 67.84% | 10.85 |
| Llama 3.3 70B Instruct | 141.11 GB | 95.40 GB | 67.61% | 10.82 |
| Llama 3.1 405B Instruct | 811.71 GB | 551.22 GB | 67.91% | 10.87 |
| Qwen 3 14B | 29.54 GB | 20.14 GB | 68.17% | 10.91 |
| QwQ 32B | 65.53 GB | 44.65 GB | 68.14% | 10.90 |
| Mistral Nemo Instruct | 24.50 GB | 16.59 GB | 67.74% | 10.84 |
| Mistral Small 3 | 47.14 GB | 31.86 GB | 67.58% | 10.81 |
| Phi 4 Reasoning Plus | 29.32 GB | 19.83 GB | 67.64% | 10.82 |
| DeepSeek R1 Distill Llama 8B | 16.06 GB | 10.89 GB | 67.81% | 10.85 |
| Diffusion Transformers | | | | |
| FLUX.1 dev | 23.80 GB | 16.33 GB | 68.61% | 10.98 |
| FLUX.1 schnell | 23.78 GB | 16.31 GB | 68.58% | 10.97 |
| Stable Diffusion 3.5 Large | 16.29 GB | 11.33 GB | 69.52% | 11.12 |
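The last two columns of Table 1 appear to be related by the 16-bit width of BFloat16: the reported average bit width matches the compression ratio multiplied by 16. As a worked check for the first row, under that reading:

```latex
\text{Avg. Bit Width} \approx 16 \times \text{Compression Ratio},
\qquad
16 \times 0.6784 \approx 10.85 \ \text{bits (Llama 3.1 8B Instruct)} .
```

The same relation reproduces the other rows, for example $16 \times 0.6952 \approx 11.12$ for Stable Diffusion 3.5 Large.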

Experiments

Table 2. Comparison of accuracy and perplexity for the BF16 and DF11 models on different benchmarks. DF11 compression results in absolutely no loss in accuracy or perplexity.

| Model | Data Type | MMLU (accuracy) | TruthfulQA (accuracy) | WikiText (perplexity) | C4 (perplexity) |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | BF16 | 68.010 ± 0.375 | 36.965 ± 1.690 | 8.649 | 21.677 |
| Llama 3.1 8B Instruct | DF11 (Ours) | 68.010 ± 0.375 | 36.965 ± 1.690 | 8.649 | 21.677 |

Table 3. Comparison of peak GPU memory usage and text-to-image generation time for diffusion transformers in BF16 and DF11, using a single A5000 GPU.

| Model | Peak GPU Memory (GB), BF16 | Peak GPU Memory (GB), DF11 (Ours) | Generation Time (s), BF16 | Generation Time (s), DF11 (Ours) |
|---|---|---|---|---|
| Stable Diffusion 3.5 Large | 16.44 | 11.78 | 66.36 ± 0.13 | 69.08 ± 0.11 |
| FLUX.1 dev | 23.15 | 16.72 | 74.41 ± 0.15 | 78.53 ± 0.18 |

We empirically evaluate the effectiveness of DF11 compression and its GPU inference efficiency. A range of recent LLMs and DMs are compressed from their original BFloat16 format into DF11, and we report the resulting compression ratios. We then compare the inference performance of DF11-compressed models against their uncompressed counterparts across different GPUs, followed by an ablation study to analyze the impact of compression.

Software and Hardware. We implement the DF11 decompression kernel in CUDA and C++, and integrate it into the HuggingFace Transformers [29] inference framework. We evaluate the inference efficiency of our DF11 models against the original BF16 counterparts. We use the HuggingFace Accelerate framework to support CPU offloading and multi-GPU inference. To assess the performance of the DF11 kernel across different hardware configurations, we run experiments on multiple machines with varying GPU and CPU setups. The hardware specifications for all experimental machines are provided in Table 4 in the Appendix.

Results

Figure 4. Comparison of throughput (top row) and latency (bottom row) for token decoding using the original BF16 models and their DF11-compressed counterparts. Portions of the BF16 models are offloaded to the CPU due to GPU memory constraints.

Figure 5. Comparison of GPU memory consumption between BF16 models and DF11 counterparts. The DF11 models support 5.70 - 14.86× longer context lengths by allowing more GPU memory to be used for storing the KV cache. "O.O.M." means out of memory.

DF11 compresses models to 70% size.

Table 1 presents the compression ratios of DF11 for a wide selection of recent LLMs and DMs. Specifically, we apply compression to all weight matrices and token embeddings in LLMs and all weight matrices in the transformer blocks of DMs. The models we compress include Llama 3.1/3.3 [3], Qwen 3 [13], Mistral Nemo/Small [30, 22], Phi 4 [31], DeepSeek R1 Distilled [10], Stable Diffusion 3.5 [24], and FLUX.1 [23]. DF11 compresses all models to approximately 70% of their original size, corresponding to an effective bit width of around 11 bits.

Accuracy and perplexity evaluations confirm DF11 compression is lossless.

We verify the lossless property of DF11 compression through a series of accuracy and perplexity evaluations on standard benchmarks. Evaluations are conducted using lm_evaluation_harness [32], reporting accuracy on MMLU [33] and TruthfulQA [34], and word-level perplexity on WikiText [35] and C4 [36]. The results are shown in Table 2. As demonstrated, the compressed model achieves identical accuracy and perplexity to the original BF16 counterpart. We also present the text-to-image generation results of the BF16 and DF11 Stable Diffusion 3.5 Large models in Appendix 18. Given the same random seed and text prompt, the generated images are pixel-wise identical to those of the original model.

DF11 outperforms CPU offloading in inference efficiency.

We compare the inference performance of DF11 and BF16 models across various hardware platforms. Due to memory constraints, BF16 models exceed the capacity of a single GPU and require partial CPU offloading, while DF11 models fit entirely within GPU memory. For fair comparison, we retain most computation on the GPU for BF16 models and offload only necessary components. Latency and throughput are measured after a 100-token warm-up run, followed by decoding 100 tokens from an empty prompt across varying batch sizes. Each configuration is run five times, and we report the average results. As shown in Figure 4, DF11 consistently outperforms BF16 with CPU offloading, achieving 2.31× to 46.24× lower latency or higher throughput. Multi-GPU comparisons are shown in Figure 10 in the Appendix.

DF11 reduces memory usage for diffusion transformers with minimal latency impact.

We assess the impact of DF11 compression on diffusion transformer models by measuring peak GPU memory usage and text-to-image generation latency for a 1024×1024 image across five runs. Neither the BF16 nor DF11 models employ CPU offloading. As shown in Table 3, DF11 reduces memory consumption by 28.3% for Stable Diffusion 3.5 and 27.8% for FLUX.1. The relative increase in latency is small: 4.1% for Stable Diffusion 3.5 and 5.5% for FLUX.1.

DF11 memory savings enable longer generation lengths.

DF11 compression can not only reduce the number of GPUs needed for inference but also support longer generation under the same VRAM budget. During decoding, the KV cache grows linearly with the number of tokens and quickly becomes a memory bottleneck. Figure 5 shows GPU memory usage for DF11 and BF16 models with batch size 1 as token count increases. DF11 allows 5.70× to 14.86× more tokens to be decoded before reaching memory limits.
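The linear growth of the KV cache can be made concrete with a small estimate. The sketch below uses the standard per-token KV size for a transformer with grouped-query attention. The function names are illustrative, and the estimate ignores activations and the temporary buffer for decompressed weights, so it is only a rough guide and does not reproduce the measured ratios in Figure 5.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes added per generated token at batch size 1."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def max_tokens(memory_budget_bytes, weight_bytes, per_token_bytes):
    """Tokens that fit in the memory left over after storing the weights."""
    free = memory_budget_bytes - weight_bytes
    return max(0, free // per_token_bytes)

# Assumed configuration for an 8B Llama-style model: 32 layers, 8 KV heads,
# head dimension 128, 2-byte cache entries. Treat these values as assumptions.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
```

Because the free memory is the budget minus the weights, shrinking the weights by 30% can multiply the number of cacheable tokens by much more than 1.3 when the uncompressed weights nearly fill the GPU.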

Figure 6. Comparison of latency breakdown for DF11 and BF16 Llama 3.1 8B Instruct during GPU inference for different token batch sizes, using one A100-40GB GPU.
Figure 7. Throughput (left two) and latency (right two) comparisons between transferring BF16 matrices from CPU to GPU and decompressing the same matrices on GPU using the NVIDIA nvCOMP ANS library and our proposed DF11 kernel, across matrix sizes and GPU types.

Ablation Study

Latency breakdown shows decompression overhead is amortized at larger batch sizes. We analyze the latency of Llama 3.1 8B Instruct in BF16 and DF11 formats across varying token batch sizes on an A100-40GB GPU. For each setting, we measure the average latency of each component over 10 runs, as shown in Figure 6. DF11 introduces additional latency from decompressing the token embedding, transformer blocks, and language modeling head. This overhead is constant and independent of batch size, so increasing the token batch size effectively amortizes the cost.
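The amortization argument can be written as a simple model, assuming the decompression cost $c$ per forward pass is a constant independent of the token batch size $b$, as observed above. The symbols $T_{\mathrm{DF11}}$, $T_{\mathrm{BF16}}$, and $c$ are notation introduced here for illustration.

```latex
T_{\mathrm{DF11}}(b) \approx T_{\mathrm{BF16}}(b) + c,
\qquad
\text{overhead per token} \approx \frac{c}{b}
\;\longrightarrow\; 0 \quad \text{as } b \text{ grows}.
```

Under this model, the relative slowdown $c / T_{\mathrm{BF16}}(b)$ also shrinks whenever $T_{\mathrm{BF16}}(b)$ increases with $b$.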

DF11 decompression is significantly faster than CPU-to-GPU transfer and nvCOMP ANS. We compare DF11 decompression latency and throughput with two baselines: CPU-to-GPU weight transfer and ANS decompression [26] from NVIDIA’s nvCOMP [37], using sliced weight matrices from the Llama 3.1 8B Instruct language modeling head. As shown in Figure 7, DF11 achieves up to 34.95× higher throughput than CPU transfer and up to 20.97× faster decompression than nvCOMP. DF11 also offers a better compression ratio (68%) compared to nvCOMP (79%). Moreover, DF11 decompression throughput improves with larger matrix sizes due to better GPU utilization.

Related Work

Data Formats for Model Weights. Full-precision model weights are typically stored in formats such as BF16, FP16, or FP32. Several works have proposed 4-bit compressed formats, including FP4, INT4, NF4 (NormalFloat) [38], AF4 (AbnormalFloat) [39], and SF4 (Student Float) [40], which represent each parameter with 4 bits. Unlike these lossy formats, the proposed DF11 format compresses weights losslessly.

Lossless Model Compression. While lossy compression methods such as pruning [41] and quantization [5, 4] are well-studied, lossless compression remains less explored. Four prior works have addressed this area. Deep Compression [16] applied Huffman coding [20] to quantized CNNs, achieving 22% additional compression. ZipNN [17] extended this approach to language models with improved compression over classical methods. However, both techniques target storage efficiency and do not support inference-time gains. NeuZip [42] is the only prior work supporting GPU inference. It uses Asymmetric Numeral Systems (ANS) with layer-wise decompression and relies on NVIDIA's nvCOMP for GPU-based operations. nvCOMP is no longer open source, and its binary-only distribution limits adoption. Moreover, as shown in Figure 7, nvCOMP ANS incurs higher latency and lower throughput compared to our DFloat11 kernel. Huff-LLM [18] is designed for FPGA-like hardware and is not applicable to GPUs. Additional discussion of related formats is presented in Appendix 10.

Conclusion

We introduce Dynamic-Length Float (DFloat11), a lossless compression framework designed for efficient GPU inference of BFloat16 models, including both large language models (LLMs) and diffusion models (DMs). DFloat11 exploits the information redundancy inherent in foundation model weights through entropy-coded, dynamic-length encoding, achieving compression rates close to the information-theoretic limit. To enable efficient deployment, we develop hardware-aware algorithms that support high-speed inference directly on compressed weights. Extensive experiments demonstrate that DFloat11 significantly reduces GPU memory requirements for LLMs and DMs, allowing for longer generation lengths, while maintaining bit-exact accuracy and incurring only negligible decompression overhead.

Acknowledgements

This work was supported by National Science Foundation SHF-2211815 and Ken Kennedy Institute Cluster Grants. Additionally, Henry and Xia are supported by ITE-2429680, IIS-2310260, and US Department of Transportation (USDOT) Tier-1 University Transportation Center (UTC) Transportation Cybersecurity Center for Advanced Research and Education (CYBER-CARE) grant #69A3552348332. Mohsen and Vipin are supported by OAC-2320952, OAC-2112606, and OAC-2117439. The views and conclusions in this paper are those of the authors and do not represent the views of any funding or supporting agencies.

Appendix

Discussion: Is Quantization a Universal Solution?

Much of the motivation behind our work lies in understanding whether lossless compression of large-scale models such as LLMs, which preserves 100% identical output behavior compared to the original uncompressed model, is a practical direction worthy of further study. Specifically, how does DFloat11, which compresses LLMs to approximately 11 bits, compare to widely used lossy quantization techniques [4, 5], where models are typically reduced to even lower bit-widths (e.g., 8-bit or 4-bit)?

The answer is far more nuanced than a simple "Yes/No" or a one-size-fits-all judgment about which approach is better. For instance, existing benchmark studies like [43, 44, 45] often suggest that 8-bit (weight-only or not) quantization is a relatively "safe" compression scheme. Although technically lossy, 8-bit models can often maintain strong task performance across a range of standard benchmarks. However, we must note these benchmarks typically focus on a narrow set of tasks (e.g., WikiText2 perplexity, MMLU, Commonsense Reasoning), and thus fail to offer a comprehensive view of real-world LLM usage, especially from the perspective of end-users.

That being said, the argument that "current benchmarks fail to capture the performance gap between 8-bit compressed and 16-bit uncompressed models" is itself constrained by the limitations of the current benchmarking landscape, making it difficult to produce abundant supporting evidence. Nonetheless, some reports have begun to highlight such gaps. For example, human evaluations on LLM Arena show a notable performance drop between Llama-3.1-405B-Instruct [3] and its W8A8 counterpart (Llama-3.1-405B-Instruct-FP8), particularly under coding (1293 vs. 1277) and long-query (1282 vs. 1275) tasks. Similarly, quantizing DeepSeek-R1-Distill-Llama-70B [10] from 16 bits to 8 bits results in a 23.7% relative drop on GPQA (from 9.51% to 7.25%). Furthermore, reasoning, a core capability of modern LLMs, appears especially sensitive to compression loss. A recent benchmark [11] reveals that quantizing DeepSeek-R1-Distill-Qwen-1.5B with 8-bit SmoothQuant [9] (for weight, attention, and KV cache) leads to an average 9.09% drop in reasoning tasks (48.82% to 44.29%) across datasets like AIME, MATH-500, GPQA-Diamond, and LiveCodeBench. We provide further evidence of the performance gap between 8-bit quantized and uncompressed models in Appendix 16.

The broader question, "Which specific task, on which model, using which quantization technique, under what conditions, will lead to a noticeable drop compared to FP16/BF16?", is likely to remain open-ended simply due to the sheer number of potential combinations. Even so, it is fair to say that lossy quantization introduces complexities that some end-users would prefer to avoid, since it creates uncontrolled variables that must be empirically stress-tested for each deployment scenario.

To eliminate this burden, DFloat11 offers a compelling alternative: delivering 100% identical performance to the original model, while consuming only 70% of the memory footprint with throughput benefits, which is a unique and practical offering for resource-constrained deployment settings.

Data Formats for Model Weights

LLM weights are typically stored in compact floating-point formats such as FP16 or BFloat16 (officially stylized as bfloat16). FP16 allocates 1 sign bit, 5 exponent bits, and 10 mantissa bits, whereas BFloat16 uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. Compared to FP16, BFloat16 offers a wider dynamic range at the cost of precision, which improves numerical stability and mitigates overflow issues during training [46, 47].
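The BFloat16 layout described above can be made concrete with a minimal Python sketch. It assumes BFloat16 is obtained by keeping the upper 16 bits of the IEEE-754 float32 encoding (plain truncation, no rounding), and the helper names are hypothetical. The last two functions regroup the fields into an 8-bit exponent and a packed sign-mantissa byte, mirroring the bit operations that Algorithm 1 uses to reassemble a BFloat16 value.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Upper 16 bits of the float32 encoding of x (truncation, no rounding)."""
    (u32,) = struct.unpack(">I", struct.pack(">f", x))
    return u32 >> 16


def split_bf16(bits: int):
    """Split a 16-bit BFloat16 pattern into (sign, exponent, mantissa)."""
    sign = (bits >> 15) & 0x1        # 1 sign bit
    exponent = (bits >> 7) & 0xFF    # 8 exponent bits
    mantissa = bits & 0x7F           # 7 mantissa bits
    return sign, exponent, mantissa


def pack_sign_mantissa(sign: int, mantissa: int) -> int:
    """Store the sign in bit 7 and the mantissa in bits 0-6 of one byte."""
    return (sign << 7) | mantissa


def reassemble(exponent: int, packed: int) -> int:
    """Rebuild the 16-bit pattern from an exponent and a packed byte."""
    sign = packed & 0b10000000
    mantissa = packed & 0b01111111
    return (sign << 8) | (exponent << 7) | mantissa


bits = float_to_bf16_bits(-0.15625)   # -1.25 * 2**-3, exactly representable
sign, exponent, mantissa = split_bf16(bits)
restored = reassemble(exponent, pack_sign_mantissa(sign, mantissa))
print(hex(bits), sign, exponent, mantissa, restored == bits)
```

Only the exponent field is entropy-coded in DFloat11; the sign and mantissa are stored as-is, which is why the sketch keeps them together in a single byte.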

Compressed data formats typically aim for lower bit-widths. For example, FP8 - which comes in both E4M3 (4 exponent bits, 3 mantissa bits, plus 1 sign bit) and E5M2 configurations - has seen reasonable adoption in LLM training and development. Integer formats like INT8 have also been well explored, as in LLM.int8() [48] and its follow-up works. Formats with a stronger emphasis on efficiency, such as FP4, INT4, NF4 [38], and AF4 [39], use only 4 bits. In this work, we primarily focus on formats with 8 bits, as benchmark literature [44, 43, 11] often suggests that 8-bit quantization results in a negligible performance drop - though we show in Section 9 that this claim is likely skewed due to evaluation selectiveness and benchmark limitations.

Lossless Model Compression

While lossy model compression techniques such as pruning and quantization [41, 5, 4] have received widespread attention, lossless model compression remains a relatively underexplored area. Upon careful investigation, we identified roughly four prior works that have made meaningful efforts in this space. Deep Compression [16] is a foundational work, applying Huffman coding [20] to quantized CNN models and achieving an additional 22% compression gain for model checkpoints. ZipNN [17] extended this idea to language models, comparing its results to classic lossless compression tools such as zlib [49] and zstd and demonstrating superior compression gains. However, this line of work - including its industry counterparts, such as ezm7 - is limited in that its efficiency gains only apply to storage (reducing the size of model checkpoints) and offer no benefits during inference. While such storage savings are meaningful in large-scale training settings - where frequent snapshotting and checkpoint rollbacks are needed [19] - they have limited impact for everyday LLM end-users. Model downloading is typically a one-time cost, so even if a model checkpoint is compressed by 50%, this at most halves the download time, a saving incurred only once over the model’s entire deployment lifecycle. Furthermore, checkpoints are usually stored on disk, where terabytes of capacity are easily available, which makes storage a much looser constraint than GPU HBM (High Bandwidth Memory), one of the main resource constraints during inference.

We argue that a lossless compression technique would be substantially more impactful if it could deliver efficiency gains during inference - particularly on GPU-based systems, which is the default setup for LLM serving. In this context, NeuZip [42] is the only prior work we identify that supports GPU inference. NeuZip applies entropy encoding with layer-wise decompression to maintain a reduced memory footprint throughout serving. However, it is built on NVIDIA's nvCOMP: "a high-speed data compression and decompression library optimized for NVIDIA GPUs". Unfortunately, nvCOMP is no longer open-source (only binary executables are available), which hinders future research. Moreover, we empirically find that nvCOMP’s inference throughput and latency are significantly worse than our proposed DFloat11 kernel, resulting in a pipeline that trades memory efficiency for substantial inference overhead (see Figure 7).

Another work referencing NeuZip is Huff-LLM [18], which also aims to reduce memory costs while maintaining efficient inference. However, its contributions are specific to FPGA-like architectures and do not apply to GPUs. To the best of our knowledge, the DFloat data format we present (and its kernel support in DFloat11) is the only GPU-inference-friendly data format with lossless compression benefits.

Efficient LLM Inference

LLMs are computationally intensive and resource-demanding, making the efficiency of LLM inference a key research focus [50]. FlashAttention [51] accelerates exact attention computation on GPUs through kernel fusion, while NoMAD Attention [52] speeds up attention on CPUs using in-register lookups. Model compression is another effective strategy to reduce resource requirements for serving LLMs and diffusion models. Quantization methods such as GPTQ [4], AWQ [5], SmoothQuant [9], LeanQuant [53], CQ [54], KVQuant [55], and KIVI [56] lower memory usage and enhance efficiency by compressing model weights, activations, or KV cache. Compression is also applied in fine-tuning: methods like LoRA [57], QLoRA [38], and SketchTune [58] compress model weight deltas, whereas GaLore [59] and SARA [60] compress optimizer states during training. Another line of work relevant to efficient LLM inference is lossless efficient decoding, where paradigms such as speculative decoding [61, 62, 63] and n-gram candidate decoding [64, 65] offer lossless generation quality with improved latency. DFloat11 mainly differs from these works in that it provides substantial savings in memory footprint while maintaining lossless generation quality, whereas most, if not all, lossless efficient decoding methods require memory consumption equal to or greater than that of the original model.

Figure 8. Relative frequency distribution of sign, exponent, and mantissa values in the BFloat16 weights of all linear projection layers across various LLMs.

Frequency Distribution of BFloat16 Values

Figure 8 presents the frequency distribution for distinct values of sign, exponent, and mantissa bits in the BFloat16 weights of LLMs. Figure 9 shows the sorted frequency of exponent values of LLM weights.

Figure 9. Distribution of BFloat16 exponent values across various models. The frequency of exponent values (shown in log scale) decays rapidly with exponent rank.
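To illustrate how a skewed exponent distribution translates into compressibility, the following Python sketch estimates the Shannon entropy of the exponent field. It is only an illustration under stated assumptions: the weights are synthetic zero-mean Gaussian samples (the standard deviation of 0.02 is an arbitrary choice), not weights from any model in Figures 8 and 9, and the helper names are hypothetical.

```python
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """8-bit exponent field shared by the float32 and BFloat16 encodings."""
    (u32,) = struct.unpack(">I", struct.pack(">f", x))
    return (u32 >> 23) & 0xFF


def entropy_bits(values) -> float:
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)  # fixed seed so the run is reproducible
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]  # synthetic stand-in
exponents = [bf16_exponent(w) for w in weights]

h = entropy_bits(exponents)
# Sign (1 bit) and mantissa (7 bits) are kept verbatim; only the exponent shrinks.
print(f"exponent entropy: {h:.2f} bits, "
      f"estimated bits per weight: {1 + 7 + h:.2f} (vs. 16 for BFloat16)")
```

An entropy well below 8 bits means that the fixed 8-bit exponent field carries redundant information, which is the inefficiency that entropy coding removes.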

Pseudo-code of the GPU kernel for DFloat11 Decompression

Algorithm 1 presents the pseudo-code of the two-phase GPU kernel for decompressing DFloat11 to BFloat16.

Algorithm 1. GPU kernel for decompressing DFloat11 to BFloat16
procedure DF11ToBF16
  Require:
    - $EncodedExponent, PackedSignMantissa$: byte arrays
    - $LUT_1, \dots, LUT_k, CodeLengths$: 8-bit unsigned integer arrays of size 256
    - $Gaps$: 5-bit unsigned integer array (one entry per thread in each block)
    - $BlockOutputPos$: 32-bit unsigned integer array (one entry per block)
    - $Outputs$: BFloat16 array, for storing results
    - $B, T, n, k$: the number of thread blocks, number of threads, number of bytes processed by each thread, and number of compact LUTs, respectively
  Divide $EncodedExponent$ into chunks $EncodedExponent_1, \dots, EncodedExponent_B$ of size $nT$ bytes each
  for all $b \leftarrow 1, \dots, B$ (in parallel across blocks) do
    Load $EncodedExponent_b$ into SRAM
    Divide $EncodedExponent_b$ into chunks $EncodedExponent_{b,1}, \dots, EncodedExponent_{b,T}$ of size $n$ bytes each
    Load $LUT_1, \dots, LUT_k, CodeLengths$ into SRAM
    Initialize integer arrays $NumElements[1 \dots T], ThreadOutputPos[1 \dots T]$ with all $0$s
    Initialize BFloat16 write buffer $WriteBuffer$ in SRAM
    for all $t \leftarrow 1, \dots, T$ (in parallel across threads) do
      ▷ Phase 1: Each thread determines its initial output position
      $BitOffset \leftarrow Gaps[bT+t]$
      while $BitOffset < 8n$ do
        Read the next 4 bytes of $EncodedExponent_{b,t}$, starting from the $BitOffset$-th bit, into $Byte_{1 \dots 4}$
        $i \leftarrow 1$
        $Exponent \leftarrow LUT_1[Byte_i]$
        while $Exponent \geq 240$ do
          ▷ $Exponent \geq 240$ means that it is a pointer to the next LUT
          $i \leftarrow i + 1$
          $Exponent \leftarrow LUT_{(257 - Exponent)}[Byte_i]$
        end while
        $BitOffset \leftarrow BitOffset + CodeLengths[Exponent]$
        $NumElements[t] \leftarrow NumElements[t] + 1$
      end while
      Thread Synchronization Barrier
      ▷ Compute prefix-sum using Blelloch's Algorithm:
      $ThreadOutputPos[t] \leftarrow BlockOutputPos[b] + \sum_{i=1}^{t-1} NumElements[i]$
      ▷ Phase 2: Writing decoded BFloat16s to the appropriate positions
      $BitOffset \leftarrow Gaps[bT+t]$
      while $BitOffset < 8n$ do
        Read the next 4 bytes of $EncodedExponent_{b,t}$, starting from the $BitOffset$-th bit, into $Byte_{1 \dots 4}$
        $i \leftarrow 1$
        $Exponent \leftarrow LUT_1[Byte_i]$
        while $Exponent \geq 240$ do
          ▷ $Exponent \geq 240$ means that it is a pointer to the next LUT
          $i \leftarrow i + 1$
          $Exponent \leftarrow LUT_{(257 - Exponent)}[Byte_i]$
        end while
        $Byte \leftarrow PackedSignMantissa[ThreadOutputPos[t]]$
        $Sign \leftarrow Byte$ bitwise_and $0b10000000$
        $Mantissa \leftarrow Byte$ bitwise_and $0b01111111$
        $WriteBuffer[ThreadOutputPos[t] - BlockOutputPos[b]] \leftarrow$ ($Sign$ bitwise_left_shift $8$) bitwise_or ($Exponent$ bitwise_left_shift $7$) bitwise_or $Mantissa$
        $BitOffset \leftarrow BitOffset + CodeLengths[Exponent]$
        $ThreadOutputPos[t] \leftarrow ThreadOutputPos[t] + 1$
      end while
    end for
    ▷ Perform coalesced writes to HBM:
    $Outputs[BlockOutputPos[b] \dots (BlockOutputPos[b+1]-1)] \leftarrow WriteBuffer[0 \dots (BlockOutputPos[b+1]-BlockOutputPos[b]-1)]$
  end for
end procedure
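The two-phase structure of Algorithm 1 can be simulated sequentially in Python. The sketch below is not the CUDA kernel and simplifies it in several labeled ways: it uses a toy six-symbol prefix code instead of Huffman-coded exponents, a single flat lookup table instead of the compact LUT hierarchy, one block only, bit strings instead of packed bytes, and gap values that are not restricted to 5 bits. All function names are hypothetical. What it preserves is the core idea: each "thread" first counts the symbols that start in its chunk, a prefix sum turns the counts into output positions, and a second pass writes each symbol to its final position.

```python
import math

# Toy prefix code (an assumption for illustration, not real exponent codes).
CODES = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "1110", "F": "1111"}
LENGTHS = {s: len(c) for s, c in CODES.items()}
L = max(LENGTHS.values())

# Flat table: every L-bit window maps to the symbol whose code it starts with.
LUT = [None] * (1 << L)
for sym, code in CODES.items():
    base = int(code, 2) << (L - len(code))
    for j in range(1 << (L - len(code))):
        LUT[base + j] = sym


def encode(symbols, n_bytes):
    """Return (bitstring, gaps, num_threads) for chunks of n_bytes per thread."""
    chunk_bits = 8 * n_bytes
    total_bits = sum(LENGTHS[s] for s in symbols)
    num_threads = max(1, math.ceil(total_bits / chunk_bits))
    gaps = [chunk_bits] * num_threads  # chunk_bits means "no codeword starts here"
    pos = 0
    for s in symbols:
        t = pos // chunk_bits
        gaps[t] = min(gaps[t], pos % chunk_bits)  # first codeword start in chunk t
        pos += LENGTHS[s]
    bits = "".join(CODES[s] for s in symbols).ljust(chunk_bits * num_threads, "0")
    return bits, gaps, num_threads


def decode_two_phase(bits, gaps, num_threads, n_bytes, num_symbols):
    chunk_bits = 8 * n_bytes
    padded = bits + "0" * L  # lets the last thread read an L-bit window safely

    def walk(t):
        """Yield every symbol whose codeword starts inside thread t's chunk."""
        offset = gaps[t]
        while offset < chunk_bits:
            start = t * chunk_bits + offset
            sym = LUT[int(padded[start:start + L], 2)]
            yield sym
            offset += LENGTHS[sym]

    # Phase 1: each thread counts the elements it will produce.
    num_elements = [sum(1 for _ in walk(t)) for t in range(num_threads)]

    # Exclusive prefix sum: first output position of each thread.
    thread_output_pos, running = [], 0
    for count in num_elements:
        thread_output_pos.append(running)
        running += count

    # Phase 2: decode again and write to the computed positions.
    outputs = [None] * num_symbols
    for t in range(num_threads):
        pos = thread_output_pos[t]
        for sym in walk(t):
            if pos < num_symbols:  # drop symbols decoded from trailing zero padding
                outputs[pos] = sym
            pos += 1
    return outputs


symbols = list("ABACADAEAFFEDCBA" * 3)
bits, gaps, num_threads = encode(symbols, n_bytes=2)
decoded = decode_two_phase(bits, gaps, num_threads, n_bytes=2, num_symbols=len(symbols))
print(decoded == symbols, num_threads, gaps)
```

Decoding twice looks wasteful, but it is what lets every thread work independently: no thread knows where to write until all threads have counted their elements.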
Table 4. System specifications of servers used for experiments.

| Server | GPU | GPU Memory | CPU | CPU Memory |
|---|---|---|---|---|
| Server 1 | NVIDIA RTX A5000 | 24564 MiB | AMD EPYC 7513 32-Core | 504 GB |
| Server 2 | NVIDIA A100 | 40960 MiB | AMD EPYC 7742 64-Core | 1.48 TB |
| Server 3 | NVIDIA Quadro RTX 8000 | 49152 MiB | AMD EPYC 7742 64-Core | 1.48 TB |

Hardware for Experiments

Table 4 presents the hardware configuration of servers used for experiments.

DFloat11 Compression Time

Table 5. Compression time per transformer block for different models.

| Model | Compression Time per Transformer Block (s) |
|---|---|
| Llama 3.1 8B Instruct | 191 |
| Llama 3.3 70B Instruct | 547 |
| Llama 3.1 405B Instruct | 2133 |

Table 5 reports the time required to compress a single transformer block for models of different sizes. Compression is a one-time preprocessing step for each model and is performed using a single CPU thread. Since transformer blocks are independent in terms of weight storage, their compression can be parallelized across multiple CPU threads, making the overall process highly scalable and efficient.

Figure 10. Comparison of average latency and throughput for token decoding between the original (BF16) models and their losslessly compressed (DF11) counterparts. The BF16 and DF11 models are run on the same GPU configurations, with Flash Attention [51] turned on for both methods.

GPU Inference Efficiency Comparison: BF16 vs. DF11

We present the GPU inference efficiency of BF16 and DF11 models in Figure 10, for various models and batch sizes on A100 GPUs.

Table 6. INT8 quantization error on different tasks. "Math" denotes MATH Hard with 2 shots. "GPQA CoT" is with 2 shots. "Δ" denotes the error gap via INT8 quantization.

| Model | Data Type | Math | GPQA CoT |
|---|---|---|---|
| Llama-3.1-8B-Instruct | BF16 | 23.92 | 15.18 |
| | INT8 | 19.92 | 14.06 |
| | Δ | 4.0 | 1.12 |

Impact of Lossy Quantization

An accuracy comparison of the original and INT8-quantized Llama model is presented in Table 6.

Efficient Decoding of Huffman Codes Using Compact Lookup Tables

The Dual Lookup Table Approach

Figure 11. Decoding Huffman codes can be performed either by traversing the Huffman tree or by using two lookup tables: one that maps each L-bit binary code to its corresponding symbol, and another that stores the code length for each symbol.

Huffman decoding can be performed by traversing the Huffman tree: starting from the root, each bit of the encoded bitstream determines the branch to follow, and the symbol is fully decoded upon reaching a leaf node. While this bit-by-bit traversal is conceptually simple, it is inefficient in practice. Each branching decision depends on the previous one, leading to frequent memory accesses and conditional jumps. This pattern is especially problematic on GPUs, where it causes branch divergence and limits instruction-level parallelism. A widely adopted alternative is lookup-table-based decoding [27], which flattens the Huffman tree into two compact lookup tables. This enables decoding of each symbol using just two array lookups and a bit shift, significantly improving throughput.
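For reference, bit-by-bit tree traversal can be sketched in a few lines of Python. The six-symbol prefix code is a toy assumption chosen for illustration, and the nested-dictionary tree and function names are hypothetical. The point of the sketch is the loop body: every bit triggers a dependent lookup and a leaf check, which is the branching pattern that maps poorly to GPUs.

```python
# Toy prefix code (an assumption for illustration).
CODES = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "1110", "F": "1111"}


def build_tree(codes):
    """Build a binary tree as nested dicts; leaves are symbol strings."""
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = sym
    return root


def decode_by_traversal(bits, root):
    """Follow one branch per bit; emit a symbol whenever a leaf is reached."""
    out, node = [], root
    for bit in bits:
        node = node[bit]            # depends on the previous step's result
        if isinstance(node, str):   # data-dependent branch: leaf or inner node?
            out.append(node)
            node = root
    return out


tree = build_tree(CODES)
print(decode_by_traversal("0" + "100" + "1110" + "1111", tree))
```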

We employ two lookup tables, LUT and CodeLengths, to achieve efficient, branch-free Huffman decoding. Let L denote the length of the longest codeword in the Huffman codebook. We construct the primary lookup table LUT as an array of size $2^L$, where each entry maps an L-bit binary sequence to the first symbol it encodes.

Figure 11 shows an example with L=4 and a set of symbols A, B, C, D, E, F. For clarity, we use letters to represent symbols, though in practice these correspond to exponent values in BFloat16 weights. The lookup table LUT contains $2^4 = 16$ entries, indexed by all possible 4-bit binary sequences. Each entry in LUT stores the symbol whose Huffman code matches the prefix of that index. If a symbol's Huffman code is shorter than L bits, it will fill multiple consecutive entries. For example, if symbol A is encoded as the single bit 0, then all binary sequences from 0000 to 0111 begin with 0, so entries 0 through 7 in LUT are assigned to A. In contrast, symbols with Huffman codes of length L occupy exactly one entry each. For instance, E=1110 and F=1111 map to entries 14 and 15, respectively. This construction yields a dense prefix table that allows decoding a symbol with a single array lookup using an L-bit segment from the encoded bitstream.

To advance the encoded bitstream for decoding the next symbol, we also store the code lengths of all symbols. The second lookup table, CodeLengths, maps each symbol to its Huffman code length. In the example, the lengths are: A:1, B:3, C:3, D:3, E:4, F:4. Together, these two tables allow fast, deterministic decoding by repeating the following steps:

  1. Use the next L bits from the encoded bitstream to index LUT and retrieve the decoded symbol.
  2. Look up the code length of the decoded symbol from CodeLengths to determine how many bits to consume.
  3. Advance the encoded bitstream and repeat.

This approach eliminates conditional branches and pointer chasing during decoding, making it highly suitable for parallel computation on GPUs.
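The three steps above can be sketched in Python using the L=4 example. The codes for A, E, and F and all code lengths come from the text; the specific 3-bit codes for B, C, and D (100, 101, 110) are an assumption consistent with those lengths. For readability the sketch operates on a string of '0'/'1' characters rather than packed bytes with bit shifts, and the function names are hypothetical.

```python
# A, E, F codes are from the example; B, C, D codes are assumed 3-bit codes.
CODES = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "1110", "F": "1111"}


def build_tables(codes):
    """Build the flat prefix table LUT and the CodeLengths table."""
    max_len = max(len(c) for c in codes.values())
    lut = [None] * (1 << max_len)
    for sym, code in codes.items():
        pad = max_len - len(code)
        base = int(code, 2) << pad
        for j in range(1 << pad):      # a short code fills 2**pad consecutive entries
            lut[base + j] = sym
    code_lengths = {sym: len(code) for sym, code in codes.items()}
    return lut, code_lengths, max_len


def decode(bits, num_symbols, lut, code_lengths, max_len):
    """Decode num_symbols symbols with one LUT lookup and one length lookup each."""
    padded = bits + "0" * max_len      # so the final window always has max_len bits
    out, pos = [], 0
    for _ in range(num_symbols):
        sym = lut[int(padded[pos:pos + max_len], 2)]  # step 1: index LUT
        out.append(sym)
        pos += code_lengths[sym]                      # steps 2-3: advance the stream
    return out


lut, code_lengths, max_len = build_tables(CODES)
message = ["A", "B", "E", "A", "F", "D", "C"]
encoded = "".join(CODES[s] for s in message)
print(lut)
print(decode(encoded, len(message), lut, code_lengths, max_len) == message)
```

The decoder needs the number of symbols because the padded tail of the bitstream would otherwise decode into spurious symbols.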

Decomposing LUT into Hierarchical, Compact Lookup Tables

The primary lookup table LUT contains $2^L$ entries, where L is the maximum code length in the Huffman codebook. While this enables constant-time decoding, the table size grows exponentially with L. In practice, L ranges from 24 to 32 for Huffman trees built with BFloat16 exponents. This results in table sizes of $2^{24}$ to $2^{32}$ entries, which far exceeds the capacity of GPU SRAM. To address this, we decompose LUT into multiple smaller lookup tables that fit within on-chip memory, while still enabling fast decoding.

Hierarchical Table Structure
Figure 12. A Huffman tree can be decomposed into a hierarchy of subtrees, each represented by a compact lookup table (LUT). Each LUT may reference another lower-level LUT in the hierarchy. This hierarchical decoding approach is functionally equivalent to using a single monolithic LUT, but significantly more memory efficient.

Instead of storing a single flat table of size $2^L$, we decompose LUT into a hierarchy of compact lookup tables. Each table corresponds to a subtree of the Huffman tree and is responsible for decoding b bits. Each table processes the next b bits and either (i) directly returns a decoded symbol, or (ii) delegates to a table next in the hierarchy for decoding the next b bits. This hierarchical organization mirrors the structure of the original Huffman tree and significantly reduces total memory usage.

Figure 12 illustrates an example where the Huffman tree is partitioned into three subtrees, each mapped to a separate lookup table responsible for 2 bits. The decoding process using these three LUTs proceeds as follows:

  • $LUT_0$: Uses the first and second bits of the encoded bitstream to determine how to proceed, leading to 3 possible cases:
      • 00, 01: decode the next symbol as A.
      • 10: delegate to $LUT_1$.
      • 11: delegate to $LUT_2$.
  • $LUT_1$: Uses the third and fourth bits of the encoded bitstream to continue decoding:
      • 00, 01: decode the next symbol as B.
      • 10, 11: decode the next symbol as C.
  • $LUT_2$: Uses the third and fourth bits of the encoded bitstream to continue decoding:
      • 00, 01: decode the next symbol as D.
      • 10: decode the next symbol as E.
      • 11: decode the next symbol as F.
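The three-table example from Figure 12 can be written out as a small Python sketch. The table contents follow the cases listed above; the representation is an assumption made for readability: an integer entry stands for a pointer to another table and a string entry for a decoded symbol, and the bitstream is a string of '0'/'1' characters. Function and variable names are hypothetical.

```python
B_BITS = 2  # each compact table consumes 2 bits in this example

# Integer entries are pointers to another table; string entries are symbols.
TABLES = [
    ["A", "A", 1, 2],        # LUT0: 00, 01 -> A; 10 -> LUT1; 11 -> LUT2
    ["B", "B", "C", "C"],    # LUT1: 00, 01 -> B; 10, 11 -> C
    ["D", "D", "E", "F"],    # LUT2: 00, 01 -> D; 10 -> E; 11 -> F
]
CODE_LENGTHS = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 4, "F": 4}


def decode_hierarchical(bits, num_symbols):
    """Decode by chaining compact tables, B_BITS bits per lookup."""
    padded = bits + "0" * (2 * B_BITS)  # room for two lookups at the very end
    out, pos = [], 0
    for _ in range(num_symbols):
        level = 0
        entry = TABLES[0][int(padded[pos:pos + B_BITS], 2)]
        while isinstance(entry, int):   # pointer: continue in the next table
            level += 1
            start = pos + level * B_BITS
            entry = TABLES[entry][int(padded[start:start + B_BITS], 2)]
        out.append(entry)
        pos += CODE_LENGTHS[entry]      # consume only the bits of this codeword
    return out


# A=0, C=101, D=110, E=1110, F=1111, B=100
stream = "0" + "101" + "110" + "1110" + "1111" + "100"
print(decode_hierarchical(stream, 6))
```

The three tables hold 12 entries in total, compared with 16 for the flat 4-bit table; the saving grows quickly as the maximum code length increases.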

For decoding Huffman-coded BFloat16 exponents, we decompose the LUT into multiple compact lookup tables, each responsible for decoding 8 bits (i.e. b=8). This allows us to read the next byte from the encoded bitstream and perform an array lookup from a 256-entry array in each step. In practice, the decomposition of LUT leads to 4 to 8 compact LUTs, each with 256 entries, which comfortably fits within fast SRAM.
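The memory saving can be checked with a short calculation. The Python sketch below assumes one byte per table entry, in line with the 8-bit unsigned integer LUT arrays listed in Algorithm 1; applying the same entry size to the monolithic table is an assumption, and the function names are hypothetical.

```python
def flat_lut_bytes(max_code_len: int, bytes_per_entry: int = 1) -> int:
    """Size of a single monolithic table indexed by max_code_len bits."""
    return (1 << max_code_len) * bytes_per_entry


def compact_luts_bytes(num_tables: int, bits_per_table: int = 8,
                       bytes_per_entry: int = 1) -> int:
    """Total size of num_tables compact tables, each indexed by bits_per_table bits."""
    return num_tables * (1 << bits_per_table) * bytes_per_entry


for max_len in (24, 32):
    print(f"flat LUT, L={max_len}: {flat_lut_bytes(max_len) / 2**20:.0f} MiB")
for k in (4, 8):
    print(f"{k} compact LUTs: {compact_luts_bytes(k)} bytes")
```

Under these assumptions the monolithic table needs between 16 MiB and 4 GiB, while four to eight compact tables need 1 to 2 KiB.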

Text-to-image Results of BF16 and DF11 Diffusion Models

Figure 13. Images generated by Stable Diffusion 3.5 Large in the original BFloat16 precision (top 5) are pixel-wise identical to those produced by the DFloat11-compressed model (bottom 5), using the same prompt and random seed.

Figure 13 compares images generated using Stable Diffusion 3.5 Large in the BFloat16 and DFloat11 weight formats. The images are pixel-wise identical when using the same prompt and random seed.

Limitations

This work focuses exclusively on losslessly compressing BFloat16 weights. We do not consider other formats such as FP32, FP16, or FP8, which may require different compression strategies.

While DF11 improves memory efficiency, it introduces a small but non-zero latency overhead due to decompression. This overhead is amortized at larger batch sizes but may impact latency-sensitive applications with small batches.

Our evaluation is limited to GPUs. We do not assess performance on other hardware such as CPUs, TPUs, or custom accelerators, which may require platform-specific optimizations.

References

  1. Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, Xia Hu. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. 2024. https://www.amazon.science/publications/harnessing-the-power-of-llms-in-practice-a-survey-on-chatgpt-and-beyond
  2. Yang, Ling, Zhang, Zhilong, Song, Yang, Hong, Shenda, Xu, Runsheng, Zhao, Yue, Zhang, Wentao, Cui, Bin, Yang, Ming-Hsuan. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys. 2023.
  3. Grattafiori, Aaron, Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Vaughan, Alex, others. The llama 3 herd of models. arXiv preprint arXiv:2407.21783. 2024.
  4. Frantar, Elias, Ashkboos, Saleh, Hoefler, Torsten, Alistarh, Dan. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323. 2022.
  5. Lin, Ji, Tang, Jiaming, Tang, Haotian, Yang, Shang, Chen, Wei-Ming, Wang, Wei-Chen, Xiao, Guangxuan, Dang, Xingyu, Gan, Chuang, Han, Song. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems. 2024.
  6. Li, Xiuyu, Liu, Yijiang, Lian, Long, Yang, Huanrui, Dong, Zhen, Kang, Daniel, Zhang, Shanghang, Keutzer, Kurt. Q-diffusion: Quantizing diffusion models. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
  7. Sui, Yang, Li, Yanyu, Kag, Anil, Idelbayev, Yerlan, Cao, Junli, Hu, Ju, Sagar, Dhritiman, Yuan, Bo, Tulyakov, Sergey, Ren, Jian. Bitsfusion: 1.99 bits weight quantization of diffusion model. arXiv preprint arXiv:2406.04333. 2024.
  8. Li, Shiyao, Ning, Xuefei, Wang, Luning, Liu, Tengxuan, Shi, Xiangsheng, Yan, Shengen, Dai, Guohao, Yang, Huazhong, Wang, Yu. Evaluating Quantized Large Language Models. International Conference on Machine Learning. 2024.
  9. Xiao, Guangxuan, Lin, Ji, Seznec, Mickael, Wu, Hao, Demouth, Julien, Han, Song. Smoothquant: Accurate and efficient post-training quantization for large language models. International Conference on Machine Learning. 2023.
  10. Guo, Daya, Yang, Dejian, Zhang, Haowei, Song, Junxiao, Zhang, Ruoyu, Xu, Runxin, Zhu, Qihao, Ma, Shirong, Wang, Peiyi, Bi, Xiao, others. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. 2025.
  11. Liu, Ruikang, Sun, Yuxuan, Zhang, Manyi, Bai, Haoli, Yu, Xianzhi, Yu, Tiezheng, Yuan, Chun, Hou, Lu. Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models. arXiv preprint arXiv:2504.04823. 2025.
  12. Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, Ramachandran Ramjee. Accuracy is Not All You Need. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024. https://openreview.net/forum?id=QVG7j29Sta
  13. Yang, An, Yang, Baosong, Zhang, Beichen, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Li, Chengyuan, Liu, Dayiheng, Huang, Fei, Wei, Haoran, others. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. 2024.
  14. Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Lukasz, Plappert, Matthias, Tworek, Jerry, Hilton, Jacob, Nakano, Reiichiro, Hesse, Christopher, Schulman, John. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168. 2021.
  15. Kharinaev, Artyom, Moskvoretskii, Viktor, Shvetsov, Egor, Studenikina, Kseniia, Mikhail, Bykov, Burnaev, Evgeny. Investigating the impact of quantization methods on the safety and reliability of large language models. arXiv preprint arXiv:2502.15799. 2025.
  16. Han, Song, Mao, Huizi, Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. 2015.
  17. Hershcovitch, Moshik, Wood, Andrew, Choshen, Leshem, Girmonsky, Guy, Leibovitz, Roy, Ennmouri, Ilias, Malka, Michal, Chin, Peter, Sundararaman, Swaminathan, Harnik, Danny. Zipnn: Lossless compression for ai models. arXiv preprint arXiv:2411.05239. 2024.
  18. Yubeaton, Patrick, Mahmoud, Tareq, Naga, Shehab, Taheri, Pooria, Xia, Tianhua, George, Arun, Khalil, Yasmein, Zhang, Sai Qian, Joshi, Siddharth, Hegde, Chinmay, others. Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference. arXiv preprint arXiv:2502.00922. 2025.
  19. Zhuang Wang, Zhen Jia, Shuai Zhang, Zhen Zhang, Mason Fu, T. S. Eugene Ng, Yida Wang. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. 2023. https://www.amazon.science/publications/gemini-fast-failure-recovery-in-distributed-training-with-in-memory-checkpoints
  20. Huffman, David A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE. 1952.
  21. Qwen Team. Qwen3: Think Deeper, Act Faster. 2025. https://qwenlm.github.io/blog/qwen3/
  22. Mistral AI Team. Mistral Small 3. 2025.
  23. Black Forest Labs. FLUX. 2024.
  24. Stability AI. Introducing Stable Diffusion 3.5. 2024.
  25. Langdon, G. G.. An Introduction to Arithmetic Coding. IBM Journal of Research and Development. 1984. https://doi.org/10.1147/rd.282.0135
  26. Duda, Jarek. Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding. arXiv preprint arXiv:1311.2540. 2013.
  27. Yamamoto, Naoya, Nakano, Koji, Ito, Yasuaki, Takafuji, Daisuke, Kasagi, Akihiko, Tabaru, Tsuguchika. Huffman coding with gap arrays for GPU acceleration. Proceedings of the 49th International Conference on Parallel Processing. 2020.
  28. Blelloch, Guy E. Scans as primitive parallel operations. IEEE Transactions on computers. 1989.
  29. Wolf, Thomas, Debut, Lysandre, Sanh, Victor, Chaumond, Julien, Delangue, Clement, Moi, Anthony, Cistac, Pierric, Rault, Tim, Louf, Rémi, Funtowicz, Morgan, others. Transformers: State-of-the-art natural language processing. Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations. 2020.
  30. Mistral AI Team. Mistral NeMo. 2024.
  31. Abdin, Marah, Aneja, Jyoti, Behl, Harkirat, Bubeck, Sébastien, Eldan, Ronen, Gunasekar, Suriya, Harrison, Michael, Hewett, Russell J, Javaheripi, Mojan, Kauffmann, Piero, others. Phi-4 technical report. arXiv preprint arXiv:2412.08905. 2024.
  32. Gao, Leo, Tow, Jonathan, Abbasi, Baber, Biderman, Stella, Black, Sid, DiPofi, Anthony, Foster, Charles, Golding, Laurence, Hsu, Jeffrey, Le Noac'h, Alain, Li, Haonan, McDonell, Kyle, Muennighoff, Niklas, Ociepa, Chris, Phang, Jason, Reynolds, Laria, Schoelkopf, Hailey, Skowron, Aviya, Sutawika, Lintang, Tang, Eric, Thite, Anish, Wang, Ben, Wang, Kevin, Zou, Andy. A framework for few-shot language model evaluation. Zenodo. 2024. https://zenodo.org/records/12608602
  33. Hendrycks, Dan, Burns, Collin, Basart, Steven, Zou, Andy, Mazeika, Mantas, Song, Dawn, Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations.
  34. Lin, Stephanie, Hilton, Jacob, Evans, Owain. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
  35. Merity, Stephen, Xiong, Caiming, Bradbury, James, Socher, Richard. Pointer Sentinel Mixture Models. International Conference on Learning Representations. 2017.
  36. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020. http://jmlr.org/papers/v21/20-074.html
  37. NVIDIA Corporation. nvCOMP: GPU-Accelerated Compression and Decompression Library. 2025.
  38. Dettmers, Tim, Pagnoni, Artidoro, Holtzman, Ari, Zettlemoyer, Luke. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems. 2023.
  39. Yoshida, Davis. NF4 Isn't Information Theoretically Optimal (and that's Good). arXiv preprint arXiv:2306.06965. 2023.
  40. Dotzel, Jordan, Chen, Yuzong, Kotb, Bahaa, Prasad, Sushma, Wu, Gang, Li, Sheng, Abdelfattah, Mohamed S, Zhang, Zhiru. Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs. International Conference on Machine Learning. 2024.
  41. Frantar, Elias, Alistarh, Dan. Sparsegpt: Massive language models can be accurately pruned in one-shot. International Conference on Machine Learning. 2023.
  42. Hao, Yongchang, Cao, Yanshuai, Mou, Lili. NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks. arXiv preprint arXiv:2410.20650. 2024.
  43. Gong, Ruihao, Yong, Yang, Gu, Shiqiao, Huang, Yushi, Lv, Chengtao, Zhang, Yunchen, Liu, Xianglong, Tao, Dacheng. Llmc: Benchmarking large language model quantization with a versatile compression toolkit. arXiv preprint arXiv:2405.06001. 2024.
  44. Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu. LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2024. https://openreview.net/forum?id=wmO7z57wNK
  45. Jin, Renren, Du, Jiangcun, Huang, Wuwei, Liu, Wei, Luan, Jian, Wang, Bin, Xiong, Deyi. A comprehensive evaluation of quantization strategies for large language models. Findings of the Association for Computational Linguistics ACL 2024. 2024.
  46. Fujii, Kazuki, Nakamura, Taishi, Yokota, Rio. Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs. arXiv preprint arXiv:2411.08719. 2024.
  47. Kalamkar, Dhiraj, Mudigere, Dheevatsa, Mellempudi, Naveen, Das, Dipankar, Banerjee, Kunal, Avancha, Sasikanth, Vooturi, Dharma Teja, Jammalamadaka, Nataraj, Huang, Jianyu, Yuen, Hector, others. A study of BFLOAT16 for deep learning training. arXiv preprint arXiv:1905.12322. 2019.
  48. Dettmers, Tim, Lewis, Mike, Belkada, Younes, Zettlemoyer, Luke. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems. 2022.
  49. Deutsch, P., Gailly, J.-L.. RFC1950: ZLIB Compressed Data Format Specification version 3.3. RFC Editor. 1996.
  50. Xu, Mengwei, Yin, Wangsong, Cai, Dongqi, Yi, Rongjie, Xu, Daliang, Wang, Qipeng, Wu, Bingyang, Zhao, Yihao, Yang, Chen, Wang, Shihe, others. A survey of resource-efficient llm and multimodal foundation models. arXiv preprint arXiv:2401.08092. 2024.
  51. Dao, Tri, Fu, Dan, Ermon, Stefano, Rudra, Atri, Ré, Christopher. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems. 2022.
  52. Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava. NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024. https://openreview.net/forum?id=4xDxVQHsbZ
  53. Tianyi Zhang, Anshumali Shrivastava. LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid. The Thirteenth International Conference on Learning Representations. 2025. https://openreview.net/forum?id=ISqx8giekS
  54. Tianyi Zhang, Jonah Wonkyu Yi, Zhaozhuo Xu, Anshumali Shrivastava. KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024. https://openreview.net/forum?id=pNnvzQsS4P
  55. Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.
  56. Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. Forty-first International Conference on Machine Learning. 2024.
  57. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations. 2022. https://openreview.net/forum?id=nZeVKeeFYf9
  58. Tianyi Zhang, Junda Su, Aditya Desai, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava. Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation. Forty-second International Conference on Machine Learning. 2025. https://openreview.net/forum?id=zZXOXhxO6I
  59. Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. Forty-first International Conference on Machine Learning. 2024. https://openreview.net/forum?id=hYHsrKDiX7
  60. Haochen Zhang, Junze Yin, Guanchu Wang, Zirui Liu, Lin Yang, Tianyi Zhang, Anshumali Shrivastava, Vladimir Braverman. Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining. The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025. https://openreview.net/forum?id=ZdmmOAN4h3
  61. Xia, Heming, Ge, Tao, Wang, Peiyi, Chen, Si-Qing, Wei, Furu, Sui, Zhifang. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. arXiv preprint arXiv:2203.16487. 2022.
  62. Yaniv Leviathan, Matan Kalman, Yossi Matias. Fast Inference from Transformers via Speculative Decoding. Fortieth International Conference on Machine Learning. 2023.
  63. Xia, Heming, Yang, Zhe, Dong, Qingxiu, Wang, Peiyi, Li, Yongqi, Ge, Tao, Liu, Tianyu, Li, Wenjie, Sui, Zhifang. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851. 2024.
  64. Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang. Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. Forty-first International Conference on Machine Learning. 2024. https://lmsys.org/blog/2023-11-21-lookahead-decoding/
  65. Anonymous. FAFO: Lossy KV Cache Compression for Lossless Inference Acceleration via Draftless Fumble Decoding. Submitted to The Fourteenth International Conference on Learning Representations. 2025. https://openreview.net/forum?id=oSk9tP5Mgs