The GGUF format has emerged as the standard for efficient LLM deployment, enabling local inference on consumer hardware through optimized quantization schemes. However, GGML's support for heterogeneous quantization and advanced model compression techniques can still be improved to match or surpass state-of-the-art academic results, which in turn would improve the achievable quality-compression tradeoffs.
This repository bridges that gap by providing DASLab's state-of-the-art model compression techniques for GGUF models. We implement:
- GPTQ quantization with direct GGUF export, offering improved quality compared to naive quantization approaches, by enabling error-correcting updates during the calibration process
- EvoPress evolutionary search for discovering optimal per-layer quantization configurations
- Model assembly tools for converting research configurations into production-ready GGUF models
The toolkit enables researchers and practitioners to move beyond uniform quantization, systematically exploring mixed-precision configurations that achieve better quality-compression tradeoffs. By bringing advanced compression research to the GGUF ecosystem, we aim to democratize access to efficient, high-quality quantized models.
Our intent is to provide an open-source implementation of GGUF dynamic quantization with non-uniform bitwidth optimization that performs on par with existing methods. Such functionality previously existed only in proprietary tools, so this fills a gap for the community and enables lossless or near-lossless models at low bitwidths with open-source methods.
The toolkit follows a three-stage pipeline for optimized GGUF quantization: creating the database for the search, running the search, and reassembling the model from the resulting configuration.
The goal of this first step is to create an accurate quantized model with uniform per-layer quantization. Specifically, it lets you create quantized model variants using either:
- Standard GGUF quantization formats (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) via the llama-quantize tool
- GPTQ quantization with K-quant export for higher quality compression
The second step produces a high-quality, non-uniformly quantized model. Specifically, one builds a database of quantized layers and uses the EvoPress algorithm to discover optimal per-layer configurations:
- Convert GGUF models into searchable layer databases using the mapper/ utilities
- Run evolutionary search to find mixed-precision configurations that maximize quality under size constraints
- Generate configuration files specifying the bitwidth for each layer
Transform EvoPress configurations into functional GGUF models:
- Use mapper/gguf_stitcher.py to reconstruct complete models from optimized layer configurations
- Handle HuggingFace ↔ GGUF naming conversions and metadata preservation
- Output production-ready GGUF files for deployment
The eval/ directory provides evaluation tools for perplexity testing, zero-shot benchmarks, and model inspection throughout the process.
This project explores GPTQ quantization combined with the K-Quants provided by the GGUF format.
The main goal is to extend GPTQ with K-Quants (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) to improve model quality and enable the first generation of GPTQ-GGUF models. This is non-trivial because some of the GGUF formats are quite complex.
- GPTQ: flexible, supports symmetric/asymmetric quantization and varying group sizes.
- K-Quants: asymmetric (except for Q3_K and Q5_K), fixed group sizes, double-quantization of scales.
By combining the two:
- We can calculate K-Quant scales and offsets without using an i-matrix.
- Then apply GPTQ for further improvements.
- This results in higher-quality GGUF quantized models.
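To make the combination described above concrete, here is a minimal, self-contained toy sketch (our own illustration, not the repository's implementation): per-group asymmetric scales and offsets are computed directly from the weights, and a simplified GPTQ pass then quantizes column by column while propagating the quantization error through the inverse-Hessian factor. The tensor shapes, 4-bit grid, and group size below are arbitrary demo choices.

```python
import torch

def make_group_params(W, group_size=32, n_bits=4):
    """Per-group (scale, zero) computed from the weights alone, no i-matrix."""
    qmax = 2 ** n_bits - 1
    Wg = W.reshape(W.shape[0], -1, group_size)
    w_min, w_max = Wg.min(dim=-1).values, Wg.max(dim=-1).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)
    return scale, zero                              # each of shape (rows, n_groups)

def quant_column(w, scale, zero, n_bits=4):
    """Asymmetric round-to-nearest for one weight column."""
    qmax = 2 ** n_bits - 1
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return scale * (q - zero)

def gptq_quantize(W, X, group_size=32, n_bits=4, damp=0.01):
    """Simplified GPTQ: quantize columns left to right and spread each column's
    quantization error onto the remaining columns via the Cholesky factor of
    the inverse Hessian H = 2 X X^T (X are calibration activations)."""
    W = W.clone()
    rows, cols = W.shape
    H = 2 * X @ X.T
    H += damp * H.diagonal().mean() * torch.eye(cols)
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    scale, zero = make_group_params(W, group_size, n_bits)   # grid fixed up front
    Q = torch.zeros_like(W)
    for i in range(cols):
        g = i // group_size
        Q[:, i] = quant_column(W[:, i], scale[:, g], zero[:, g], n_bits)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)
    return Q

if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(64, 128)                 # toy layer weights
    X = torch.randn(128, 512)                # toy calibration activations
    scale, zero = make_group_params(W)
    rtn = torch.stack([quant_column(W[:, i], scale[:, i // 32], zero[:, i // 32])
                       for i in range(W.shape[1])], dim=1)
    Q = gptq_quantize(W, X)
    print("RTN  layer-output error:", ((rtn - W) @ X).pow(2).mean().item())
    print("GPTQ layer-output error:", ((Q - W) @ X).pow(2).mean().item())
```

On this toy problem the GPTQ pass typically yields a lower layer-output error than plain round-to-nearest on the same grid, which is exactly the kind of improvement the toolkit targets for K-Quant grids.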
For background, see the original GPTQ paper and the description of K-Quants in the llama.cpp repo or in the Mind the Gap: A Practical Attack on GGUF Quantization paper.
To quantize a model, use the provided script:
bash run_quant.sh "Q4_K"
You can override defaults via environment variables (e.g. CUDA_VISIBLE_DEVICES=0,1) or by editing the script.
- --quantizable_modules '.*layers.*((q|k|v|o|gate|up|down)_proj)$': A regex defining which modules to quantize. Here it targets all projection layers inside transformer blocks (q, k, v, o, gate, up, down).
- --pre_block_modules model.embed_tokens: Modules applied before the transformer blocks (e.g. embeddings).
- --block_modules model.layers: The main transformer block stack.
- --post_block_modules lm_head: Modules applied after the transformer blocks (e.g. final classifier / output head).
- --quant_non_block_modules: If set, also quantizes the non-transformer-block modules (pre_block_modules + post_block_modules).
- --default_bit_width "${BITS:-Q4_K}": Default quantization scheme if no custom config is provided (e.g. Q4_K, Q6_K).
- --bit_width_configuration "${BIT_WIDTH_CONFIGURATION:-./config.json}": JSON file allowing per-layer bitwidth control. Example:

{
  "q_proj": "Q3_K",
  "k_proj": "Q2_K",
  "v_proj": "Q4_K",
  "o_proj": "Q5_K",
  "gate_proj": "Q6_K",
  "down_proj": "Q3_K",
  "up_proj": "Q4_K",
  "embed_tokens": "Q6_K",
  "lm_head": "Q6_K"
}
After quantization, results are stored in a directory (containing quantized weights and scale/offset metadata).
To package this into a .gguf file, use the adapted packing script:
python pack_gptq_into_gguf.py ./path-to-original-model \
--dir_model_quant ./path-to-quantized-model \
--outfile ./model.gguf \
--outtype f16
Notes:
- Q3_K and Q5_K are symmetric → shifted only during packing.
- Other bitwidths are shifted already during quantization.
- See llama.cpp code for detailed handling of K-Quant differences.
We provide a simple script to export existing GPTQ models stored in the compressed-tensors format into GGUF using Q4_0 quants. The script currently supports models that do not use dynamic reordering of scales during (de-)quantization and that use symmetric 4-bit quantization; these are the most commonly produced models.
Support for asymmetric quantization can be added with minimal effort, and the script has been designed to make such extensions straightforward. The implementation is adapted from the original GPTQ packing logic.
python pack_compressed_tensors_into_gguf.py ./path-to-original-model \
--dir_model_quant ./path-to-quantized-model \
--outfile ./model.gguf \
--qtype q4_0 \
--outtype f16
Notably, the script can also be adapted to support other storage formats that use different naming conventions for the individual parts of the quantized weights. To do this, simply adjust the mappings in lines 316–332 of the script.
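As a purely hypothetical illustration of what such a name mapping can look like (the authoritative table is in pack_compressed_tensors_into_gguf.py, lines 316–332; the suffixes and target names below are examples, not the script's actual identifiers):

```python
# Hypothetical example only: map the tensor-name suffixes used by a storage
# format onto the names the packer works with internally.
SUFFIX_MAP = {
    "weight_packed":     "qweight",      # packed integer weights
    "weight_scale":      "scales",       # per-group scales
    "weight_zero_point": "zeros",        # per-group zero points (asymmetric only)
    "weight_shape":      "shape",        # original (unpacked) tensor shape
}

def translate(name: str) -> str:
    """Rename a stored tensor to the internal naming convention."""
    for src, dst in SUFFIX_MAP.items():
        if name.endswith(src):
            return name[: -len(src)] + dst
    return name  # non-quantized tensors pass through unchanged

print(translate("model.layers.0.self_attn.q_proj.weight_packed"))
# -> model.layers.0.self_attn.q_proj.qweight
```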
Here's an example using meta-llama/Llama-3.2-1B-Instruct to demonstrate the full quantization optimization process:
Disclaimer: This is a simplified example for demonstration. Actual results may vary based on model, data, and parameters. Please adapt as needed.
The first step creates multiple quantized versions of your model. You have two approaches:
cd quant/gptq
# Run GPTQ quantization - this will create multiple bitwidth variants
./run_quant.sh "Q2_K"
./run_quant.sh "Q3_K"
./run_quant.sh "Q4_K"
./run_quant.sh "Q5_K"
./run_quant.sh "Q6_K"
# Creates:
# - Llama-3.2-1B-Instruct_Q2_K.gguf
# - Llama-3.2-1B-Instruct_Q4_K.gguf
# - Llama-3.2-1B-Instruct_Q6_K.gguf
# ...
cd quant/gguf
# Quantize to multiple formats
./run_quant.sh ./models/Llama-3.2-1B-Instruct.gguf "Q2_K Q3_K Q4_K Q5_K Q6_K"
# Creates:
# - Llama-3.2-1B-Instruct_Q2_K.gguf
# - Llama-3.2-1B-Instruct_Q4_K.gguf
# - Llama-3.2-1B-Instruct_Q6_K.gguf
# ...
The database creation process organizes quantized layers for efficient search. We use the automated database builder:
cd mapper
# Build the database from multiple quantized models (both GPTQ and K-Quants are supported)
./build_ep_database.sh --models \
../quant/gguf/quantized_models/Llama-3.2-1B-Instruct_Q2_K.gguf \
../quant/gguf/quantized_models/Llama-3.2-1B-Instruct_Q3_K.gguf \
../quant/gguf/quantized_models/Llama-3.2-1B-Instruct_Q4_K.gguf \
../quant/gguf/quantized_models/Llama-3.2-1B-Instruct_Q5_K.gguf \
../quant/gguf/quantized_models/Llama-3.2-1B-Instruct_Q6_K.gguf \
--output-dir ./ep_database
# This creates a unified database structure:
# ep_database/
# ├── models/ # Original GGUF files
# ├── layers-gguf/ # GGUF format layers organized by bitwidth
# │ ├── blk.0.attn_q.weight/
# │ │ ├── 2.5625-Q2_K.pth
# │ │ ├── 3.4375-Q3_K.pth
# │ │ ├── 4.5-Q4_K.pth
# │ │ └── ...
# ├── layers-hf/ # HuggingFace format layers (for EvoPress)
# └── manifest.json # Database metadata and structure (for debugging)
The build process automatically:
- Extracts layers from each GGUF model
- Organizes them by quantization level
- Creates both GGUF and HuggingFace naming conventions
- Generates metadata for efficient search
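For intuition, here is a rough sketch of the extraction step above. This is not build_ep_database.sh itself (the real database additionally encodes bits-per-weight in the file names and writes a manifest); it only shows how raw quantized tensors can be pulled out of a GGUF file with the gguf Python package.

```python
from pathlib import Path

import torch
from gguf import GGUFReader

def extract_layers(gguf_path: str, out_dir: str) -> None:
    """Dump every tensor of a GGUF file into a per-layer folder, keyed by its
    quantization type. The raw bytes stay GGUF-encoded; they are only re-packed
    later when the stitcher assembles the final model."""
    reader = GGUFReader(gguf_path)
    for t in reader.tensors:                       # gguf.ReaderTensor entries
        qtype = t.tensor_type.name                 # e.g. "Q4_K", "F32"
        layer_dir = Path(out_dir) / t.name         # e.g. blk.0.attn_q.weight/
        layer_dir.mkdir(parents=True, exist_ok=True)
        torch.save(
            {
                "name": t.name,
                "qtype": qtype,
                "shape": [int(d) for d in t.shape],
                "raw": torch.from_numpy(t.data.copy()),
            },
            layer_dir / f"{qtype}.pth",
        )

# extract_layers("Llama-3.2-1B-Instruct_Q4_K.gguf", "./ep_database/layers-gguf")
```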
EvoPress uses evolutionary search to discover optimal per-layer bitwidth assignments. The search balances model quality against compression constraints (for further details, see the EvoPress repo).
cd evopress
# Run evolutionary search targeting 4-bit average compression
python evo_quant_search.py \
--model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
--quant_weights_path ../mapper/ep_database/layers-hf \
--target_bitwidth 4.0 \
--generations 50 \
--offspring 128 \
--survivors_per_selection 16 4 1 \
--tokens_per_selection 2048 16384 131072 \
--calibration_data fineweb_edu \
--calibration_tokens 2097152 \
--fitness_fn kl \
--eval_datasets fineweb_edu wikitext2 c4
# Search process explanation:
# - Starts with random layer configurations averaging 4-bit compression
# - Each generation creates 128 offspring with mutated bitwidth assignments
# - Three-stage selection with increasing calibration data (2K → 16K → 131K tokens)
# - Uses KL divergence to measure quality degradation
# - Outputs optimal configuration after 50 generations
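The comment block above summarizes the procedure; below is a heavily simplified, self-contained sketch of the search loop (level-switch mutation plus multi-stage survivor selection). It is our illustration of the idea, not EvoPress itself: the real implementation enforces the size target exactly for unequal layer sizes, measures KL divergence against the full-precision model, and handles many more details.

```python
import random

LEVELS = [2.5625, 3.4375, 4.5, 5.5, 6.5625]   # approx. bits/weight of Q2_K..Q6_K

def mutate(parent, layers, n_swaps=2):
    """Move a few layers one level up and, to keep the average bitwidth roughly
    on target (assuming equal-sized layers), a matching number one level down."""
    child = dict(parent)
    for _ in range(n_swaps):
        up, down = random.sample(layers, 2)
        if child[up] < LEVELS[-1] and child[down] > LEVELS[0]:
            child[up] = LEVELS[LEVELS.index(child[up]) + 1]
            child[down] = LEVELS[LEVELS.index(child[down]) - 1]
    return child

def evo_search(layers, fitness, generations=50, offspring=128,
               survivors=(16, 4, 1), tokens=(2048, 16384, 131072)):
    """fitness(candidate, n_tokens) -> divergence vs. the full-precision model
    on n_tokens of calibration data (lower is better)."""
    parent = {name: 4.5 for name in layers}   # uniform start near the target
    for _ in range(generations):
        pool = [mutate(parent, layers) for _ in range(offspring)]
        # Multi-stage selection: cheap screening on few tokens first, then
        # re-rank the survivors on progressively larger calibration samples.
        for n_surv, n_tok in zip(survivors, tokens):
            pool.sort(key=lambda cand: fitness(cand, n_tok))
            pool = pool[:n_surv]
        parent = pool[0]
    return parent

if __name__ == "__main__":
    names = [f"blk.{i}.attn_q.weight" for i in range(8)]
    toy_fitness = lambda cand, n_tok: random.random()   # stand-in for a real KL eval
    print(evo_search(names, toy_fitness, generations=3, offspring=8))
```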
The search typically runs for multiple hours and produces a configuration file like:
model.layers.0.self_attn.q_proj: 2.5625 (2.5625-Q2_K.pth)
model.layers.0.self_attn.k_proj: 4.5 (4.5-Q4_K.pth)
model.layers.0.self_attn.v_proj: 6.5625 (6.5625-Q6_K.pth)
model.layers.0.self_attn.o_proj: 3.4375 (3.4375-Q3_K.pth)
model.layers.0.mlp.gate_proj: 5.5 (5.5-Q5_K.pth)
...
Convert the EvoPress output (using HuggingFace layer names) to GGUF-compatible naming:
cd mapper
# Convert HF naming convention to GGUF layer names
python config_converter.py \
../evopress/evo-kl-configuration-4.0.txt \
-o gguf_config.txt \
--verbose
# The converter handles mapping:
# model.layers.0.self_attn.q_proj → blk.0.attn_q.weight
# model.layers.0.self_attn.k_proj → blk.0.attn_k.weight
# model.layers.0.mlp.gate_proj → blk.0.ffn_gate.weight
# etc.
Result: gguf_config.txt with GGUF-compatible layer names:
blk.0.attn_q.weight: 2.5625 (2.5625-Q2_K.pth)
blk.0.attn_k.weight: 4.5 (4.5-Q4_K.pth)
blk.0.attn_v.weight: 6.5625 (6.5625-Q6_K.pth)
...
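For reference, a hypothetical sketch of the renaming rule for the projection weights (mapper/config_converter.py is the authoritative mapping and also covers norms, embeddings, and the remaining tensors):

```python
import re

# Assumed illustration of the HF -> GGUF suffix table for transformer blocks.
HF_TO_GGUF = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj":    "ffn_gate",
    "mlp.up_proj":      "ffn_up",
    "mlp.down_proj":    "ffn_down",
}

def hf_to_gguf(name: str) -> str:
    """Translate a HuggingFace layer name into llama.cpp's GGUF naming."""
    m = re.match(r"model\.layers\.(\d+)\.(.+)", name)
    if not m:
        return name
    idx, rest = m.groups()
    return f"blk.{idx}.{HF_TO_GGUF[rest]}.weight"

assert hf_to_gguf("model.layers.0.self_attn.q_proj") == "blk.0.attn_q.weight"
assert hf_to_gguf("model.layers.0.mlp.gate_proj") == "blk.0.ffn_gate.weight"
```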
Reconstruct the complete GGUF model using the optimized layer configuration:
Important: You might want to adapt the final configuration to add specific bitwidths for non-quantized layers (e.g. embeddings, lm_head) or to enforce a minimum bitwidth.
Adding entries for the remaining (non-block) layers at the desired quantization is necessary. As an example, the following lines are added to gguf_config.txt:
token_embd.weight: 5.5 (5.5-Q5_K.pth)
rope_freqs.weight: 32 (32-F32.pth)
output.weight: 6.5625 (6.5625-Q6_K.pth)
output_norm.weight: 32 (32-F32.pth)
The model can be stitched together afterwards.
# Stitch together the final optimized model
python gguf_stitcher.py \
./ep_database \
./Llama-3.2-1B-Instruct-optimized.gguf \
--config gguf_config.txt \
--original-model ../quant/gguf/models/Llama-3.2-1B-Instruct.gguf \
--default-quant-type "16 (16-F16.pth)"
# The stitcher:
# - Loads each layer at its specified bitwidth from the database
# - Preserves original model metadata (vocab, architecture, etc.)
# - Ensures proper GGUF format compliance
# - Validates layer compatibility and dimensions
The final model Llama-3.2-1B-Instruct-optimized.gguf is ready for deployment with llama.cpp or other GGUF-compatible inference engines.
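As a quick sanity check, the stitched file can be loaded like any other GGUF model, for example with the llama-cpp-python bindings (a separate package, not part of this toolkit):

```python
# Quick smoke test with llama-cpp-python; the model path matches the stitcher
# output from the step above.
from llama_cpp import Llama

llm = Llama(model_path="./Llama-3.2-1B-Instruct-optimized.gguf", n_ctx=2048)
out = llm("Briefly explain what the GGUF file format is.", max_tokens=64)
print(out["choices"][0]["text"])
```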
Comprehensive evaluation to verify the optimization results:
cd eval
# Perplexity evaluation across multiple datasets
python ppleval.py \
--model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
--quant_weights_path ../mapper/ep_database/layers-hf \
--quant_config_path ../mapper/gguf_config.txt \
--eval_datasets wikitext2 c4 fineweb_edu \
--eval_tokens 524288 \
--output_file optimization_results.json
# Zero-shot task evaluation
python lmeval.py \
--model_args "pretrained=meta-llama/Llama-3.2-1B-Instruct" \
--quant_weights_path ../mapper/ep_database/layers-hf \
--quant_config_path ../mapper/gguf_config.txt \
--tasks "hellaswag,arc_easy,winogrande,piqa,truthfulqa_mc" \
--output_path evaluation_results.json
# Compare against baseline uniform quantization
python lmeval.py \
--model_args "pretrained=meta-llama/Llama-3.2-1B-Instruct" \
--quant_weights_path ../mapper/ep_database/layers-hf \
--quant_default_level 4 \
--tasks "hellaswag,arc_easy,winogrande,piqa,truthfulqa_mc" \
--output_path baseline_results.json
This section presents evaluation results for Llama 3.2 1B Instruct and, more importantly, Llama 3.1 8B Instruct, comparing our quantization methods against Unsloth's Dynamic 2.0 GGUFs, which are among the most popular non-uniform GGUF quantizations in the community.
The evaluation for the 8B model compares our GPTQ and EvoPress quantization approaches against Unsloth's very popular dynamic quantizations. To ensure a fair comparison, we matched the exact bitwidths used by Unsloth's models. While not intended as an exhaustive comparison, these results demonstrate that our toolset produces models with comparable or superior performance characteristics. All EvoPress evaluations used importance matrix (I-Matrix) calibration to enhance quantization quality. The importance matrix identifies and preserves critical weights during compression by sampling activation statistics across calibration data, keeping quantization error low in the most impactful parts of the model. The I-Matrix approach is particularly effective for mixed-precision quantization, as the search can make better use of the lower-bitwidth quantizations.
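For intuition, here is a toy sketch of the kind of statistic an importance matrix captures: per-input-channel squared-activation sums for each Linear layer, accumulated over calibration batches. This is our simplified illustration, not the toolkit's code; llama.cpp's llama-imatrix tool computes and stores its own variant of this during inference.

```python
import torch

def collect_activation_stats(model, calib_batches):
    """Accumulate per-input-channel squared-activation sums for every Linear
    layer; channels that carry more energy are more sensitive to quantization
    error and therefore benefit from better scales or more bits."""
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().reshape(-1, inputs[0].shape[-1]).float()
            stats.setdefault(name, torch.zeros(x.shape[-1], device=x.device))
            stats[name] += (x ** 2).sum(dim=0)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        for batch in calib_batches:
            model(batch)
    for h in hooks:
        h.remove()
    return stats
```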
For Llama 3.1 8B Instruct, we compare against Unsloth's Dynamic 2.0 GGUFs (Llama-3.1-8B-Instruct-GGUF).
The following tables present perplexity measurements across three standard datasets, comparing our methods against Unsloth's dynamic quantizations. Lower perplexity values indicate better model quality preservation.
Method | Average Bit Width | C4 PPL | FineWeb-Edu PPL | WikiText2 PPL |
---|---|---|---|---|
Baseline | ||||
F32 (Full Precision) | 32.00 | 11.01 | 9.14 | 6.73 |
Unsloth Dynamic Quantizations | ||||
UD-IQ2_M | 2.71 | 13.99 | 11.20 | 8.41 |
UD-Q2_K_XL | 3.01 | 13.75 | 11.13 | 8.29 |
UD-Q3_K_XL | 3.97 | 11.85 | 9.56 | 7.11 |
UD-Q4_K_XL | 4.88 | 11.23 | 9.30 | 6.84 |
UD-Q5_K_XL | 5.66 | 11.16 | 9.18 | 6.79 |
UD-Q6_K_XL | 7.11 | 11.05 | 9.14 | 6.75 |
UD-Q8_K_XL | 9.70 | 11.01 | 9.14 | 6.74 |
Standard Unsloth K-Quants | ||||
Q2_K | 2.90 | 14.73 | 11.67 | 8.73 |
Q3_K_M | 3.84 | 11.88 | 9.66 | 7.18 |
Q4_K_M | 4.80 | 11.35 | 9.34 | 6.87 |
Q5_K_M | 5.65 | 11.18 | 9.18 | 6.79 |
Q6_K | 6.56 | 11.05 | 9.16 | 6.76 |
GPTQ Quantization | ||||
GPTQ-2 | 2.56 | 17.94 | 14.10 | 10.96 |
GPTQ-3 | 3.44 | 13.12 | 10.55 | 8.02 |
GPTQ-4 | 4.50 | 11.35 | 9.36 | 6.89 |
GPTQ-5 | 5.50 | 11.12 | 9.21 | 6.79 |
GPTQ-6 | 6.56 | 11.13 | 9.14 | 6.75 |
EvoPress with GPTQ | ||||
EvoPress-GPTQ-3 | 3.47 | 13.17 | 10.66 | 7.89 |
EvoPress-GPTQ-4 | 4.50 | 11.35 | 9.36 | 6.89 |
EvoPress-GPTQ-5 | 5.51 | 11.13 | 9.21 | 6.79 |
EvoPress with K-Quants (I-Matrix) | ||||
EvoPress-kQuant-im-2.71 | 2.71 | 14.10 | 11.29 | 8.52 |
EvoPress-kQuant-im-3 | 3.46 | 12.70 | 10.16 | 7.63 |
EvoPress-kQuant-im-3.01 | 3.01 | 13.17 | 10.66 | 7.89 |
EvoPress-kQuant-im-3.97 | 3.97 | 11.60 | 9.56 | 6.99 |
EvoPress-kQuant-im-4 | 4.51 | 11.40 | 9.36 | 6.91 |
EvoPress-kQuant-im-4.88 | 4.88 | 11.18 | 9.23 | 6.84 |
EvoPress-kQuant-im-5 | 5.50 | 11.07 | 9.16 | 6.79 |
Zero-shot task performance demonstrates how different quantization methods preserve the model's reasoning and comprehension capabilities across standard benchmarks.
Method | Avg Bit Width | ARC Easy (norm) | HellaSwag (norm) | PIQA (norm) | WinoGrande | Avg Score |
---|---|---|---|---|---|---|
Baseline | ||||||
F32 | 32.00 | 0.796 | 0.792 | 0.810 | 0.742 | 0.785 |
Unsloth Dynamic Quantizations | ||||||
UD-IQ2_M | 2.71 | 0.752 | 0.742 | 0.790 | 0.702 | 0.747 |
UD-Q2_K_XL | 3.01 | 0.768 | 0.747 | 0.786 | 0.720 | 0.755 |
UD-Q3_K_XL | 3.97 | 0.790 | 0.778 | 0.808 | 0.737 | 0.778 |
UD-Q4_K_XL | 4.88 | 0.796 | 0.787 | 0.814 | 0.732 | 0.783 |
UD-Q5_K_XL | 5.66 | 0.795 | 0.790 | 0.806 | 0.739 | 0.782 |
UD-Q6_K_XL | 7.11 | 0.795 | 0.792 | 0.807 | 0.737 | 0.783 |
UD-Q8_K_XL | 9.70 | 0.795 | 0.792 | 0.808 | 0.740 | 0.784 |
Standard Unsloth K-Quants | ||||||
Q2_K | 2.90 | 0.733 | 0.737 | 0.781 | 0.713 | 0.741 |
Q3_K_M | 3.84 | 0.782 | 0.776 | 0.806 | 0.735 | 0.775 |
Q4_K_M | 4.80 | 0.794 | 0.786 | 0.807 | 0.740 | 0.782 |
Q5_K_M | 5.65 | 0.793 | 0.790 | 0.810 | 0.738 | 0.783 |
Q6_K | 6.56 | 0.795 | 0.793 | 0.809 | 0.736 | 0.784 |
GPTQ Quantization | ||||||
GPTQ-2 | 2.56 | 0.644 | 0.684 | 0.744 | 0.669 | 0.685 |
GPTQ-3 | 3.44 | 0.773 | 0.762 | 0.789 | 0.724 | 0.762 |
GPTQ-4 | 4.50 | 0.788 | 0.787 | 0.805 | 0.735 | 0.779 |
GPTQ-5 | 5.50 | 0.796 | 0.792 | 0.810 | 0.742 | 0.785 |
GPTQ-6 | 6.56 | 0.798 | 0.791 | 0.810 | 0.744 | 0.786 |
EvoPress with GPTQ | ||||||
EvoPress-GPTQ-3 | 3.47 | 0.753 | 0.748 | 0.791 | 0.715 | 0.751 |
EvoPress-GPTQ-4 | 4.50 | 0.788 | 0.787 | 0.805 | 0.735 | 0.779 |
EvoPress-GPTQ-5 | 5.51 | 0.803 | 0.792 | 0.810 | 0.749 | 0.788 |
EvoPress with K-Quants (I-Matrix) | ||||||
EvoPress-kQuant-im-2.71 | 2.71 | 0.765 | 0.750 | 0.792 | 0.710 | 0.754 |
EvoPress-kQuant-im-3 | 3.46 | 0.783 | 0.772 | 0.795 | 0.732 | 0.770 |
EvoPress-kQuant-im-3.01 | 3.01 | 0.777 | 0.759 | 0.791 | 0.706 | 0.758 |
EvoPress-kQuant-im-3.97 | 3.97 | 0.792 | 0.784 | 0.804 | 0.743 | 0.781 |
EvoPress-kQuant-im-4 | 4.51 | 0.799 | 0.785 | 0.806 | 0.732 | 0.780 |
EvoPress-kQuant-im-4.88 | 4.88 | 0.794 | 0.788 | 0.808 | 0.737 | 0.782 |
EvoPress-kQuant-im-5 | 5.50 | 0.794 | 0.792 | 0.811 | 0.740 | 0.784 |
The evaluation results demonstrate several insights about the performance of our quantization toolset compared to Unsloth's dynamic quantizations:
Competitive Performance at Matched Bitwidths demonstrates that our GPTQ and EvoPress methods achieve comparable or superior results when evaluated at identical bitwidths. For instance, at approximately 4.88 bits, our EvoPress-kQuant-im-4.88 achieves a C4 perplexity of 11.18 compared to 11.23 for Unsloth's UD-Q4_K_XL, while maintaining similar zero-shot performance across benchmarks.
EvoPress Optimization Excellence is particularly evident at lower bitwidths where compression challenges are most severe. The evolutionary search process successfully identifies optimal per-layer configurations that preserve model quality better than uniform approaches. At 3.01 bits, our EvoPress-kQuant-im-3.01 configuration achieves significantly better perplexity than Unsloth's UD-Q2_K_XL (13.17 vs 13.75 on C4) while maintaining comparable downstream task performance.
GPTQ Foundation Strength is evident even without evolutionary optimization: standard GPTQ quantization at 4.5 bits achieves perplexity scores that rival or exceed more complex dynamic quantization schemes, validating its effectiveness as a core quantization method in our toolkit.
We also evaluated our quantization methods on the smaller Llama 3.2 1B Instruct model to demonstrate the versatility of our toolkit across different model sizes. The results below highlight the performance of various quantization strategies, including standard K-Quants, GPTQ quantization, and EvoPress-optimized configurations.
Perplexity measurements across three standard datasets demonstrate the quality-compression tradeoffs achieved by different quantization methods. Lower perplexity values indicate better model quality.
Method | Average Bit Width | C4 PPL | FineWeb-Edu PPL | WikiText2 PPL |
---|---|---|---|---|
Standard K-Quants | ||||
Q3_K_S | 3.44 | 30.45 | 21.59 | 17.69 |
Q4_K_S | 4.54 | 21.67 | 15.95 | 12.13 |
Q4_1 | 4.50 | 21.55 | 15.92 | 12.04 |
GPTQ Quantization | ||||
GPTQ-3 | 3.44 | 29.57 | 21.22 | 16.84 |
GPTQ-4 | 4.50 | 21.25 | 15.86 | 12.16 |
GPTQ-5 | 5.50 | 21.05 | 15.70 | 11.67 |
GPTQ-6 | 6.56 | 20.80 | 15.58 | 11.58 |
EvoPress with K-Quants | ||||
EvoPress-kQuant-3 | 3.46 | 26.66 | 19.39 | 15.13 |
EvoPress-kQuant-4 | 4.50 | 21.47 | 15.92 | 12.30 |
EvoPress-kQuant-5 | 5.50 | 21.38 | 15.80 | 11.76 |
EvoPress with GPTQ | ||||
EvoPress-GPTQ-3 | 3.45 | 27.23 | 20.57 | 16.52 |
EvoPress-GPTQ-4 | 4.50 | 21.13 | 15.89 | 12.04 |
EvoPress-GPTQ-5 | 5.51 | 21.05 | 15.70 | 11.67 |
EvoPress with I-Matrix | ||||
EvoPress-kQuant-Imatrix-3 | 3.44 | 24.05 | 18.39 | 14.64 |
EvoPress-kQuant-Imatrix-4 | 4.50 | 21.55 | 15.98 | 12.04 |
EvoPress-kQuant-Imatrix-5 | 5.50 | 20.84 | 15.64 | 11.60 |
Zero-shot task performance across standard benchmarks shows how different quantization methods preserve the model's reasoning and comprehension capabilities. Higher accuracy values indicate better performance.
Method | Average Bit Width | ARC Easy | HellaSwag | HellaSwag (norm) | PIQA (norm) |
---|---|---|---|---|---|
Standard K-Quants | |||||
Q3_K_S | 3.44 | 0.611 | 0.408 | 0.536 | 0.715 |
Q4_K_S | 4.54 | 0.679 | 0.446 | 0.603 | 0.733 |
Q4_1 | 5.00 | 0.673 | 0.447 | 0.604 | 0.732 |
GPTQ Quantization | |||||
GPTQ-3 | 3.44 | 0.588 | 0.404 | 0.536 | 0.692 |
GPTQ-4 | 4.50 | 0.675 | 0.446 | 0.600 | 0.729 |
GPTQ-5 | 5.50 | 0.678 | 0.449 | 0.604 | 0.744 |
GPTQ-6 | 6.56 | 0.685 | 0.450 | 0.607 | 0.742 |
EvoPress with K-Quants | |||||
EvoPress-kQuant-3 | 3.46 | 0.640 | 0.418 | 0.558 | 0.720 |
EvoPress-kQuant-4 | 4.50 | 0.678 | 0.445 | 0.606 | 0.743 |
EvoPress-kQuant-5 | 5.50 | 0.679 | 0.449 | 0.606 | 0.742 |
EvoPress with GPTQ | |||||
EvoPress-GPTQ-3 | 3.45 | 0.596 | 0.403 | 0.528 | 0.700 |
EvoPress-GPTQ-4 | 4.50 | 0.683 | 0.447 | 0.601 | 0.733 |
EvoPress-GPTQ-5 | 5.51 | 0.680 | 0.448 | 0.604 | 0.740 |
EvoPress with I-Matrix | |||||
EvoPress-kQuant-Imatrix-3 | 3.44 | 0.639 | 0.422 | 0.563 | 0.718 |
EvoPress-kQuant-Imatrix-4 | 4.50 | 0.673 | 0.442 | 0.598 | 0.732 |
EvoPress-kQuant-Imatrix-5 | 5.50 | 0.683 | 0.451 | 0.607 | 0.745 |
The evaluation results demonstrate several insights:
GPTQ Quantization consistently outperforms standard K-Quants at equivalent bitwidths, achieving lower perplexity scores across all datasets. For example, at approximately 4.5 bits, GPTQ-4 achieves a C4 perplexity of 21.25 compared to 21.67 for Q4_K_S, representing a meaningful quality improvement without additional storage overhead.
EvoPress Optimization further enhances quantization quality by discovering optimal per-layer bitwidth configurations. The evolutionary search process identifies which layers are most sensitive to quantization and allocates bits accordingly, achieving better quality-compression tradeoffs than uniform quantization approaches.
Importance Matrix Integration provides the most significant improvements, particularly at lower bitwidths. The EvoPress-kQuant-Imatrix-3 configuration achieves remarkably better perplexity than standard Q3_K_S quantization (24.05 vs 30.45 on C4), demonstrating the value of importance-aware quantization for aggressive compression scenarios.
The zero-shot benchmark results confirm that these perplexity improvements translate to better downstream task performance, with advanced quantization methods maintaining higher accuracy across reasoning and comprehension tasks while achieving equivalent compression ratios.
Several directions could enhance the toolkit's capabilities:
- Mixture-of-Experts (MoE) Support: Extend quantization and full search support to MoE architectures like Mixtral, Qwen MoE, and DeepSeek.
- More ready-to-use models: Distribute more non-uniform models quantized with our techniques, eliminating the need for users to run the full pipeline.
- Calibration Selection: Select optimal calibration datasets based on model characteristics and deployment scenarios.
- Hardware-Aware Optimization: Integrate device-specific constraints into the EvoPress search to optimize quantization for target hardware (edge devices, GPUs, etc.).
- datasets 4.0.0
- gguf 0.17.1
- numpy 2.3.2
- pandas 2.3.1
- torch 2.8.0
- torchaudio 2.7.1
- torchvision 0.22.1
- tqdm 4.67.1
- tqdm-multiprocess 0.0.11
- transformers 4.56.0.dev0
- lm-eval 0.4.9.1
- wandb 0.21.0
The DASLab GGUF Toolkit was developed jointly by Max Kleinegger and Michael Helcig, with support from the full lab team.
@article{frantar-gptq,
title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
year={2022},
journal={arXiv preprint arXiv:2210.17323}
}
@article{sieberling-evopress,
title={{EvoPress}: Accurate Dynamic Model Compression via Evolutionary Search},
author={Oliver Sieberling and Denis Kuznedelev and Eldar Kurtic and Dan Alistarh},
year={2025},
journal={arXiv preprint arXiv:2410.14649}
}
@misc{egashira2025mindgappracticalattack,
title={Mind the Gap: A Practical Attack on GGUF Quantization},
author={Kazuki Egashira and Robin Staab and Mark Vero and Jingxuan He and Martin Vechev},
year={2025},
eprint={2505.23786},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2505.23786},
}