This repository is the official implementation of "Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations".
To install requirements and the bof4 package:
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
pip install -e .
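Since flash-attn is pinned to a specific version above, it can be useful to confirm the installed versions match before running anything. The following is a small sanity-check sketch (not part of the repository) using only the standard library:

```python
from importlib import metadata

def check_pins(pins):
    """For each package, return (installed_version, matches_pin).

    installed_version is None if the package is not installed.
    """
    results = {}
    for pkg, expected in pins.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None
        results[pkg] = (installed, installed == expected)
    return results

# flash-attn is the version pinned in the install step above.
print(check_pins({"flash-attn": "2.7.3"}))
```

If a pin does not match, re-run the corresponding `pip install` command.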
To run QLoRA fine-tuning with configurable quantizers, use scripts/finetune.py.
The files config/finetune_code.yaml and config/finetune_instruct.yaml
contain the configurations for reproducing the fine-tuning experiments from the paper.
All quantizer codebooks used in the paper can be found in the codebooks directory.
For example, to fine-tune with BOF4-S quantization at block size 64, run:
python scripts/finetune.py --config config/finetune_code.yaml --quantizer codebooks/bof4/bof4-s_mse_64.yaml
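The quantizers configured here follow the block-wise codebook scheme that BOF4 builds on: weights are split into blocks (e.g. of size 64), each block is normalized by a per-block scale, and each normalized value is mapped to its nearest codebook entry. The following is an illustrative sketch of that general scheme, assuming absmax scaling as in NF4; it is not the repository's implementation, and the toy codebook below is a placeholder for the actual BOF4 codebooks stored in the YAML files:

```python
def blockwise_quantize(weights, codebook, block_size=64):
    """Dequantized result of block-wise codebook quantization.

    Each block is scaled by its absolute maximum, every normalized
    value is snapped to the nearest codebook entry, and the result
    is rescaled back. Returns the dequantized weights.
    """
    dequantized = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Per-block absmax scale; guard against an all-zero block.
        scale = max(abs(w) for w in block) or 1.0
        for w in block:
            # Nearest-neighbor lookup in the normalized codebook.
            q = min(codebook, key=lambda c: abs(w / scale - c))
            dequantized.append(q * scale)
    return dequantized

# Toy 5-entry codebook standing in for a real 16-entry (4-bit) one.
toy_codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(blockwise_quantize([0.9, -0.1, 0.45, 2.0], toy_codebook, block_size=2))
```

A real 4-bit quantizer uses a 16-entry codebook; BOF4 optimizes those entries (e.g. for MSE, as the `mse` in the codebook filename suggests) rather than using the fixed NF4 values.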
To evaluate a model with quantization on the set of benchmarks used in the paper, run:
python scripts/eval.py -m meta-llama/Llama-3.2-3B -q codebooks/bof4/bof4-s_mse_64.yaml
For a full list of options run:
python scripts/eval.py -h
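To compare several quantizer configurations, the eval command above can be generated per (model, codebook) pair. The following is a hypothetical sweep helper (not part of the repository) that only builds the command strings shown above:

```python
import shlex

def eval_command(model, codebook_yaml):
    """Build the scripts/eval.py invocation for one (model, codebook) pair.

    Hypothetical helper: it assembles the same command line shown above,
    with shell-safe quoting via shlex.join.
    """
    args = ["python", "scripts/eval.py", "-m", model, "-q", codebook_yaml]
    return shlex.join(args)

# Example: sweep one model over a list of codebook configs.
for cb in ["codebooks/bof4/bof4-s_mse_64.yaml"]:
    print(eval_command("meta-llama/Llama-3.2-3B", cb))
```

Each printed line can be run directly in the shell, or passed to a job scheduler.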