Made in Vancouver, Canada by Picovoice
This repository is a minimalist and extensible framework for benchmarking LLM compression algorithms.
GPTQ is arguably the most popular quantization algorithm for LLMs. GPTQ fully reconstructs weights so that the quantized version closely mimics the full-precision one.
picoLLM Compression is Picovoice's in-house LLM compression algorithm. Given a target size, picoLLM optimally distributes the available bits within and across an LLM's weights.
MMLU (Massive Multitask Language Understanding) is a multiple-choice dataset that measures models' natural language understanding.
ARC (AI2 Reasoning Challenge) is a multiple-choice dataset that measures models' reasoning ability. The ARC dataset has two partitions: Easy and Challenge. We perform the benchmark on both partitions and report the results separately.
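Both MMLU and ARC are multiple-choice tasks, so evaluation boils down to asking which answer the model finds most likely. The sketch below shows one common way to do this with a Hugging Face causal language model: score each candidate answer by its log-likelihood given the question and pick the highest. The model name is a placeholder, and the repository's mmlu.py / arc.py may prompt and score differently.

```python
# Minimal sketch of multiple-choice scoring: pick the answer with the highest
# log-likelihood given the question. Model name is illustrative; the repository's
# mmlu.py / arc.py may prompt and score differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_URI = "microsoft/phi-2"  # illustrative

tokenizer = AutoTokenizer.from_pretrained(MODEL_URI)
model = AutoModelForCausalLM.from_pretrained(MODEL_URI)
model.eval()


def answer_log_likelihood(question: str, answer: str) -> float:
    # Sum of log-probabilities of the answer tokens, conditioned on the question.
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    answer_positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[i, targets[i]].item() for i in answer_positions)


def pick_answer(question: str, choices: list) -> int:
    # Index of the choice the model finds most likely.
    scores = [answer_log_likelihood(question, c) for c in choices]
    return max(range(len(scores)), key=lambda i: scores[i])
```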
Perplexity measures the models' language modeling capabilities.
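Concretely, perplexity is the exponential of the average per-token negative log-likelihood over the evaluation text. Below is a minimal sketch with a Hugging Face causal language model; the model name is a placeholder, and perplexity.py may window and batch the text differently.

```python
# Minimal sketch: perplexity is exp(average negative log-likelihood per token).
# Model name is illustrative; perplexity.py may window and batch the text differently.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_URI = "microsoft/phi-2"  # illustrative

tokenizer = AutoTokenizer.from_pretrained(MODEL_URI)
model = AutoModelForCausalLM.from_pretrained(MODEL_URI)
model.eval()


def perplexity(text: str) -> float:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels set, the model returns the mean token-level cross-entropy loss.
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())
```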
The '/res' folder contains all the data required for the benchmark. To reproduce it, follow the sections below.
Download the MMLU dataset and run the following from the repository's root to extract and format it:
python3 data/mmlu.py --dataset-folder ${DATASET_FOLDER}
Download the ARC dataset and run the following from the repository's root to extract and format the Challenge portion:
python3 data/arc.py --dataset-folder ${DATASET_FOLDER}
Perform the above for the Easy portion:
python3 data/arc.py --dataset-folder ${DATASET_FOLDER} --easy
For the perplexity measurement, we use 128 randomly selected text snippets from the validation portion of the C4 dataset. Once you download the dataset, run the following from the root of the repository to extract and normalize the data:
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${VALIDATION_FOLDER} \
--portion validation
Replace ${REPOSITORY_FOLDER} with the path to the downloaded dataset repository and ${VALIDATION_FOLDER} with a folder to hold the normalized data.
Then we sample 128 sequences from the normalized data:
python3 data/c4-sample.py \
--dataset-folder ${VALIDATION_FOLDER} \
--portion valid
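Conceptually, this step just draws 128 snippets at random from the normalized data. A simplified sketch follows; the file layout, output format, and seed are assumptions rather than a copy of c4-sample.py.

```python
# Simplified sketch of drawing 128 random snippets from the normalized C4 data.
# File layout, output format, and seed are assumptions, not c4-sample.py's actual behavior.
import json
import random
from pathlib import Path

NORMALIZED_FOLDER = Path("path/to/validation_folder")  # i.e., ${VALIDATION_FOLDER}
NUM_SNIPPETS = 128

random.seed(666)  # any fixed seed, for reproducibility

snippets = [p.read_text(encoding="utf-8") for p in sorted(NORMALIZED_FOLDER.glob("*.txt"))]
sampled = random.sample(snippets, k=NUM_SNIPPETS)

with open("c4-valid-sample.json", "w", encoding="utf-8") as f:
    json.dump(sampled, f, indent=2)
```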
We need a sample dataset for quantization algorithms (GPTQ, picoLLM). We use 128 randomly selected text snippets from the train portion of the C4 dataset. Once you download the dataset, run the following from the root of the repository to extract and normalize the data:
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${TRAIN_FOLDER} \
--portion train
Replace ${REPOSITORY_FOLDER} with the path to the downloaded dataset repository and ${TRAIN_FOLDER} with a folder to hold the normalized data.
Then we sample 128 sequences from the normalized data:
python3 data/c4-sample.py \
--dataset-folder ${TRAIN_FOLDER} \
--portion train
We use six models:
Gemma-2b
Gemma-7b
Llama-2-7b
Llama-3-8b
Mistral-7b-v0.1
Phi-2
The corresponding picoLLM Compressed models are available on Picovoice Console. We create GPTQ models using the AutoGPTQ package. You can quantize the models by running the following:
python3 model/autogptq.py \
--model-uri ${MODEL_URI} \
--quantized-model-folder ${QUANTIZED_MODEL_FOLDER} \
--bits ${BITS}
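Under the hood, model/autogptq.py relies on AutoGPTQ. The sketch below shows the core AutoGPTQ calls with the settings reported in the Results section (group size 128, damp percent 0.1, activation reordering); the calibration-data handling and the model and folder names are assumptions, not the script itself.

```python
# Minimal sketch of GPTQ quantization with AutoGPTQ, using the settings reported in
# the Results section (group size 128, damp percent 0.1, activation reordering).
# Calibration-data handling and paths are assumptions, not model/autogptq.py itself.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

MODEL_URI = "meta-llama/Llama-2-7b-hf"           # illustrative
QUANTIZED_MODEL_FOLDER = "llama-2-7b-gptq-4bit"  # illustrative

tokenizer = AutoTokenizer.from_pretrained(MODEL_URI)

quantize_config = BaseQuantizeConfig(
    bits=4,            # ${BITS}: 2, 3, or 4
    group_size=128,
    damp_percent=0.1,
    desc_act=True,     # activation reordering
)

model = AutoGPTQForCausalLM.from_pretrained(MODEL_URI, quantize_config)

# Calibration examples, e.g., the 128 C4 train snippets prepared above.
calibration_texts = ["..."]  # load the sampled snippets here
examples = [tokenizer(t, return_tensors="pt") for t in calibration_texts]
examples = [{"input_ids": e["input_ids"], "attention_mask": e["attention_mask"]} for e in examples]

model.quantize(examples)
model.save_quantized(QUANTIZED_MODEL_FOLDER)
```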
To measure the MMLU score for a given model, run the following:
python3 mmlu.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
Replace ${COMPRESSION} with the model's compression method: NONE for full-precision models, GPTQ, or picoLLM.
To measure the ARC score for a given model, run the following:
python3 arc.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
Replace ${COMPRESSION} with the model's compression method: NONE for full-precision models, GPTQ, or picoLLM.
To measure the perplexity for a given model, run the following:
python3 perplexity.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
Replace ${COMPRESSION} with the model's compression method: NONE for full-precision models, GPTQ, or picoLLM.
When running picoLLM Compressed models, you must also provide your Picovoice AccessKey, which is available on Picovoice Console.
... --picollm-access-key ${PICOLLM_ACCESS_KEY}
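For example, a complete perplexity run on a picoLLM-compressed model combines the flags shown above:
python3 perplexity.py \
--compression picoLLM \
--model-uri ${MODEL_URI} \
--picollm-access-key ${PICOLLM_ACCESS_KEY}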
Below are our benchmark results comparing GPTQ against picoLLM for all models. We perform 2-, 3-, and 4-bit quantization using GPTQ, find each resulting model's size in GB, and set that as the target size for picoLLM Compression. Hence, each GPTQ model and its picoLLM counterpart have the same size in bytes. When performing GPTQ, we set the group size to 128, set the damp percent to 0.1, and enable activation reordering.
The table below depicts the MMLU score of the original models.
| Model | MMLU |
| --- | --- |
| Gemma-2b 5.0G | 40.21 |
| Gemma-7b 17.1G | 64.48 |
| Llama-2-7b 13.5G | 46.38 |
| Llama-3-8b 16.1G | 64.88 |
| Mistral-7b-v0.1 15.0G | 62.41 |
| Phi-2 5.6G | 56.04 |
The table below depicts the MMLU score of the quantized models.
| Model | GPTQ | picoLLM |
| --- | --- | --- |
| Gemma-2b 3.1G | 39.07 | 41.12 |
| Gemma-2b 2.9G | 27.51 | 41.12 |
| Gemma-2b 2.6G | 24.93 | 41.12 |
| Gemma-7b 7.2G | 62.58 | 64.98 |
| Gemma-7b 6.2G | 53.30 | 64.57 |
| Gemma-7b 5.2G | 25.58 | 64.32 |
| Llama-2-7b 3.9G | 45.26 | 44.99 |
| Llama-2-7b 3.1G | 40.40 | 40.68 |
| Llama-2-7b 2.3G | 25.36 | 28.72 |
| Llama-3-8b 5.7G | 63.09 | 64.96 |
| Llama-3-8b 4.9G | 53.86 | 64.76 |
| Llama-3-8b 4.0G | 25.05 | 61.26 |
| Mistral-7b-v0.1 4.2G | 61.00 | 59.19 |
| Mistral-7b-v0.1 3.3G | 23.73 | 57.72 |
| Mistral-7b-v0.1 2.4G | 25.70 | 43.53 |
| Phi-2 1.8G | 54.61 | 54.11 |
| Phi-2 1.5G | 50.64 | 52.24 |
| Phi-2 1.2G | 26.05 | 48.86 |
The table below depicts the ARC Easy score of the original models.
| Model | ARC Easy |
| --- | --- |
| Gemma-2b 5.0G | 33.75 |
| Gemma-7b 17.1G | 75.51 |
| Llama-2-7b 13.5G | 44.87 |
| Llama-3-8b 16.1G | 75.80 |
| Mistral-7b-v0.1 15.0G | 80.56 |
| Phi-2 5.6G | 75.25 |
The table below depicts the ARC Easy score of the quantized models.
| Model | GPTQ | picoLLM |
| --- | --- | --- |
| Gemma-2b 3.1G | 30.39 | 34.39 |
| Gemma-2b 2.9G | 24.37 | 34.39 |
| Gemma-2b 2.6G | 23.82 | 34.39 |
| Gemma-7b 7.2G | 76.52 | 84.18 |
| Gemma-7b 6.2G | 44.28 | 84.51 |
| Gemma-7b 5.2G | 23.95 | 84.13 |
| Llama-2-7b 3.9G | 39.23 | 41.96 |
| Llama-2-7b 3.1G | 32.95 | 33.96 |
| Llama-2-7b 2.3G | 23.91 | 24.49 |
| Llama-3-8b 5.7G | 72.85 | 78.83 |
| Llama-3-8b 4.9G | 43.39 | 77.02 |
| Llama-3-8b 4.0G | 24.71 | 71.76 |
| Mistral-7b-v0.1 4.2G | 77.27 | 73.95 |
| Mistral-7b-v0.1 3.3G | 23.91 | 72.10 |
| Mistral-7b-v0.1 2.4G | 24.92 | 46.46 |
| Phi-2 1.8G | 70.45 | 75.04 |
| Phi-2 1.5G | 56.61 | 70.66 |
| Phi-2 1.2G | 22.10 | 62.42 |
The table below depicts the ARC Challenge score of the original models.
| Model | ARC Challenge |
| --- | --- |
| Gemma-2b 5.0G | 30.38 |
| Gemma-7b 17.1G | 64.93 |
| Llama-2-7b 13.5G | 37.03 |
| Llama-3-8b 16.1G | 63.05 |
| Mistral-7b-v0.1 15.0G | 67.49 |
| Phi-2 5.6G | 61.60 |
The table below depicts the ARC Challenge score of the quantized models.
| Model | GPTQ | picoLLM |
| --- | --- | --- |
| Gemma-2b 3.1G | 26.37 | 30.97 |
| Gemma-2b 2.9G | 23.55 | 30.97 |
| Gemma-2b 2.6G | 24.83 | 30.97 |
| Gemma-7b 7.2G | 66.30 | 72.35 |
| Gemma-7b 6.2G | 33.62 | 72.35 |
| Gemma-7b 5.2G | 24.06 | 72.61 |
| Llama-2-7b 3.9G | 32.42 | 34.30 |
| Llama-2-7b 3.1G | 27.56 | 28.24 |
| Llama-2-7b 2.3G | 21.16 | 23.63 |
| Llama-3-8b 5.7G | 60.24 | 64.33 |
| Llama-3-8b 4.9G | 36.18 | 63.48 |
| Llama-3-8b 4.0G | 23.29 | 57.85 |
| Mistral-7b-v0.1 4.2G | 64.42 | 60.49 |
| Mistral-7b-v0.1 3.3G | 24.06 | 59.04 |
| Mistral-7b-v0.1 2.4G | 23.21 | 37.80 |
| Phi-2 1.8G | 57.42 | 62.46 |
| Phi-2 1.5G | 44.97 | 57.51 |
| Phi-2 1.2G | 24.49 | 47.87 |
The table below depicts the perplexity of the original models.
| Model | Perplexity |
| --- | --- |
| Gemma-2b 5.0G | 16.79 |
| Gemma-7b 17.1G | 14.67 |
| Llama-2-7b 13.5G | 8.40 |
| Llama-3-8b 16.1G | 11.61 |
| Mistral-7b-v0.1 15.0G | 10.50 |
| Phi-2 5.6G | 17.38 |
The table below depicts the perplexity of the quantized models.
| Model | GPTQ | picoLLM |
| --- | --- | --- |
| Gemma-2b 3.1G | 17.85 | 16.86 |
| Gemma-2b 2.9G | 24.11 | 16.86 |
| Gemma-2b 2.6G | 8377.74 | 16.86 |
| Gemma-7b 7.2G | 15.47 | 14.82 |
| Gemma-7b 6.2G | 27.29 | 14.84 |
| Gemma-7b 5.2G | 33370970.40 | 15.08 |
| Llama-2-7b 3.9G | 8.59 | 8.50 |
| Llama-2-7b 3.1G | 9.66 | 8.86 |
| Llama-2-7b 2.3G | 67.43 | 10.87 |
| Llama-3-8b 5.7G | 12.31 | 11.73 |
| Llama-3-8b 4.9G | 17.47 | 11.90 |
| Llama-3-8b 4.0G | 712.70 | 12.67 |
| Mistral-7b-v0.1 4.2G | 10.43 | 10.62 |
| Mistral-7b-v0.1 3.3G | 2909.83 | 10.81 |
| Mistral-7b-v0.1 2.4G | 1176.43 | 14.87 |
| Phi-2 1.8G | 18.15 | 17.76 |
| Phi-2 1.5G | 19.94 | 18.14 |
| Phi-2 1.2G | 76.55 | 20.22 |