# ZipNN and Hugging Face Integration

## Compress and Upload a Model to Hugging Face

1. Fork the model's Hugging Face repo and, from inside your local clone of the fork, wire up the original repo as `upstream` (adapted from the documentation):

   ```bash
   git lfs install --skip-smudge --local &&
   git remote add upstream git@hf.co:ibm-granite/granite-7b-instruct &&
   git fetch upstream &&
   git lfs fetch --all upstream
   ```

   - If you want to completely override the fork history (which should only have an initial commit), run:

     ```bash
     git reset --hard upstream/main &&
     git lfs pull upstream
     ```

   - If you want to rebase instead of overriding, run the following command and resolve any conflicts:

     ```bash
     git rebase upstream/main &&
     git lfs pull upstream
     ```
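   The commands above are meant to be run inside a local clone of your fork. If you have not cloned it yet, a minimal sketch of that prerequisite step (with `<your-username>` as a placeholder for your Hugging Face account):

   ```bash
   # Hypothetical fork location; substitute your own account name.
   git clone git@hf.co:<your-username>/granite-7b-instruct
   cd granite-7b-instruct
   ```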
2. Compress all the model weights. First download the scripts for compressing/decompressing AI models, then run the compression script over every `.safetensors` file in the repo:

   ```bash
   wget -i https://raw.githubusercontent.com/zipnn/zipnn/main/scripts/scripts.txt &&
   rm scripts.txt
   python3 zipnn_compress_path.py safetensors --path .
   ```
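   For context, the script losslessly round-trips each weights file through ZipNN's byte-level codec. Below is a minimal sketch of that idea, assuming the library's `ZipNN` class with default settings; the shard filename is hypothetical, and the exact constructor options may differ (see the zipnn docs):

   ```python
   from zipnn import ZipNN

   zpn = ZipNN()  # assumed default construction; consult the zipnn docs for tuning

   # Hypothetical shard name, for illustration only.
   with open("model-00001-of-00003.safetensors", "rb") as f:
       original = f.read()

   # Compress the raw bytes and write the .znn file next to the original.
   compressed = zpn.compress(original)
   with open("model-00001-of-00003.safetensors.znn", "wb") as f:
       f.write(compressed)

   # The codec is lossless: decompressing restores the exact original bytes.
   assert zpn.decompress(compressed) == original
   ```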
3. Add the compressed weights to git-lfs tracking and correct the index JSON:

   ```bash
   git lfs track "*.znn" &&
   sed -i 's/.safetensors/.safetensors.znn/g' model.safetensors.index.json &&
   git add *.znn .gitattributes model.safetensors.index.json &&
   git rm *.safetensors
   ```
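   The `sed` one-liner matches `.safetensors` anywhere in the file; if you prefer a more targeted edit, here is a sketch in Python that rewrites only the `weight_map` values of the standard sharded-index layout:

   ```python
   import json

   # Load the sharded-weights index.
   with open("model.safetensors.index.json") as f:
       index = json.load(f)

   # Point each tensor at its compressed .znn shard.
   index["weight_map"] = {
       tensor: shard + ".znn" if shard.endswith(".safetensors") else shard
       for tensor, shard in index["weight_map"].items()
   }

   with open("model.safetensors.index.json", "w") as f:
       json.dump(index, f, indent=2)
   ```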
4. Done! Now push the changes as per the documentation:

   ```bash
   git lfs install --force --local && # this reinstalls the LFS hooks
   huggingface-cli lfs-enable-largefiles . && # needed if some files are bigger than 5GB
   git push --force origin main
   ```
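To verify the migration, you can confirm that every compressed shard is tracked by LFS rather than stored as a regular git blob:

```bash
git lfs ls-files   # every .znn shard should be listed here
git lfs status
```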

To use the model, simply run our ZipNN Hugging Face method before proceeding as normal:

```python
from zipnn import zipnn_hf

zipnn_hf()

# Load the model from your compressed Hugging Face model card as you normally would
...
```

## Download Compressed Models from Hugging Face

In this example, we show how to use the compressed ibm-granite/granite-7b-instruct model hosted on Hugging Face.

First, make sure you have ZipNN installed:

```bash
pip install zipnn
```

To run the model, simply add `zipnn_hf()` at the beginning of the file, and it will take care of decompression for you. By default, the model remains compressed in your local storage, decompressing quickly on the CPU only during loading.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from zipnn import zipnn_hf

zipnn_hf()

tokenizer = AutoTokenizer.from_pretrained("royleibov/granite-7b-instruct-ZipNN-Compressed")
model = AutoModelForCausalLM.from_pretrained("royleibov/granite-7b-instruct-ZipNN-Compressed")
```

Alternatively, you can save the model uncompressed on your local storage. This way, future loads won’t require a decompression phase.

```python
zipnn_hf(replace_local_file=True)
```

To compress and decompress manually, simply run:

```bash
python zipnn_compress_path.py safetensors --model royleibov/granite-7b-instruct-ZipNN-Compressed --hf_cache
python zipnn_decompress_path.py --model royleibov/granite-7b-instruct-ZipNN-Compressed --hf_cache
```
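If you would rather work on a decompressed copy outside the Hugging Face cache, one possible approach (a sketch: the `local_dir` target is hypothetical, and the decompression script is assumed to accept the same `--path` flag as its compression counterpart above) is to download the compressed snapshot with `huggingface_hub` and decompress it in place:

```python
from huggingface_hub import snapshot_download

# Download the compressed snapshot into a local folder of your choosing.
snapshot_download(
    "royleibov/granite-7b-instruct-ZipNN-Compressed",
    local_dir="granite-7b-compressed",  # hypothetical target folder
)
```

```bash
python zipnn_decompress_path.py --path granite-7b-compressed
```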

You can try other state-of-the-art compressed models from the continually updated list below:

### ZipNN Compressed Models Hosted on Hugging Face

- compressed FacebookAI/roberta-base
- compressed meta-llama/Llama-3.2-11B-Vision-Instruct
- compressed ibm-granite/granite-3.0-8b-instruct
- compressed openai/clip-vit-base-patch16
- compressed jonatasgrosman/wav2vec2-large-xlsr-53-english
- compressed mistral-community/pixtral-12b
- compressed meta-llama/Meta-Llama-3.1-8B-Instruct
- compressed Qwen/Qwen2-VL-7B-Instruct
- compressed ai21labs/Jamba-v0.1
- compressed upstage/solar-pro-preview-instruct
- compressed microsoft/Phi-3.5-mini-instruct
- compressed ibm-granite/granite-7b-instruct
- compressed ibm-granite/granite-3b-code-base-128k

You can also try one of these Python notebooks hosted on Kaggle: granite 3b, Llama 3.2, phi 3.5.

Click here to explore other examples of compressed models hosted on Hugging Face.