Welcome to the LLaMA2 Inference and Quantization project! This repository provides tools and libraries for running and optimizing the LLaMA2 7B model for text generation. This guide will help you set up the environment, convert the model weights, and perform efficient inference with quantization support.
- Linux (macOS may also work)
- Python 3
- pip
- A C compiler (tested on gcc and clang)
- CUDA toolkit (nvcc) for optional GPU support
Run the following command to install all required Python packages:
```bash
pip install -r requirements.txt
```
Download the Meta LLaMA2 7B weights from here. We recommend the llama-2-7b-chat variant, which is fine-tuned for dialogue and generally produces better chat-style completions.
Create a `bin/` directory in the project root and move the `llama-2-7b-chat` directory and the `tokenizer.model` file into it.
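After this step, the project root should look roughly like this:

```
.
├── bin/
│   ├── llama-2-7b-chat/     # Meta checkpoint files
│   └── tokenizer.model
├── export.py
├── quantize.py
├── run.py
└── ...
```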
Export the model weights into our binary file format with `export.py`; run the following to see the available options:

```bash
python3 export.py --help
```
Note: This step requires roughly 28 GB of peak RAM.
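The actual binary layout is defined in export.py. Purely as an illustration of the idea (the header fields, tensor ordering, and helper names below are assumptions, not the project's real format), an export boils down to writing a small config header followed by the raw weight tensors:

```python
import struct

import numpy as np
import torch


def export_f32(state_dict: dict, config: dict, out_path: str) -> None:
    """Illustrative sketch only: dump a header plus flat float32 tensors."""
    with open(out_path, "wb") as f:
        # Hypothetical header: dim, hidden_dim, n_layers, n_heads, vocab_size.
        f.write(struct.pack("5i", config["dim"], config["hidden_dim"],
                            config["n_layers"], config["n_heads"],
                            config["vocab_size"]))
        for name, tensor in state_dict.items():
            # Every tensor is materialized as float32, which is why peak RAM
            # during export is on the order of the full model size (~28 GB).
            tensor.detach().to(torch.float32).cpu().numpy().tofile(f)
```

The real exporter will differ in its details; the `--help` output above lists its actual options.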
Compile the library used for quantization and model inference with the following commands:

```bash
mkdir build && cd build
cmake .. -DCUDA=ON   # with CUDA support
cmake ..             # without CUDA support
make
```
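The Python scripts call into this compiled library. As a hedged sketch of that pattern only (the artifact name libllama.so and any exported symbols are assumptions; check the build/ directory and the CMake configuration for the real names):

```python
import ctypes
from pathlib import Path

# Assumed artifact name; the real name depends on the CMake configuration.
lib_path = Path("build") / "libllama.so"
llama_lib = ctypes.CDLL(str(lib_path))

# Exported symbols are project-specific; a typical FFI call would look like:
# llama_lib.llama_init.argtypes = [ctypes.c_char_p]
# llama_lib.llama_init(b"bin/chat-llama.bin")
```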
You can now run text completion using the base model:
```bash
python3 run.py --bin=bin/chat-llama.bin "Richard Feynman was a "
```
Note: This can be slow. For better performance, quantize the model weights as described below.
We support 8-bit and 4-bit quantization to speed up inference. On the GPU, only the 4-bit forward pass is implemented; this keeps memory requirements low enough that the model runs on almost any GPU with around 6 GB of VRAM.
Run the quantization script; its options are listed with:

```bash
python3 quantize.py --help
```
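To give a feel for what quantization buys you, here is a minimal sketch of symmetric, group-wise 8-bit quantization. The group size and scaling scheme are assumptions for illustration; the project's actual scheme lives in quantize.py and quantize.c (the 4-bit path additionally packs two values per byte):

```python
import numpy as np


def quantize_q8(weights: np.ndarray, group_size: int = 64):
    """Symmetric 8-bit quantization with one float32 scale per group."""
    # Assumes the number of weights is divisible by group_size.
    flat = weights.astype(np.float32).reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)               # guard all-zero groups
    q = np.clip(np.round(flat / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)


def dequantize_q8(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float32 weights from int8 values and group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)
```

Storing int8 values plus one scale per group cuts weight memory to roughly a quarter of float32, which is where the speedup on memory-bound CPU and GPU inference comes from.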
- Adjust the text generation parameters, such as the sampling strategy (top_p, top_k, greedy) and the temperature, to improve output quality (see the sketch after this list).
- For prompt engineering, refer to the `chat.py` file.
- For a deeper understanding of LLaMA2, check out `model.py` for a simple PyTorch implementation, and explore the CUDA/C version in `/lib`.
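As a rough illustration of how these parameters interact (not the exact implementation in generate.py), temperature rescales the logits before sampling, while top-k and top-p restrict the candidate set:

```python
import numpy as np


def sample_next_token(logits: np.ndarray, temperature: float = 0.8,
                      top_k: int = 0, top_p: float = 0.0) -> int:
    """Pick a token id from raw logits; temperature=0 falls back to greedy."""
    if temperature == 0.0:
        return int(np.argmax(logits))
    probs = np.exp((logits - logits.max()) / temperature)   # stable softmax
    probs /= probs.sum()
    if top_k > 0:                      # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()
    if top_p > 0.0:                    # nucleus: smallest set with mass >= top_p
        order = np.argsort(probs)[::-1]
        keep = order[: int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Lower temperatures and smaller top_k / top_p values make the output more deterministic; higher values make it more diverse.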
- tok_embeddings (vocab_size, embedding_dim) = (32000, 4096)
- attention_norm (embedding_dim) = (4096)
- attention.wq (embedding_dim, embedding_dim) = (4096, 4096)
- attention.wk (embedding_dim, embedding_dim) = (4096, 4096)
- attention.wv (embedding_dim, embedding_dim) = (4096, 4096)
- attention.wo (embedding_dim, embedding_dim) = (4096, 4096)
- feed_forward.norm (embedding_dim) = (4096)
- feed_forward.w1 (embedding_dim, hidden_dim) = (4096, 11008)
- feed_forward.w3 (embedding_dim, hidden_dim) = (4096, 11008)
(w1 and w3 are both applied to the input and the results are multiplied element-wise)
- feed_forward.w2 (hidden_dim, embedding_dim) = (11008, 4096)
- norm (embedding_dim) = (4096)
- output (embedding_dim, vocab_size) = (4096, 32000)
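The feed_forward shapes above correspond to LLaMA2's SwiGLU block: w1 and w3 both project from 4096 to 11008, their outputs are multiplied element-wise, and w2 projects back down. A minimal PyTorch sketch (class and attribute names are illustrative, chosen to match the listed shapes rather than the project's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """SwiGLU feed-forward: w2( silu(w1(x)) * w3(x) ), shapes as listed above."""

    def __init__(self, dim: int = 4096, hidden_dim: int = 11008):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # w1 and w3 are both applied to the input, then combined element-wise.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```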
- generate.py: Contains the main function for text generation using different sampling methods (greedy, top_p, top_k).
- qllama.py: Defines the `Runtime` classes for the various quantization levels (F32, Q8, Q4) and includes GPU support.
- llama_utils.py: Provides utility functions to load the model and tokenizer (a hypothetical usage sketch follows this list).
- quantize.py: Implements the quantization process for the model weights.
- run.py: Script to run the text generation using the quantized or non-quantized models.
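Taken together, these scripts follow a load-then-generate flow. The function names below are hypothetical placeholders used only to show that flow, not the project's real API; the actual entry points live in llama_utils.py, generate.py, and run.py:

```python
# Hypothetical glue code: names are placeholders, not the project's real API.
from llama_utils import load_model, load_tokenizer   # assumed helpers
from generate import generate                        # assumed entry point

model = load_model("bin/chat-llama.bin")              # quantized or F32 weights
tokenizer = load_tokenizer("bin/tokenizer.model")
print(generate(model, tokenizer, "Richard Feynman was a ",
               method="top_p", temperature=0.8))
```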
- llama.c: Core implementation for initializing and running the LLaMA model, supporting both full precision and quantized versions.
- ops.c: Contains various mathematical operations used in model inference and quantization.
- quantize.c: Handles the quantization process, converting model weights to 8-bit or 4-bit precision.
- cu-llama.c: CUDA-specific implementations to run the model on GPU, including memory management and kernel launches.
- cu-ops.c: CUDA operations for quantization, dequantization, matrix-vector multiplications, and other neural network operations.
We hope you find this guide helpful! Happy experimenting with LLaMA2!