⚡ Lit-LLaMA

Independent implementation of LLaMA that is fully open source under the Apache 2.0 license.

This implementation builds on nanoGPT.

The original LLaMA weights are distributed by Meta under a research-only license.

New Apache 2.0 licensed weights are being released as part of the Open LLaMA project. Both can be loaded in Lit-LLaMA.

Why?

We believe that AI should be fully open source and part of the collective knowledge.

The original LLaMA code is GPL licensed, which means any project using it must also be released under the GPL.

This "taints" any other code and prevents integration with the rest of the ecosystem.

Lit-LLaMA solves that for good.

Design principles

Lit-LLaMA is:

Simple: Single-file implementation without boilerplate.
Correct: Numerically equivalent to the original model.
Optimized: Runs on consumer hardware or at scale.
Open-source: No strings attached.

Get involved!

Join our Discord to build high-performance, truly open-source models for the common benefit of the community.

Setup

Clone the repo

git clone https://github.com/Lightning-AI/lit-llama
cd lit-llama

Install dependencies

pip install -r requirements.txt

You are all set! 🎉

Use the model

To generate text predictions, you need to download the model weights. If you don't have them, check out our guide.

Run inference:

python generate.py --prompt "Hello, my name is"

This will run the 7B model and require ~26 GB of GPU memory (A100 GPU).

Full guide for generating samples from the model.

Run Lit-LLaMA on consumer devices

On GPUs with bfloat16 support, the generate.py script automatically converts the weights and consumes about 14 GB. For GPUs with less memory, or ones that don't support bfloat16, enable quantization (--quantize llm.int8):

python generate.py --quantize llm.int8 --prompt "Hello, my name is"

See python generate.py --help for more options.
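The choice between these modes can be sketched as a small helper. This is only an illustration: `pick_generate_flags` is a hypothetical function, not part of the repository, and the 14 GB threshold is simply the approximate figure quoted above.

```python
from typing import List

def pick_generate_flags(free_gpu_gb: float, supports_bf16: bool) -> List[str]:
    """Return extra generate.py flags for a given GPU (hypothetical helper)."""
    # ~14 GB is the approximate bfloat16 footprint quoted above for the 7B model.
    bf16_footprint_gb = 14.0
    if supports_bf16 and free_gpu_gb >= bf16_footprint_gb:
        # generate.py converts the weights to bfloat16 on its own, so no flag is needed.
        return []
    # Less memory, or no bfloat16 support: fall back to 8-bit quantization.
    return ["--quantize", "llm.int8"]

flags = pick_generate_flags(free_gpu_gb=11.0, supports_bf16=True)
print(" ".join(["python", "generate.py", *flags, "--prompt", '"Hello, my name is"']))
```

On a machine with PyTorch installed, `torch.cuda.is_bf16_supported()` and `torch.cuda.mem_get_info()` are one way to obtain the two inputs.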

You can also use GPTQ-style int4 quantization, but this needs conversions of the weights first:

python quantize/gptq.py --checkpoint_path lit-llama.pth --tokenizer_path tokenizer.model --output_path llama-7b-gptq.4bit.pth --dtype bfloat16 --quantize gptq.int4

With the quantized checkpoint in place, generation works as usual with --quantize gptq.int4, bringing GPU memory usage down to about 5 GB. Because only the weights of the Linear layers are quantized, it is still useful to pass --dtype bfloat16 when quantization is enabled.
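The memory figures quoted in this README follow roughly from the size of the weights alone. Below is a back-of-the-envelope sketch, assuming about 6.7 billion parameters for the 7B model (a figure not stated in this README) and ignoring activations, the KV cache, unquantized layers, and quantization metadata, so real usage is somewhat higher:

```python
# Approximate bytes per weight for each precision mentioned above.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "llm.int8": 1.0,
    "gptq.int4": 0.5,
}

def weight_memory_gib(n_params: float, precision: str) -> float:
    """Estimate the memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30

N_PARAMS_7B = 6.7e9  # assumption: rough parameter count of the 7B model

for precision in BYTES_PER_PARAM:
    print(f"{precision:>10}: {weight_memory_gib(N_PARAMS_7B, precision):5.1f} GiB")
```

Under these assumptions the weights come to roughly 25 GiB in float32 and 12.5 GiB in bfloat16, which is in line with the ~26 GB and ~14 GB figures once runtime overhead is added.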

Full guide for generating samples from the model.

Finetune the model

We provide simple training scripts in finetune/lora.py and finetune/adapter.py that instruction-tune a pretrained model on the Alpaca dataset using LoRA and Adapter, respectively.

Download the data and generate an instruction-tuning dataset:
```
python scripts/prepare_alpaca.py
```
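For orientation, an instruction-tuning dataset of this kind is built from records with `instruction`, `input`, and `output` fields, each turned into a prompt. The sketch below assumes the prompt template published with the Stanford Alpaca project; the function name `generate_prompt` and the exact wording are assumptions here, so check `scripts/prepare_alpaca.py` for what the script actually does.

```python
def generate_prompt(example: dict) -> str:
    """Build an Alpaca-style prompt from one dataset record (illustrative)."""
    if example.get("input"):
        # Records with extra context get an additional "Input" section.
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:"
    )

record = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(generate_prompt(record))
```

During finetuning, the model is trained to continue such a prompt with the record's `output` text.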

Run the finetuning script:

python finetune/lora.py

or

python finetune/adapter.py

This step assumes you have downloaded the pretrained weights as described above. Finetuning requires at least one GPU with ~24 GB of memory (RTX 3090). Follow the instructions in the script to fit your GPU memory efficiently. Note: For some GPU models you might need to set torch.backends.cuda.enable_flash_sdp(False) (see comments at the top of the script).

More details about each finetuning method and how you can apply it to your own data can be found in our technical how-to guides.
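To make the idea behind LoRA concrete: instead of updating a full weight matrix W, LoRA trains two small matrices A and B and adds their scaled product to the frozen W. The sketch below uses plain Python lists to stay dependency-free. The names, shapes, and the `alpha / r` scaling follow the LoRA paper rather than this repository's code, which may differ in detail.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w: frozen weight, shape (out, in); b: (out, r); a: (r, in).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(out_features: int, in_features: int, r: int) -> int:
    """Parameters trained by LoRA for one linear layer: r * (out + in)."""
    return r * (out_features + in_features)

# Tiny example: a 2x3 weight with a rank-1 update.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]       # shape (out=2, r=1)
a = [[0.5, 0.0, 1.0]]    # shape (r=1, in=3)
print(lora_effective_weight(w, a, b, alpha=2.0))

# A 4096 x 4096 layer has 16,777,216 weights; a rank-8 update trains far fewer.
print(trainable_params(4096, 4096, r=8))
```

Because only A and B receive gradients, optimizer state and gradient memory shrink accordingly, which is what makes finetuning on a single 24 GB GPU plausible.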

Finetuning How-To Guides

These technical tutorials illustrate how to run the finetuning code.

Finetune with LoRA
Finetune with Adapters

Understanding Finetuning -- Conceptual Tutorials

Looking for conceptual tutorials and explanations? We have some additional articles below:

Understanding Parameter-Efficient Finetuning of Large Language Models: From Prefix Tuning to LLaMA-Adapters

Pre-training

We provide a simple training script based on Fabric if you want to venture into pre-training on RedPajama, a reproduction of the original LLaMA dataset. Conversion scripts for our optimized streaming PackedDataset are included.

Follow this guide to start pre-training on the RedPajama dataset:

Pretrain on RedPajama
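The general idea behind a packed dataset is that tokenized documents are concatenated into one stream and cut into fixed-length blocks, so no compute is wasted on padding. The sketch below illustrates that idea only. It is not the repository's `PackedDataset` implementation, and `pack_tokens`, the separator token id, and the choice to drop the final partial block are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List

def pack_tokens(docs: Iterable[List[int]], block_size: int, sep_id: int) -> Iterator[List[int]]:
    """Yield fixed-size blocks from a stream of tokenized documents."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        buffer.append(sep_id)  # mark the document boundary
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than block_size are dropped in this sketch.

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
for block in pack_tokens(docs, block_size=4, sep_id=0):
    print(block)
```

Since the function is a generator, blocks are produced lazily, which mirrors how a streaming dataset avoids loading the whole corpus into memory.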

Get involved!

We are on a quest towards fully open source AI.

Join us and start contributing, especially in the following areas:

Pre-training
Fine-tuning (full and LoRA)
Quantization
Sparsification

Look at train.py for a starting point towards pre-training / fine-tuning using Lightning Fabric.

We welcome all individual contributors, regardless of their level of experience or hardware. Your contributions are valuable, and we are excited to see what you can accomplish in this collaborative and supportive environment.

Unsure about contributing? Check out our guide, "Contributing to Lit-LLaMA: A Hitchhiker’s Guide to the Quest for Fully Open-Source AI".

Don't forget to join our Discord!

Acknowledgements

@karpathy for nanoGPT
@FacebookResearch for the original LLaMA implementation
@TimDettmers for bitsandbytes
@Microsoft for LoRA
@IST-DASLab for GPTQ

License

Lit-LLaMA is released under the Apache 2.0 license.
