TinyLLM

A Framework for Training, Fine-Tuning, and Deploying Smaller LLMs on Custom Datasets

Introduction

TinyLLM is a lightweight and customizable framework for efficiently training, fine-tuning, and deploying small-scale Large Language Models (LLMs) on custom datasets. It is optimized for resource-constrained environments, making it ideal for applications on edge devices and IoT platforms.

TinyLLM demonstrates adaptability across various datasets, particularly in embedded sensing tasks like hand gesture detection, robot localization, and breathing rate detection. The framework enables training smaller models for diverse domains and is not limited to embedded sensing.

Installation

Prerequisites

  1. Dependency Setup:

    • Configure llm.c (GPU setup required). Follow the instructions here.
    • Set up llama.cpp. Refer to the build guide here.
  2. Python Dependencies:

    • Install necessary Python libraries for dataset processing and fine-tuning:
      pip install -r requirements.txt
    • Install a suitable version of PyTorch.
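
Since llm.c pre-training requires a working GPU setup, it can be worth sanity-checking the PyTorch installation before moving on. This is an optional check, not part of the framework itself:

    import torch

    # Pre-training with llm.c expects a CUDA-capable GPU to be visible.
    print(torch.__version__)
    print(torch.cuda.is_available())  # should print True before starting pre-training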

Usage

1. Preparing Pre-training Datasets

  1. Navigate to the datasets folder:

    cd Datasets/
  2. Tokenize datasets using encode.py:

    • Supports user-provided custom datasets (in CSV format) or datasets hosted on HuggingFace.
    • By default, the script processes the Fineweb dataset (10 billion tokens variant, auto-downloaded) and the SHL IoT sensor dataset.
    • Follow the instructions here to download the SHL dataset.
    • Update the datasets_to_tokenize parameter in encode.py for custom datasets (a hypothetical example is sketched after this list).
      python encode.py
  3. Rename tokenized datasets for clarity (e.g., Fineweb, SHL).

  4. Split datasets using split.py:

    python split.py -d1 0.3 -d2 0.7 -o ./pretraining_data
    • Default parameters produce a dataset with 9 billion tokens and a Training:Validation split of 98:2 with 100MB shards.
    • Adjust parameters or shard size if faced with memory constraints.
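
The exact value expected by datasets_to_tokenize is defined in encode.py itself; as a rough sketch, assuming it takes a list of HuggingFace dataset identifiers and/or local CSV paths, a custom configuration might look like the following (the entries are hypothetical):

    # Hypothetical example -- check encode.py for the exact structure it expects.
    # Entries are assumed to be HuggingFace dataset IDs or paths to local CSV files.
    datasets_to_tokenize = [
        "HuggingFaceFW/fineweb",      # hosted dataset (the 10B-token variant is the default)
        "./my_sensor_dataset.csv",    # user-provided custom dataset in CSV format
    ]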

2. Pre-training the Model

  1. Navigate to the llm.c folder:

    cd ../llm.c/
  2. Begin pre-training with:

    ./train_gpt2cu \
        -i "Datasets/pretraining_data/train*.bin" \
        -j "Datasets/pretraining_data/val*.bin" \
        -o "custom_model" \
        -e "d6" \
        -b 64 -t 1024 \
        -d 524288 \
        -r 1 \
        -z 1 \
        -c 0.1 \
        -l 0.0006 \
        -q 0.0 \
        -u 700 \
        -n 10000 \
        -v 250 -s 20000 \
        -h 1
  3. Key Flags:

    • -e: Model depth (e.g., d6, d12).
    • -o: Output directory for the trained model.
    • -y 1: Resume from the last checkpoint.
    • Use -x for multiple epochs.
    • The full list of flags and their descriptions is available here.
  4. Export the model in HuggingFace-compatible format:

    # Select the latest checkpoint written during pre-training
    lf=$(ls custom_model/model_0*.bin | sort -V | tail -n 1)
    python dev/eval/export_hf.py -i "$lf" -o "custom_model_hf"
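
Before fine-tuning, the exported checkpoint can optionally be sanity-checked with HuggingFace's transformers library. This is a sketch rather than part of the framework; it assumes the custom_model_hf directory is directly loadable and, if no tokenizer files were exported, that the standard GPT-2 tokenizer applies:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("custom_model_hf")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2 tokenizer matches the export

    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))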

3. Fine-tuning the Model

  1. Navigate to the Fine-tune folder:

    cd ../Fine-tune/
  2. Set parameters in the respective model's parameter file (e.g., p-gpt.txt).

  3. Run the fine-tuning script:

    python master.py \
        -d "breathe" \
        -m "../llm.c/custom_model_hf" \
        -n "gpt2" \
        -p "p-gpt.txt" | tee ft_output.log
    • -d: Dataset name (e.g., breathe, gesture).
    • -m: Path to the pre-trained model.
    • -n: Model name (gpt2, llama, phi).
  4. Results:

    • Training and evaluation loss plots are saved to results/{model}/{dataset}/loss.pdf. To view the logged loss values in the terminal:
         cat "results/GPT 2/breathe-0/loss.txt"
    • Testing data can be viewed directly or processed for analysis.
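
To process the logged values yourself rather than relying on the generated loss.pdf, a short script along the following lines can be used. The column layout of loss.txt is an assumption here (one step and one loss value per line) and may need adjusting to the actual file:

    import matplotlib.pyplot as plt

    steps, losses = [], []
    with open("results/GPT 2/breathe-0/loss.txt") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            try:
                steps.append(float(parts[0]))
                losses.append(float(parts[1]))
            except ValueError:
                continue  # skip headers or non-numeric lines

    plt.plot(steps, losses)
    plt.xlabel("step")
    plt.ylabel("loss")
    plt.savefig("loss_replot.pdf")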

4. Inferencing the Model

  1. Use HuggingFace's transformers library for inference:

    from transformers import pipeline
    import torch

    # Path to a fine-tuned checkpoint produced by the fine-tuning step
    path = "./TinyLLM/Fine-tune/results/GPT 2/breathe-0/"
    # device_map="auto" places the model on a GPU when one is available (requires accelerate)
    generator = pipeline("text-generation", model=path, max_new_tokens=30, repetition_penalty=1.3, device_map="auto")
    prompt = "Your input text here"
    print(generator(prompt)[0]['generated_text'])
  2. Convert the model to GGUF format for embedded devices:

    cd ../llama.cpp/
    python convert_hf_to_gguf.py "../Fine-tune/results/GPT 2/breathe-0/" --outfile "../Fine-tune/results/GPT 2/breathe-0/model.gguf"
  3. Use the model with llama.cpp:

    ./llama-cli -m "../Fine-tune/results/GPT 2/breathe-0/model.gguf" -n 10 -p "Your input prompt"
  4. Optionally, quantize the model for optimized inference (details).
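
As an alternative to the llama-cli binary, the GGUF model can also be driven from Python on the target device via the separate llama-cpp-python bindings (pip install llama-cpp-python; not part of this repository). A minimal sketch:

    from llama_cpp import Llama

    # Load the converted GGUF model with a small context window
    llm = Llama(model_path="../Fine-tune/results/GPT 2/breathe-0/model.gguf", n_ctx=512)
    out = llm("Your input prompt", max_tokens=10)
    print(out["choices"][0]["text"])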

Notes

  • Currently, three in-house processed datasets (gesture detection, localisation, and breathing detection) are provided for fine-tuning. The swim dataset must be processed separately (find out more about using it here). More information about the in-house datasets will be added here soon.
  • The checkpoints created during the fine-tuning process can be removed later to save space.

Contributing

We welcome contributions to TinyLLM! Visit our HuggingFace page for pre-trained models on web and sensor data.

Acknowledgments

Thank you to the creators of llm.c and llama.cpp for their groundbreaking tools.

We also acknowledge the use of external datasets: