A Framework for Training, Fine-Tuning, and Deploying Smaller LLMs on Custom Datasets
- Website: TinyLLM.org
- ArXiv: arXiv:2412.15304
TinyLLM is a lightweight and customizable framework for efficiently training, fine-tuning, and deploying small-scale Large Language Models (LLMs) on custom datasets. It is optimized for resource-constrained environments, making it ideal for applications on edge devices and IoT platforms.
TinyLLM demonstrates adaptability across various datasets, particularly in embedded sensing tasks like hand gesture detection, robot localization, and breathing rate detection. The framework enables training smaller models for diverse domains and is not limited to embedded sensing.
Dependency Setup:
- Python Dependencies:
  - Install the Python libraries required for dataset processing and fine-tuning:
        pip install -r requirements.txt
  - Install a suitable version of PyTorch for your hardware (a quick sanity check is sketched below).
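As an optional sanity check (not part of the TinyLLM scripts), you can confirm that the installed PyTorch build sees your accelerator before proceeding. A minimal sketch:

```python
# Optional sanity check: verify the PyTorch install and GPU visibility.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```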
Dataset Preparation:
- Navigate to the datasets folder:
      cd Datasets/
- Tokenize datasets using `encode.py`:
  - Supports user-provided custom datasets (in CSV format) or datasets hosted on HuggingFace.
  - By default, the script processes the Fineweb dataset (10-billion-token variant, auto-downloaded) and the SHL IoT sensor dataset.
  - Follow the instructions here to download the SHL dataset.
  - Update the `datasets_to_tokenize` parameter in `encode.py` for custom datasets (an illustrative tokenization sketch follows this list).
  Then run:
      python encode.py
- Rename the tokenized datasets for clarity (e.g., `Fineweb`, `SHL`).
- Split datasets using `split.py`:
      python split.py -d1 0.3 -d2 0.7 -o ./pretraining_data
  - Default parameters produce a dataset with 9 billion tokens and a Training:Validation split of 98:2 with 100 MB shards.
  - Adjust parameters or shard size if faced with memory constraints.
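For intuition about what the tokenization step produces, the sketch below encodes a custom CSV with the GPT-2 BPE via `tiktoken`. It is not the `encode.py` implementation: the file name `my_data.csv`, the `text` column, and the raw `uint16` output layout are illustrative assumptions only; for real runs, use `encode.py` as described above.

```python
# Illustrative sketch of tokenizing a custom CSV dataset with the GPT-2 BPE.
# The column name "text" and the plain uint16 .bin layout are assumptions.
import numpy as np
import pandas as pd
import tiktoken

enc = tiktoken.get_encoding("gpt2")
df = pd.read_csv("my_data.csv")      # hypothetical custom dataset
eot = enc.eot_token                  # end-of-text separator between documents

tokens = []
for doc in df["text"]:
    tokens.append(eot)
    tokens.extend(enc.encode_ordinary(str(doc)))

np.array(tokens, dtype=np.uint16).tofile("custom_dataset.bin")
print(f"wrote {len(tokens)} tokens")
```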
Pre-training:
- Navigate to the `llm.c` folder:
      cd ../llm.c/
- Begin pre-training with:
      ./train_gpt2cu \
          -i "Datasets/pretraining_data/train*.bin" \
          -j "Datasets/pretraining_data/val*.bin" \
          -o "custom_model" \
          -e "d6" \
          -b 64 -t 1024 \
          -d 524288 \
          -r 1 \
          -z 1 \
          -c 0.1 \
          -l 0.0006 \
          -q 0.0 \
          -u 700 \
          -n 10000 \
          -v 250 -s 20000 \
          -h 1
- Key Flags:
  - `-e`: Model depth (e.g., `d6`, `d12`).
  - `-o`: Output directory for the trained model.
  - `-y 1`: Resume from the last checkpoint.
  - Use `-x` for multiple epochs.
  - The full list of flags and descriptions is here.
- Export the model in a HuggingFace-compatible format (a quick loading check is sketched below):
      lf=$(ls custom_model/model_0*.bin | sort -V | tail -n 1)
      python dev/eval/export_hf.py -i "$lf" -o "custom_model_hf"
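To confirm the export succeeded, the directory can be loaded with the transformers library. This is a generic check rather than a TinyLLM utility, and it assumes `export_hf.py` wrote a standard GPT-2-style checkpoint together with tokenizer files into `custom_model_hf`:

```python
# Quick check that the exported directory loads as a HuggingFace causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "custom_model_hf"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

inputs = tokenizer("The sensor reported", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```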
Fine-tuning:
- Navigate to the `Fine-tune` folder:
      cd ../Fine-tune/
- Set parameters in the respective model's parameter file (e.g., `p-gpt.txt`).
- Run the fine-tuning script:
      python master.py \
          -d "breathe" \
          -m "../llm.c/custom_model_hf" \
          -n "gpt2" \
          -p "p-gpt.txt" | tee ft_output.log
  - `-d`: Dataset name (e.g., `breathe`, `gesture`).
  - `-m`: Path to the pre-trained model.
  - `-n`: Model name (`gpt2`, `llama`, `phi`).
- Results:
  - Training and evaluation loss plots are saved in `results/{model}/{dataset}/loss.pdf`. To view the loss values in the terminal:
        cat "results/GPT 2/breathe-0/loss.txt"
  - Testing data can be viewed directly or processed for analysis (a perplexity-check sketch follows this list).
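Beyond the saved plots, a quick numeric probe can be run against the fine-tuned directory. This is a generic sketch, not part of `master.py`; the output path and the sample text are placeholders to adapt to your run:

```python
# Compute loss and perplexity of the fine-tuned model on a sample snippet.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "results/GPT 2/breathe-0/"     # placeholder: adjust to your run
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)
model.eval()

sample = "Your held-out evaluation text here"
enc = tokenizer(sample, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])
print(f"loss: {out.loss.item():.3f}  perplexity: {torch.exp(out.loss).item():.1f}")
```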
Inference and Deployment:
- Use HuggingFace's transformers library for inference:
      from transformers import pipeline
      import torch

      path = "./TinyLLM/Fine-tune/results/GPT 2/breathe-0/"
      generator = pipeline("text-generation", model=path, max_new_tokens=30,
                           repetition_penalty=1.3, device_map="auto")
      prompt = "Your input text here"
      print(generator(prompt)[0]['generated_text'])
- Convert the model to GGUF format for embedded devices:
      cd ../llama.cpp/
      python convert_hf_to_gguf.py "../Fine-tune/results/GPT 2/breathe-0/" --outfile "../Fine-tune/results/GPT 2/breathe-0/model.gguf"
- Use the model with llama.cpp (a Python alternative via llama-cpp-python is sketched after this list):
      ./llama-cli -m "../Fine-tune/results/GPT 2/breathe-0/model.gguf" -n 10 -p "Your input prompt"
- Optionally, quantize the model for optimized inference (details).
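If you would rather drive the GGUF file from Python than the `llama-cli` binary, the `llama-cpp-python` bindings (an additional dependency, not part of the TinyLLM scripts) can load the same model. A minimal sketch, assuming `pip install llama-cpp-python`:

```python
# Optional: run the converted GGUF model via the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(model_path="../Fine-tune/results/GPT 2/breathe-0/model.gguf")
out = llm("Your input prompt", max_tokens=10)
print(out["choices"][0]["text"])
```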
Notes:
- Currently, three in-house processed datasets (gesture detection, localisation, and breathing detection) are provided for fine-tuning, apart from `swim`, which has to be processed (find more about using the dataset here). More information about the in-house datasets will be added here soon.
- The checkpoints created during fine-tuning can be removed later to save space (see the sketch below).
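As an example of reclaiming space, intermediate checkpoints can be deleted once the final weights are confirmed. The `checkpoint-*` naming below is an assumption (it matches the HuggingFace Trainer default) and may not match your runs, so verify the paths before deleting:

```python
# Remove intermediate fine-tuning checkpoints to reclaim disk space.
# The "checkpoint-*" pattern is an assumption; verify paths before deleting.
import glob
import shutil

for ckpt in glob.glob("results/*/*/checkpoint-*"):
    print("removing", ckpt)
    shutil.rmtree(ckpt)
```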
We welcome contributions to TinyLLM! Visit our HuggingFace page for pre-trained models on web and sensor data.
Thank you to the creators of llm.c and llama.cpp for their groundbreaking tools.
We also acknowledge the use of external datasets: