This repo is motivated by Sebastian Raschka's book "Build a Large Language Model from Scratch".
Thanks to Lightning Studios for providing access to a GPU and a great dev environment.
Parameter breakdown:

```
  | Name             | Type       | Params | Mode
--------------------------------------------------------
0 | model            | GPT        | 407 M  | train
1 | lora_module_list | ModuleList | 786 K  | train
--------------------------------------------------------
786 K     Trainable params
406 M     Non-trainable params
407 M     Total params
```
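The numbers above come from Lightning's model summary; the same percentage can also be checked by hand with a snippet along these lines (`lit_module` is just a placeholder name for the LightningModule):

```python
def report_trainable(lit_module):
    # Count parameters that will actually receive gradients (the LoRA adapters).
    trainable = sum(p.numel() for p in lit_module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in lit_module.parameters())
    print(f"{trainable:,} trainable / {total:,} total ({100 * trainable / total:.2f}%)")
```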
I freeze the GPT2-medium model parameters and only train the LoRA adapters. The resulting checkpoints are small, at 4.7 MB including the LoRA weights and optimiser states, when training in `bfloat16`. For comparison, the GPT2-medium checkpoint itself is 1.5 GB. The LoRA parameters account for 0.19% of all 407M parameters.
- I did everything in Lightning, which makes running experiments much more robust and customisable.
- I added an attention padding mask for the input tokens, setting the attention from any proper token to a padding token to 0.
- I trained the model in `bfloat16` for more speed and lower memory consumption.
- I implemented my own `LoRA` fine-tuning utilities for parameter-efficient fine-tuning, rather than updating the weights of the entire model (see the sketch after this list).
- I used Weights & Biases for experiment tracking.
- I deployed the model to a Gradio app using Docker.
- I tested the model inference using Flask and Docker.
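The repo's own LoRA utilities are not reproduced here; the following is just a minimal sketch of the idea they implement, assuming a wrapper around a frozen `nn.Linear` (the names `LoRALinear`, `rank`, and `alpha` are illustrative, not the repo's API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / rank) * B A x, with W frozen."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(linear.out_features, rank))  # zero init => no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank update; only lora_a / lora_b receive gradients.
        return self.linear(x) + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T
```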
- Build the image:

  ```bash
  docker build -t gradio-image -f api_server/Dockerfile-gradio .
  ```

- Run the container:

  ```bash
  docker run -p 5000:5000 --name gradio-cont-1 gradio-image
  ```

- Navigate to `localhost:5000` and start interacting with the app:
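The app code itself lives under `api_server/`; purely as a rough illustration, a Gradio app serving the model on port 5000 could look something like this (the `generate_response` function and its arguments are assumptions, not the repo's actual code):

```python
import gradio as gr

def generate_response(instruction: str, temperature: float = 0.0) -> str:
    # Placeholder: in the real app this would call the LoRA-finetuned GPT2-medium model.
    return f"(model output for {instruction!r} at temperature={temperature})"

demo = gr.Interface(
    fn=generate_response,
    inputs=[gr.Textbox(label="Instruction"), gr.Slider(0.0, 2.0, value=0.0, label="Temperature")],
    outputs=gr.Textbox(label="Response"),
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=5000)
```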
- Build the image:

  ```bash
  docker build -t flask-gpt -f api_server/Dockerfile .
  ```

- Run the container:

  ```bash
  docker run -p 5000:5000 --name flask-cont-1 flask-gpt
  ```

  and start giving instructions:
- The bot "knows" the capital of Bulgaria:

  ```bash
  curl -X POST http://0.0.0.0:5000/predict -H "Content-Type: application/json" -d '{"instruction": "What is the capital of Bulgaria?"}'
  ```

  I get the answer:

  ```json
  {"body":"{\"answer\": \"### Response:\\nThe capital of Bulgaria is Sofia.<|endoftext|>\"}","headers":{"Content-Type":"application/json"},"statusCode":200}
  ```

  which is correct, even though there is no entry relating to Bulgaria in the fine-tuning instructions.
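The same request can also be sent from Python; a small sketch using `requests`, assuming the container is reachable on `localhost:5000`:

```python
import requests

payload = {"instruction": "What is the capital of Bulgaria?"}
resp = requests.post("http://localhost:5000/predict", json=payload, timeout=60)
print(resp.json())
```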
- Let's see if it can deal with inputs:

  ```bash
  curl -X POST http://0.0.0.0:5000/predict -H "Content-Type: application/json" -d '{"instruction": "Classify an input string as either a noun or a verb.", "input": "Dance"}'
  ```

  I got

  ```json
  {"body":"{\"answer\": \"### Response:\\nDance is a verb.<|endoftext|>\"}","headers":{"Content-Type":"application/json"},"statusCode":200}
  ```

  which is correct.
- Interestingly, the bot seems to "think" it's a student from Berkeley:

  ```bash
  curl -X POST http://0.0.0.0:5000/predict -H "Content-Type: application/json" -d '{"instruction": "who are you"}'
  ```

  ```json
  {"body":"{\"answer\": \"?\\n\\n### Response:\\nI am a student at the University of California, Berkeley.<|endoftext|>\"}","headers":{"Content-Type":"application/json"},"statusCode":200}
  ```
- The decoding above is greedy, so I tried stochastic decoding with `temperature=1`:

  ```bash
  curl -X POST http://0.0.0.0:5000/predict -H "Content-Type: application/json" -d '{"instruction": "who are you", "temperature": 1}'
  ```

  and I got

  ```json
  {"body":"{\"answer\": \"?\\n\\n### Response:\\nI am a member of the Royal Society of London.<|endoftext|>\"}","headers":{"Content-Type":"application/json"},"statusCode":200}
  ```

  so the stochastic decoding functionality works!
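Temperature sampling itself is standard; a minimal sketch of what happens at each decoding step (the repo's actual implementation may differ in details):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.0) -> torch.Tensor:
    """Pick the next token id from a [vocab_size] vector of logits."""
    if temperature == 0.0:
        return torch.argmax(logits)  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature rescales the distribution
    return torch.multinomial(probs, num_samples=1).squeeze()
```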
- I fine-tuned a `gpt2-medium` checkpoint from OpenAI's Hugging Face repo.
- First, `cd` into the root of the repo.
- Second, run `export PYTHONPATH=.`
- Third, run `wandb login` to log in to wandb with your API key.
- Fourth, run `export WANDB_START_METHOD="thread"`, otherwise a weird threading exception occurs. For more info see this issue.
- Fifth, run the training command to train on GPU:

  ```bash
  python training/run_finetuning.py fit --config finetune_config.yaml --trainer.accelerator=gpu --trainer.devices=1 --trainer.max_epochs=48 --trainer.check_val_every_n_epoch=2 --trainer.log_every_n_steps=5 --data.num_workers=4 --my_model_checkpoint.every_n_epochs=4 --model.lr=3e-4 --model.do_lora=true --model.lora_rank=8 --model.from_pretrained_model=gpt2-medium --data.batch_size=128 --trainer.precision=bf16-true
  ```
- I checked `lora_rank` in `[8, 16, 32, 64]` and there wasn't much difference in loss or generated responses, so I stuck with `lora_rank=8` for my final model. The final validation loss with this setting is around `0.8672` and the final training loss is around `0.7617`.
- The fine-tuning was done on a single `L4` GPU, and each run took around 5 minutes in `bfloat16` format (and around 9 minutes in full precision).
The same dataset is used as in chapter 7 of the book. I think it's a subset of the Alpaca dataset.
Link to the data in json format is here.
The sizes of the splits were:

```json
{
    "train_len": 935,
    "val_len": 55,
    "test_len": 110
}
```

as per the `metadata.json` file.
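The `### Response:` marker in the answers above comes from the Alpaca-style prompt format used in chapter 7 of the book; a sketch of that formatting (the exact template string used in this repo is an assumption):

```python
def format_prompt(instruction: str, input_text: str = "") -> str:
    """Alpaca-style prompt as in chapter 7 of the book (template assumed, not copied from the repo)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
    )
    if input_text:
        prompt += f"\n\n### Input:\n{input_text}"
    return prompt + "\n\n### Response:\n"
```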
- The logs from generating responses to a subset of the validation set at the end of training:

  All generations ended with the `<|endoftext|>` token, so the model "knew" when to stop and did not require truncation.
- Training loss for `lora_rank in [8, 16, 32, 64]`. Roughly the same for all settings of `lora_rank`.
- Validation loss for `lora_rank in [8, 16, 32, 64]`.
- GPU utilisation:
- Interesting error for a torch deterministic run of `nn.functional.cross_entropy`. Resolved by setting `--trainer.deterministic=false`.

  ```
  return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: nll_loss2d_forward_out_cuda_template does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
  ```
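The error message mentions a `warn_only=True` option; an alternative to switching determinism off entirely would be the call below, which keeps deterministic algorithms where they exist and only warns for ops that lack them (a sketch, not what this repo ended up using):

```python
import torch

# Deterministic where possible; warn (instead of raising) for ops like
# nll_loss2d that have no deterministic CUDA implementation.
torch.use_deterministic_algorithms(True, warn_only=True)
```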
- A quick debug run with limited batches can be launched with:

  ```bash
  python training/run_finetuning.py fit --config finetune_config.yaml --trainer.accelerator=auto --trainer.devices=1 --trainer.max_epochs=2 --trainer.check_val_every_n_epoch=1 --trainer.log_every_n_steps=25 --data.num_workers=1 --my_model_checkpoint.every_n_epochs=2 --model.lr=5e-5 --model.do_lora=true --model.from_pretrained_model=gpt2-medium --trainer.limit_train_batches=2 --trainer.limit_val_batches=2
  ```
- There is a lot more memory caching going on in the parallel attention implementation compared to the for-loop-over-heads implementation. This makes sense, since each per-head computation is much smaller than doing everything at once; PyTorch and TensorFlow seem to cache intermediate results.
Pre-norm vs Post-norm paper here
- The dataset for testing this was "The Verdict" as per chapter 5 of the "Build a Large Language Model from Scratch" book.
- Pre-norm looks to work a lot better than post-norm. When pre-training `gpt2-small` from scratch for 10 epochs, I got a `0.7651` loss with `pre-norm` and `6.0799` with `post-norm`.
- 10 epochs of training with `pre-norm` led to an `Average batch train loss in epoch` of `0.7651`.
  - Given the prompt "Every effort moves you", the generation via sampling with `temperature=1` is:

    ```
    Every effort moves you?"\n\n"Yes--quite insensible to the irony. She wanted him vindicated--and by me!"\n\nHe laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I'
    ```

- 10 epochs of training with `post-norm` led to an `Average batch train loss in epoch` of `6.0799`.
  - Given the prompt "Every effort moves you", the generation via sampling with `temperature=1` is:

    ```
    Every effort moves you,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    ```
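For reference, the difference between the two comes down to where `LayerNorm` sits relative to the residual connection; a minimal sketch (block internals simplified, not this repo's exact classes):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """x + sublayer(norm(x)) -- normalise before the sublayer (more stable training)."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm, self.sublayer = nn.LayerNorm(dim), sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """norm(x + sublayer(x)) -- normalise after the residual add (original Transformer)."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm, self.sublayer = nn.LayerNorm(dim), sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))
```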
- Fresh inits, as expected, worked better. I don't know why the PyTorch implementation of `TransformerDecoder` opts for clones of `TransformerDecoderLayer`, as per this.
If you don't register non-parameter members of the class as buffers, they are not moved to the correct device when doing `model = model.to(device)`. Before I registered the mask as a buffer, my program was complaining that I had some tensors on `cuda` and others on `cpu`:

```python
mask = utils.get_subsequent_mask(context_length)
self.register_buffer("mask", mask, persistent=True)
```
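A self-contained illustration of why the buffer matters (`utils.get_subsequent_mask` is the repo's helper; here a causal mask is built inline, and the class name is made up for the example):

```python
import torch
import torch.nn as nn

class MaskedBlock(nn.Module):
    def __init__(self, context_length: int):
        super().__init__()
        causal_mask = torch.triu(
            torch.ones(context_length, context_length, dtype=torch.bool), diagonal=1
        )
        # Registered buffer: follows the module in .to(device) and is saved in the state_dict.
        self.register_buffer("mask", causal_mask, persistent=True)
        # A plain attribute like `self.mask = causal_mask` would stay on the CPU after .to("cuda").

block = MaskedBlock(context_length=8)
if torch.cuda.is_available():
    block = block.to("cuda")
print(block.mask.device)  # matches the module's device because it is a buffer
```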
Since I use `nn.Embedding` layers, I used to call their forward pass like:

```python
return x + self.embed(torch.arange(x.size(-2)))
```

However, when training on GPU, `torch.arange` returns a tensor on the CPU, so I again got tensors on different devices (cuda and cpu). I fixed this by doing:

```python
return x + self.embed.weight[:x.size(-2), :]
```

which just adds the first `x.size(-2)` embedding vectors.
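An equivalent fix is to create the index tensor directly on the input's device; a small sketch of a positional-embedding module showing both options (the class and attribute names are illustrative, not the repo's):

```python
import torch
import torch.nn as nn

class PositionalEmbedding(nn.Module):
    def __init__(self, context_length: int, emb_dim: int):
        super().__init__()
        self.embed = nn.Embedding(context_length, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Option 1: slice the weight matrix directly (no new tensor that could land on the wrong device).
        return x + self.embed.weight[: x.size(-2), :]
        # Option 2: build the position ids on the same device as the input.
        # positions = torch.arange(x.size(-2), device=x.device)
        # return x + self.embed(positions)
```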