Repository for the tutorial on creating a language model from scratch: https://substack.com/inbox/post/143369064. It includes scripts for training a tokenizer, setting up a transformer encoder, and running the training process on your own dataset.
- Custom Tokenizer Training: Train a byte pair encoding (BPE) tokenizer to adapt the vocabulary to your specific domain. You can also add your own special tokens there.
- Encoder-Only Transformer Model: Implements a custom transformer encoder that produces an embedding for each token. Depending on your data, you can choose the number of heads and layers, giving you full control over the model size (a minimal sketch follows the file list below).
- data/: contains training and test datasets in plain text format.
- resources/: Includes the trained tokenizer file (`bpe_tokenizer_banking77.json`).
- main.py: Main training script for the language model.
- preprocessing.py: Script to preprocess raw text data before it is fed to the model. Add any special tokens you need here.
- train_tokenizer.py: Script to train the custom tokenizer.
- transformer_model.py: Custom transformer encoder model.
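
For a rough idea of what such an encoder-only model can look like, here is a minimal PyTorch sketch with a configurable number of heads and layers. The class name, dimensions, and defaults below are illustrative placeholders, not the actual contents of transformer_model.py.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Minimal encoder-only transformer: token + position embeddings -> stacked self-attention layers."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, input_ids, attention_mask=None):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        # src_key_padding_mask is True at positions that should be ignored (padding)
        pad_mask = (attention_mask == 0) if attention_mask is not None else None
        return self.encoder(x, src_key_padding_mask=pad_mask)  # (batch, seq_len, d_model)

# Example: one embedding vector per token for a toy batch
model = ToyEncoder(vocab_size=30000)
ids = torch.randint(0, 30000, (2, 16))
print(model(ids).shape)  # torch.Size([2, 16, 256])
```

Scaling d_model, n_heads, and n_layers up or down is how you trade model capacity against training cost.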
- Clone the repository and navigate to the project directory:

  ```bash
  git clone https://github.com/AliLotfi92/LangModelBuilder.git
  cd LangModelBuilder
  ```
- Install the required dependencies:

  ```bash
  pip install -r requirement.txt
  ```
- Make sure your dataset is in the `data/` directory. I used the Banking77 dataset (`data/banking77_corpus.txt`) as the training corpus.
To train a new tokenizer, run `train_tokenizer.py`. The trained tokenizer file will be saved in the `resources/` directory.
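
For reference, training a BPE tokenizer with the Hugging Face tokenizers library looks roughly like the sketch below. The vocabulary size and special tokens are assumptions you should adapt to your domain; the actual train_tokenizer.py may differ in its details.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build an empty BPE tokenizer and train it on the plain-text corpus in data/.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,  # assumed value; match the vocab_size you plan to use downstream
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["data/banking77_corpus.txt"], trainer=trainer)
tokenizer.save("resources/bpe_tokenizer_banking77.json")
```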
To train the language model, run `main.py`:

```bash
torchrun --nproc_per_node=4 main.py
```

The `main.py` script includes:
- Tokenization: Uses the custom-trained tokenizer from `resources/bpe_tokenizer_banking77.json`.
- Dataset Preparation: Builds a dynamically masked, self-supervised dataset using `LineByLineTextDataset` (an illustrative sketch follows).
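
As an illustration of how those two pieces can fit together (not necessarily how main.py wires them up), the sketch below wraps the trained tokenizer file for use with transformers and builds dynamically masked batches; the special-token names, block size, batch size, and masking probability are placeholder assumptions.

```python
from torch.utils.data import DataLoader
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    PreTrainedTokenizerFast,
)

# Wrap the trained BPE tokenizer so transformers utilities can use it.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="resources/bpe_tokenizer_banking77.json",
    unk_token="[UNK]", pad_token="[PAD]", mask_token="[MASK]",
)

# One training example per line of the corpus, truncated to block_size tokens.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data/banking77_corpus.txt",
    block_size=128,
)

# The collator re-samples the [MASK] positions for every batch (dynamic masking).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collator)
```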
Note: you can easily add `mlflow` to monitor the training process or to version your runs.
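
For example, a hook like this (hypothetical, with placeholder values) is enough to track hyperparameters and the training loss in mlflow:

```python
import mlflow

mlflow.set_experiment("lang-model-builder")  # arbitrary experiment name

with mlflow.start_run():
    # Log the run's hyperparameters once.
    mlflow.log_params({"block_size": 128, "mlm_prob": 0.15, "batch_size": 32})
    # In the real training loop these values would come from each optimization step.
    for step, loss in enumerate([2.31, 1.94, 1.52]):
        mlflow.log_metric("train_loss", loss, step=step)
```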
You can adjust the following parameters in `main.py`:

- `block_size`: Sets the maximum token length for input sequences.
- `mlm_prob`: Specifies the masking probability for the MLM task.
- `batch_size`: Adjusts the batch size in the `DataLoader` for efficient training.
- `vocab_size`: Sets the tokenizer vocabulary size during tokenizer training.
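
As a quick reference, a set of illustrative values might look like the following; these are assumptions, not the defaults in main.py.

```python
# Illustrative values only; the actual settings live in main.py and may differ.
config = {
    "block_size": 128,    # max tokens per input sequence (longer lines are truncated)
    "mlm_prob": 0.15,     # fraction of tokens masked for the MLM objective
    "batch_size": 32,     # per-process batch size in the DataLoader
    "vocab_size": 30000,  # must match the size the tokenizer was trained with
}
```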
Want to see the difference between your custom model (trained without labels) and pretrained BERT?
Using your custom LM embeddings, you can review the quality of your labels for your downstream task. Note that the labels were not used during training at all.
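
One simple way to make that comparison (a sketch, not the repo's evaluation code): mean-pool token embeddings into sentence vectors for each model and check how well each embedding space separates your labels, for example with a silhouette score. The example texts, labels, and pooling choice below are placeholders.

```python
import torch
from sklearn.metrics import silhouette_score
from transformers import AutoModel, AutoTokenizer

def mean_pool(hidden, mask):
    # Average token embeddings over non-padding positions -> one vector per sentence.
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Placeholder sentences and labels standing in for your downstream task.
texts = [
    "my card payment was declined",
    "the card payment did not go through",
    "how do I top up my account",
    "I want to add money to my account",
]
labels = [0, 0, 1, 1]

# Sentence embeddings from pretrained BERT.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    bert_vecs = mean_pool(bert(**batch).last_hidden_state, batch["attention_mask"])

# Higher silhouette score = the labels form cleaner clusters in that embedding space.
print("BERT silhouette:", silhouette_score(bert_vecs.numpy(), labels))
# Repeat the same pooling and scoring with your custom encoder's outputs and compare.
```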
Feel free to contribute to this repository.