Repository for the tutorial on creating a language model from scratch: https://substack.com/inbox/post/143369064. It includes scripts for training a tokenizer, setting up a transformer encoder, and running the training process on your own dataset.
- Custom Tokenizer Training: Train a byte pair encoding (BPE) tokenizer to adapt the vocabulary to your specific domain. You can also add your own special tokens there.
- Encoder-Only Transformer Model: Implements a custom transformer encoder that produces an embedding for each token. Depending on your data, you can choose the number of heads and layers, giving you full control over the model size (a minimal sketch follows the file list below).
- data/: contains training and test datasets in plain text format.
- resources/: Includes the trained tokenizer file (`bpe_tokenizer_banking77.json`).
- main.py: Main training script for the language model.
- preprocessing.py: Script to preprocess raw text data before it is fed to the model. Add any special tokens you need here.
- train_tokenizer.py: Script to train the custom tokenizer.
- transformer_model.py: Custom transformer encoder model.
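
For a rough idea of what such an encoder-only model can look like, here is a minimal PyTorch sketch with a configurable number of heads and layers. The class name, dimensions, and defaults below are illustrative placeholders, not the actual contents of transformer_model.py.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Minimal encoder-only transformer: token + position embeddings -> stacked self-attention layers."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, input_ids, attention_mask=None):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        # src_key_padding_mask is True at positions that should be ignored (padding)
        pad_mask = (attention_mask == 0) if attention_mask is not None else None
        return self.encoder(x, src_key_padding_mask=pad_mask)  # (batch, seq_len, d_model)

# Example: one embedding vector per token for a toy batch
model = ToyEncoder(vocab_size=30000)
ids = torch.randint(0, 30000, (2, 16))
print(model(ids).shape)  # torch.Size([2, 16, 256])
```

Scaling d_model, n_heads, and n_layers up or down is how you trade model capacity against training cost.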
- Clone the repository and navigate to the project directory:

  ```bash
  git clone https://github.com/AliLotfi92/LangModelBuilder.git
  cd LangModelBuilder
  ```
- Install the required dependencies:

  ```bash
  pip install -r requirement.txt
  ```
- Make sure your dataset is in the `data/` directory. I used the Banking77 dataset (`data/banking77_corpus.txt`) as the training corpus.
To train a new tokenizer, run `train_tokenizer.py`. The trained tokenizer file will be saved in the `resources/` directory.
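
For reference, training a BPE tokenizer with the Hugging Face tokenizers library looks roughly like the sketch below. The vocabulary size and special tokens are assumptions you should adapt to your domain; the actual train_tokenizer.py may differ in its details.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build an empty BPE tokenizer and train it on the plain-text corpus in data/.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30000,  # assumed value; match the vocab_size you plan to use downstream
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["data/banking77_corpus.txt"], trainer=trainer)
tokenizer.save("resources/bpe_tokenizer_banking77.json")
```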
To train the language model, run `main.py`:

```bash
torchrun --nproc_per_node=4 main.py
```

The `main.py` script includes:
- Tokenization: Uses the custom-trained tokenizer from `resources/bpe_tokenizer_banking77.json`.
- Dataset Preparation: Builds a dynamically masked, self-supervised dataset using `LineByLineTextDataset` (an illustrative sketch follows).
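
As an illustration of how those two pieces can fit together (not necessarily how main.py wires them up), the sketch below wraps the trained tokenizer file for use with transformers and builds dynamically masked batches; the special-token names, block size, batch size, and masking probability are placeholder assumptions.

```python
from torch.utils.data import DataLoader
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    PreTrainedTokenizerFast,
)

# Wrap the trained BPE tokenizer so transformers utilities can use it.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="resources/bpe_tokenizer_banking77.json",
    unk_token="[UNK]", pad_token="[PAD]", mask_token="[MASK]",
)

# One training example per line of the corpus, truncated to block_size tokens.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data/banking77_corpus.txt",
    block_size=128,
)

# The collator re-samples the [MASK] positions for every batch (dynamic masking).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collator)
```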
Note: you can easily add `mlflow` to monitor the training process or to version your runs.
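
For example, a hook like this (hypothetical, with placeholder values) is enough to track hyperparameters and the training loss in mlflow:

```python
import mlflow

mlflow.set_experiment("lang-model-builder")  # arbitrary experiment name

with mlflow.start_run():
    # Log the run's hyperparameters once.
    mlflow.log_params({"block_size": 128, "mlm_prob": 0.15, "batch_size": 32})
    # In the real training loop these values would come from each optimization step.
    for step, loss in enumerate([2.31, 1.94, 1.52]):
        mlflow.log_metric("train_loss", loss, step=step)
```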
You can adjust the following parameters in `main.py`:

- `block_size`: Sets the maximum token length for input sequences.
- `mlm_prob`: Specifies the masking probability for the MLM task.
- `batch_size`: Adjusts the batch size in the `DataLoader` for efficient training.
- `vocab_size`: Sets the tokenizer vocabulary size during tokenizer training.
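
As a quick reference, a set of illustrative values might look like the following; these are assumptions, not the defaults in main.py.

```python
# Illustrative values only; the actual settings live in main.py and may differ.
config = {
    "block_size": 128,    # max tokens per input sequence (longer lines are truncated)
    "mlm_prob": 0.15,     # fraction of tokens masked for the MLM objective
    "batch_size": 32,     # per-process batch size in the DataLoader
    "vocab_size": 30000,  # must match the size the tokenizer was trained with
}
```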
Want to see the difference between your custom model (trained without labels) and pretrained BERT?
Using your custom LM embeddings, you can review the quality of your labels for your downstream task. Note that the labels were not used during training at all.
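
One simple way to make that comparison (a sketch, not the repo's evaluation code): mean-pool token embeddings into sentence vectors for each model and check how well each embedding space separates your labels, for example with a silhouette score. The example texts, labels, and pooling choice below are placeholders.

```python
import torch
from sklearn.metrics import silhouette_score
from transformers import AutoModel, AutoTokenizer

def mean_pool(hidden, mask):
    # Average token embeddings over non-padding positions -> one vector per sentence.
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Placeholder sentences and labels standing in for your downstream task.
texts = [
    "my card payment was declined",
    "the card payment did not go through",
    "how do I top up my account",
    "I want to add money to my account",
]
labels = [0, 0, 1, 1]

# Sentence embeddings from pretrained BERT.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    bert_vecs = mean_pool(bert(**batch).last_hidden_state, batch["attention_mask"])

# Higher silhouette score = the labels form cleaner clusters in that embedding space.
print("BERT silhouette:", silhouette_score(bert_vecs.numpy(), labels))
# Repeat the same pooling and scoring with your custom encoder's outputs and compare.
```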
Feel free to contribute to this repository.