This project is a Scala implementation of a large language model (LLM) training framework. It provides a set of modules and utilities for building, training, and evaluating language models based on the transformer architecture.
Inspired by Andrej Karpathy's llm.c project, this Scala version aims to provide a clean, efficient, and extensible codebase for training language models. Key features include:
- Transformer-based language model architecture
- Multi-head self-attention mechanism
- Positional encoding for sequence information
- Feed-forward neural network layers
- Embedding layer for input tokens
- Layer normalization for stable training
- GELU activation function (a minimal sketch follows this list)
- Adam optimizer for parameter updates
- Data loading and batching utilities
- Tokenization and vocabulary handling
- Test suite for all modules
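As a concrete illustration of one of these building blocks, here is a minimal, self-contained GELU activation (tanh approximation) in plain Scala. It is an illustrative sketch only, not the exact code in `GELU.scala`:

```scala
object GeluExample {
  // GELU activation (tanh approximation), applied element-wise in transformer MLP blocks.
  def gelu(x: Double): Double =
    0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.Pi) * (x + 0.044715 * math.pow(x, 3))))
}
```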
The project follows a standard Scala project structure:
llm-training/
├── build.sbt
└── src/
    ├── main/
    │   └── scala/
    │       └── llm/
    │           ├── Config.scala
    │           ├── Model.scala
    │           ├── Attention.scala
    │           ├── LayerNorm.scala
    │           ├── GELU.scala
    │           ├── Embedding.scala
    │           ├── PositionalEncoding.scala
    │           ├── FeedForward.scala
    │           ├── Transformer.scala
    │           ├── Optimizer.scala
    │           ├── DataLoader.scala
    │           ├── Tokenizer.scala
    │           ├── Utils.scala
    │           └── Main.scala
    └── test/
        └── scala/
            └── llm/
                ├── ModelSpec.scala
                ├── AttentionSpec.scala
                ├── LayerNormSpec.scala
                ├── GELUSpec.scala
                ├── EmbeddingSpec.scala
                ├── PositionalEncodingSpec.scala
                ├── FeedForwardSpec.scala
                ├── TransformerSpec.scala
                ├── OptimizerSpec.scala
                ├── DataLoaderSpec.scala
                ├── TokenizerSpec.scala
                └── UtilsSpec.scala
- `src/main/scala/llm/`: Contains the main source code for the language model implementation.
- `src/test/scala/llm/`: Contains the test specifications for each module.
- `build.sbt`: The build configuration file for the Scala project.
- `project/`: Contains the sbt version and plugin configuration.
To build and run the project you will need:
- Scala 2.13.8
- sbt 1.5.5
1. Clone the repository:
   `git clone https://github.com/wassemgtk/llm.scala.git`
2. Navigate to the project directory:
   `cd llm-training`
3. Compile the project:
   `sbt compile`
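If you need to recreate the build definition from scratch, a minimal `build.sbt` consistent with the prerequisites above might look like the sketch below. The ScalaTest dependency is assumed because the project ships a test suite, and the Logback dependency is assumed because the project mentions Logback logging; the repository's actual build file may differ:

```scala
// Minimal build.sbt sketch -- check the repository for the authoritative version.
ThisBuild / scalaVersion := "2.13.8"

lazy val root = (project in file("."))
  .settings(
    name := "llm-training",
    libraryDependencies ++= Seq(
      "ch.qos.logback" %  "logback-classic" % "1.2.11",        // logging backend (assumed)
      "org.scalatest"  %% "scalatest"       % "3.2.15" % Test  // test framework (assumed)
    )
  )
```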
To train the language model, follow these steps:
1. Prepare your training data:
   - Place your training data file (e.g., `tiny_shakespeare_train.bin`) in the `data/` directory.
   - Update the `dataFile` value in `Main.scala` to point to your training data file.
2. Configure the model hyperparameters in the `Config` case class in `Config.scala`.
3. Run the training script:
   `sbt run`
4. Monitor the training progress and metrics logged to the console.
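Each training step ends with an Adam parameter update. The snippet below is a minimal, self-contained sketch of that update rule for a flat parameter vector; it is illustrative only, and the project's `Optimizer.scala` may expose a different API:

```scala
// Minimal Adam update for a flat parameter vector (illustrative sketch).
final class AdamSketch(size: Int, lr: Double = 3e-4,
                       beta1: Double = 0.9, beta2: Double = 0.999, eps: Double = 1e-8) {
  private val m = Array.fill(size)(0.0) // first-moment (mean) estimates
  private val v = Array.fill(size)(0.0) // second-moment (uncentered variance) estimates
  private var t = 0                     // update counter, used for bias correction

  def step(params: Array[Double], grads: Array[Double]): Unit = {
    t += 1
    var i = 0
    while (i < size) {
      m(i) = beta1 * m(i) + (1 - beta1) * grads(i)
      v(i) = beta2 * v(i) + (1 - beta2) * grads(i) * grads(i)
      val mHat = m(i) / (1 - math.pow(beta1, t)) // bias-corrected first moment
      val vHat = v(i) / (1 - math.pow(beta2, t)) // bias-corrected second moment
      params(i) -= lr * mHat / (math.sqrt(vHat) + eps)
      i += 1
    }
  }
}
```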
To generate text using a trained model, follow these steps:
1. Make sure you have a trained model checkpoint in the `checkpoints/` directory.
2. Update the `modelCheckpoint` value in `Main.scala` to point to your trained model checkpoint file.
3. Set the desired generation parameters (e.g., `maxNewTokens`, `temperature`) in the `Main` object.
4. Run the text generation script:
   `sbt run`
5. The generated text will be printed to the console.
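To give a sense of what a parameter like `temperature` does, here is a minimal, self-contained temperature-sampling routine over a vector of logits. It is an illustrative sketch, not the sampling code actually used in `Main.scala`:

```scala
import scala.util.Random

object SamplingSketch {
  // Sample a token index from raw logits after applying a temperature.
  // Lower temperature -> sharper distribution; higher temperature -> more random output.
  def sampleToken(logits: Array[Double], temperature: Double, rng: Random = new Random()): Int = {
    val scaled   = logits.map(_ / temperature)
    val maxLogit = scaled.max                      // subtract the max for numerical stability
    val exps     = scaled.map(l => math.exp(l - maxLogit))
    val z        = exps.sum
    val probs    = exps.map(_ / z)                 // softmax probabilities
    val r        = rng.nextDouble()
    var cum      = 0.0
    var i        = 0
    while (i < probs.length - 1 && cum + probs(i) < r) { cum += probs(i); i += 1 }
    i
  }
}
```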
To run the test suite and ensure the correctness of the implemented modules, use the following command:
sbt test
This will execute all the test specifications in the `src/test/scala/llm/` directory.
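Assuming the specs are written with ScalaTest 3.x (this is an assumption; the project may use a different test framework or style), an individual spec might look like the following, here exercising the GELU sketch shown earlier:

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// Illustrative AnyFlatSpec-style test; the real specs in src/test/scala/llm/ may differ.
class GeluExampleSpec extends AnyFlatSpec with Matchers {
  "gelu" should "be close to zero for large negative inputs" in {
    GeluExample.gelu(-10.0) should be (0.0 +- 1e-6)
  }

  it should "be approximately the identity for large positive inputs" in {
    GeluExample.gelu(10.0) should be (10.0 +- 1e-6)
  }
}
```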
The `Config` case class in `src/main/scala/llm/Config.scala` contains the hyperparameters and configuration settings for the language model. You can modify these values to experiment with different model architectures and training setups.
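For orientation, such a configuration typically bundles model-shape and training hyperparameters. The field names and default values below are hypothetical and are not taken from the actual `Config.scala`:

```scala
// Hypothetical configuration shape; the real Config.scala may use different fields.
case class ConfigSketch(
  vocabSize: Int       = 50257, // tokenizer vocabulary size
  blockSize: Int       = 256,   // maximum sequence length (context window)
  nLayer: Int          = 6,     // number of transformer blocks
  nHead: Int           = 6,     // attention heads per block
  nEmbd: Int           = 384,   // embedding / hidden dimension
  learningRate: Double = 3e-4,  // Adam learning rate
  batchSize: Int       = 32,    // sequences per training batch
  maxSteps: Int        = 5000   // total optimization steps
)
```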
During training, the model checkpoints will be saved in the `checkpoints/` directory. You can use these checkpoints to resume training from a previous state or to generate text using a trained model.
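As a rough illustration of what checkpointing a flat parameter vector can look like, here is a minimal save/load pair using Java serialization. This is purely a sketch; the checkpoint format actually written by this project is not specified here and may be entirely different:

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

object CheckpointSketch {
  // Write a flat parameter array to disk (illustrative; not the project's real format).
  def save(path: String, params: Array[Double]): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(params) finally out.close()
  }

  // Read a flat parameter array back from disk.
  def load(path: String): Array[Double] = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[Array[Double]] finally in.close()
  }
}
```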
The project uses the Logback logging library. You can configure the logging settings in the `src/main/resources/logback.xml` file.
Contributions to this project are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
This project is licensed under the MIT License.
- This project is inspired by the llm.c project by Andrej Karpathy.
- The transformer architecture is based on the paper "Attention Is All You Need" by Vaswani et al.
- The implementation draws inspiration from various open-source language model implementations in the Scala ecosystem.