This project implements and trains LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks on the Penn Tree Bank (PTB) dataset using PyTorch. The study compares different model configurations to evaluate their effectiveness in language modeling tasks.
- Dataset
- Model Architectures
- Model Variants
- Training Process
- Results
- Conclusions
- How to Train and Test
- Reference
The Penn Tree Bank (PTB) dataset is utilized for training, validating, and testing the language models. It consists of roughly:
- Training Set: 929,589 words
- Validation Set: 73,760 words
- Test Set: 82,430 words
The dataset consists of sequences of word tokens, each represented by an index into a vocabulary of the 10,000 most frequent words.
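For concreteness, below is a minimal sketch of how the PTB token stream might be reshaped into batches for truncated backpropagation through time. The helper names (`batchify`, `get_batch`) and the assumption that a word-to-index mapping already exists are illustrative, not the repository's exact code:

```python
import torch

def batchify(token_ids, batch_size, device="cpu"):
    """Reshape a 1-D list/tensor of token indices into (num_steps, batch_size)."""
    data = torch.tensor(token_ids, dtype=torch.long)
    n_batches = data.size(0) // batch_size
    data = data[: n_batches * batch_size]          # drop the ragged remainder
    return data.view(batch_size, -1).t().contiguous().to(device)

def get_batch(source, i, seq_len=20):
    """Return one (inputs, targets) pair; targets are inputs shifted by one token."""
    length = min(seq_len, len(source) - 1 - i)
    inputs = source[i : i + length]
    targets = source[i + 1 : i + 1 + length].reshape(-1)
    return inputs, targets
```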
The LSTM model is designed to capture long-term dependencies in sequential data. Its architecture includes:
- Embedding Layer: Converts token indices into 200-dimensional vectors.
- LSTM Layers: Two stacked LSTM layers, each with 200 hidden units.
- Dropout Layer: Applied after the embedding layer and between the stacked LSTM layers, with a configurable dropout probability to prevent overfitting.
- Fully Connected Layer: Maps LSTM outputs to the vocabulary size (10,000 unique tokens), producing logits for each token prediction.
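A minimal PyTorch sketch of this architecture, assuming the hyperparameters listed later (200-dimensional embeddings, two stacked layers); the class and argument names are illustrative rather than the repository's exact implementation:

```python
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_size=200, hidden_size=200,
                 num_layers=2, dropout=0.5):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # `dropout` here is applied between the two stacked LSTM layers.
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.drop(self.embedding(x))        # (seq_len, batch, embed_size)
        output, hidden = self.lstm(emb, hidden)   # (seq_len, batch, hidden_size)
        logits = self.fc(self.drop(output))       # (seq_len, batch, vocab_size)
        return logits.view(-1, logits.size(-1)), hidden
```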
The GRU model offers a streamlined alternative to LSTM with fewer parameters:
- Embedding Layer: Similar to LSTM, it transforms token indices into 200-dimensional vectors.
- GRU Layers: Two stacked GRU layers with 200 hidden units each.
- Dropout Layer: Applied post-embedding and in GRU hidden layers with configurable dropout probability.
- Fully Connected Layer: Maps GRU outputs to the vocabulary size (10,000 unique tokens), producing logits for each token prediction.
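The GRU variant differs only in the recurrent cell: `nn.GRU` keeps a single hidden-state tensor instead of the LSTM's `(h, c)` pair. A corresponding sketch, again illustrative rather than the repository's exact class:

```python
class GRULanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_size=200, hidden_size=200,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, hidden_size, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.drop(self.embedding(x))
        output, hidden = self.gru(emb, hidden)    # hidden is a single tensor
        logits = self.fc(self.drop(output))
        return logits.view(-1, logits.size(-1)), hidden
```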
Four experimental configurations were evaluated to assess the impact of regularization techniques:
- LSTM without Dropout
  - Architecture: Standard LSTM with no dropout.
  - Purpose: Baseline for model performance.
- LSTM with Dropout
  - Architecture: LSTM with a dropout probability of 0.5 applied after the embedding layer and between the hidden layers.
  - Purpose: Evaluate dropout's role in preventing overfitting and improving generalization.
- GRU without Dropout
  - Architecture: Standard GRU with no dropout.
  - Purpose: Baseline for model performance.
- GRU with Dropout
  - Architecture: GRU with a dropout probability of 0.3 applied after the embedding layer.
  - Purpose: Evaluate dropout's role in preventing overfitting and improving generalization.
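These four configurations could be collected into a small list of settings like the sketch below; pairing each learning rate with a configuration in the order the variants and table values are listed is an assumption, not something stated explicitly in this document:

```python
# Hypothetical configuration list for the four experiments.
experiments = [
    {"name": "LSTM (no dropout)",  "model": "lstm", "dropout": 0.0, "lr": 1.6},
    {"name": "LSTM (dropout 0.5)", "model": "lstm", "dropout": 0.5, "lr": 3.4},
    {"name": "GRU (no dropout)",   "model": "gru",  "dropout": 0.0, "lr": 1.5},
    {"name": "GRU (dropout 0.3)",  "model": "gru",  "dropout": 0.3, "lr": 1.8},
]
```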
| Hyperparameter | Value |
|---|---|
| Batch Size | 20 |
| Sequence Length | 20 |
| Vocabulary Size | 10,000 |
| Embedding Size | 200 |
| Hidden Size | 200 |
| Number of Layers | 2 |
| Dropout Probability | 0.0 / 0.5 / 0.3 |
| Learning Rate | 1.6 / 3.4 / 1.5 / 1.8 |
| Optimizer | SGD |
| Learning Rate Scheduler | LambdaLR |
| Number of Epochs | 13 / 20 |
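A sketch of the optimizer and scheduler setup named in the table, using the LSTM sketch above. The exact decay function passed to LambdaLR is not specified here, so the lambda below is an assumed example:

```python
import torch

model = LSTMLanguageModel(dropout=0.5)
optimizer = torch.optim.SGD(model.parameters(), lr=3.4)
# Assumed schedule: hold the base learning rate for a few epochs, then decay geometrically.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.8 ** max(0, epoch - 4)
)
```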
For each epoch:
- Training Phase:
  - Forward pass through the model.
  - Compute the loss and perform backpropagation.
  - Update model parameters using SGD.
- Validation Phase:
  - Evaluate the model on the validation set.
  - Calculate perplexity to assess language modeling capability.
- Testing Phase:
  - After training completes, evaluate the best model on the test set.
- Checkpointing:
  - Save the model state whenever validation perplexity improves.
- Learning Rate Adjustment:
  - Update the learning rate according to the scheduler.
- Logging and Visualization:
  - Record perplexity scores.
  - Generate and save perplexity plots for analysis.
  - Generate and save a table of the best perplexities.
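A condensed sketch of one such pass (training or evaluation), reusing `get_batch` from the data sketch above. Perplexity is computed as the exponential of the mean cross-entropy loss; the gradient-clipping value is an assumption not stated in this document:

```python
import math
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def run_epoch(model, data, seq_len=20, optimizer=None):
    """One pass over `data`; trains when an optimizer is given, else evaluates.
    Returns perplexity = exp(mean cross-entropy loss)."""
    if optimizer is not None:
        model.train()
    else:
        model.eval()
    total_loss, total_tokens = 0.0, 0
    hidden = None
    with torch.set_grad_enabled(optimizer is not None):
        for i in range(0, data.size(0) - 1, seq_len):
            inputs, targets = get_batch(data, i, seq_len)
            logits, hidden = model(inputs, hidden)
            # Detach so gradients do not flow across batch boundaries
            # (tuple for LSTM state, single tensor for GRU state).
            hidden = (tuple(h.detach() for h in hidden)
                      if isinstance(hidden, tuple) else hidden.detach())
            loss = criterion(logits, targets)
            if optimizer is not None:
                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # assumed value
                optimizer.step()
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)
```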
| Model | Dropout Probability | Train Perplexity | Validation Perplexity | Test Perplexity |
|---|---|---|---|---|
| LSTM | 0.0 | 69.39 | 122.04 | 118.74 |
| LSTM | 0.5 | 105.38 | 109.08 | 105.22 |
| GRU | 0.0 | 51.11 | 121.48 | 117.32 |
| GRU | 0.3 | 72.83 | 104.29 | 100.62 |
- Base Models (LSTM and GRU without Dropout): Achieved lower training perplexities but higher validation and test perplexities, indicating overfitting.
- Regularized Models (with Dropout): Showed higher training perplexities but significantly better validation and test perplexities, demonstrating improved generalization.
- GRU vs. LSTM: GRU models trained faster, likely due to their simpler architecture, and achieved better validation and test perplexities than their LSTM counterparts.
- Dropout Effectiveness: Dropout effectively reduced overfitting, as evidenced by better performance on the validation and test sets for both architectures.
Training and Evaluating the Models: To train and evaluate the models, run the final block of code in the provided notebook. This block initializes the four model configurations and runs a training and testing loop for each one over a predefined number of epochs. During each epoch, the code evaluates the model on the training, validation, and test sets to monitor performance and generalization. The best checkpoint for each configuration is saved based on validation perplexity, and the results are used to generate convergence plots and a final comparison table of best perplexities.
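A hedged sketch of what that final block roughly does, building on the earlier sketches (`batchify`/`get_batch`, the model classes, `run_epoch`, and the `experiments` list); the checkpoint file names, epoch count, and LR schedule below are illustrative, and `train_data`, `valid_data`, and `test_data` are assumed to come from the `batchify` sketch:

```python
import torch

for cfg in experiments:
    model_cls = LSTMLanguageModel if cfg["model"] == "lstm" else GRULanguageModel
    model = model_cls(dropout=cfg["dropout"])
    optimizer = torch.optim.SGD(model.parameters(), lr=cfg["lr"])
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: 0.8 ** max(0, epoch - 4))
    best_val = float("inf")
    for epoch in range(13):                       # the table lists 13 or 20 epochs
        train_ppl = run_epoch(model, train_data, optimizer=optimizer)
        val_ppl = run_epoch(model, valid_data)
        if val_ppl < best_val:                    # checkpoint on validation improvement
            best_val = val_ppl
            torch.save(model.state_dict(), f"best_{cfg['model']}_{cfg['dropout']}.pt")
        scheduler.step()
    test_ppl = run_epoch(model, test_data)
    print(f"{cfg['name']}: val {best_val:.2f}, test {test_ppl:.2f}")
```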