This repository contains a PyTorch implementation of a Transformer model with sparse attention patterns. The goal is to explore sparse attention mechanisms that improve the efficiency of Transformer models while maintaining performance.
- Local sparse attention mechanism (window-based); see the sketch below the feature list
- Configurable model architecture (layers, heads, dimensions)
- Basic positional encoding
- Simple training loop for sequence prediction
- CPU support
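To make the window-based attention concrete, here is a minimal, self-contained PyTorch sketch of the idea. The function name `local_attention` and the `window_size` argument are illustrative assumptions, not the repository's actual API.

```python
# Illustrative sketch only: local_attention and window_size are assumed names,
# not the repository's API.
import math
import torch

def local_attention(q, k, v, window_size=4):
    """Each query attends only to keys within +/- window_size positions."""
    seq_len, d_head = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)

    # Band mask: True where |i - j| <= window_size, False elsewhere.
    idx = torch.arange(seq_len)
    inside_window = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window_size
    scores = scores.masked_fill(~inside_window, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v

# Example: batch of 2 sequences, 16 tokens, 32-dimensional heads.
q = k = v = torch.randn(2, 16, 32)
print(local_attention(q, k, v).shape)  # torch.Size([2, 16, 32])
```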
```bash
# Clone the repository
git clone https://github.com/Sunsvea/cl-sparse-transformer.git
cd cl-sparse-transformer

# Install dependencies
pip install torch

# Run the example training script
python sparse_transformer.py
```
When you run the training script, you should see output similar to the following:
```
2025-01-31 10:15:23,456 - INFO - Starting training...
2025-01-31 10:15:23,789 - INFO - Generated 1000 sample sequences...
2025-01-31 10:15:23,901 - INFO - Split data into 800 train and 200 validation sequences
2025-01-31 10:15:24,123 - INFO - Epoch 1/5
2025-01-31 10:15:24,456 - INFO - Batch 0, Loss: 4.6573
2025-01-31 10:15:24,789 - INFO - Batch 10, Loss: 4.3291
2025-01-31 10:15:25,012 - INFO - Training Loss: 4.2845
2025-01-31 10:15:25,234 - INFO - Validation Loss: 4.1932
2025-01-31 10:15:25,345 - INFO - Saved new best model checkpoint
2025-01-31 10:15:25,456 - INFO - Epoch completed in 1.33s
[...]
2025-01-31 10:15:35,678 - INFO - Training completed in 12.22s
2025-01-31 10:15:35,789 - INFO - Best validation loss: 3.2456
```
The model saves checkpoints to ./checkpoints/ whenever the validation loss improves.
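To reuse the best model afterwards, something along these lines should work. The checkpoint file name and its contents are assumptions about how the training script saves state, so check sparse_transformer.py for the exact format.

```python
# Assumed checkpoint path and layout; adjust to match what the script writes.
import torch

checkpoint = torch.load("./checkpoints/best_model.pt", map_location="cpu")

# If a plain state_dict was saved:
#   model.load_state_dict(checkpoint)
# If a dict with extra metadata (optimizer state, epoch, ...) was saved:
#   model.load_state_dict(checkpoint["model_state_dict"])
```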
The current implementation includes:
- LocalSparseAttention: implements window-based sparse attention where each token attends only to its neighbors
- SparseTransformerBlock: a single transformer block with sparse attention
- SparseTransformer: the full model with an embedding layer and multiple transformer blocks
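As a rough illustration of how such pieces typically fit together, here is a hedged sketch of a block that wraps windowed attention and a feed-forward network in residual connections. The class name `SparseBlockSketch` and its constructor arguments are assumptions for illustration, not the repository's real classes or signatures.

```python
# Illustrative sketch: SparseBlockSketch is a hypothetical name, and the
# constructor arguments are assumptions, not the repository's real signature.
import torch
import torch.nn as nn

class SparseBlockSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=4, window_size=4, d_ff=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.window_size = window_size

    def forward(self, x):
        seq_len = x.size(1)
        idx = torch.arange(seq_len, device=x.device)
        # True marks pairs that must NOT attend (outside the local window).
        outside_window = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() > self.window_size

        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=outside_window)
        x = x + attn_out                   # residual around attention
        return x + self.ff(self.norm2(x))  # residual around feed-forward

# Example: 2 sequences of 16 tokens with d_model = 128.
block = SparseBlockSketch()
print(block(torch.randn(2, 16, 128)).shape)  # torch.Size([2, 16, 128])
```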
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
MIT
If you use this code in your research, please cite:
```bibtex
@software{sparse_transformer2025,
  author    = {Dean Coulstock},
  title     = {Sparse Transformer Implementation},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/Sunsvea/sparse-transformer}
}
```
- Dean Coulstock
- [email protected]
- LinkedIn: https://www.linkedin.com/in/dean-coulstock/
This implementation draws inspiration from:
- "Generating Long Sequences with Sparse Transformers" (Child et al., 2019)
- "Longformer: The Long-Document Transformer" (Beltagy et al., 2020)