Triformer is a high-performance deep learning library that implements transformer components using Triton kernels for efficient CUDA acceleration.
Note: It is not the best but it works :D
- 🚀 Highly optimized CUDA kernels via Triton
- 📊 Significant performance improvements over PyTorch implementations
- 🧮 Memory-efficient operations
- 🔧 Drop-in replacements for common PyTorch modules
```bash
pip install -U triformer
```
```python
import torch
from triformer import TritonLayerNorm

# Normalize a batch of hidden states on the GPU
batch_size, seq_len, hidden_dim = 32, 64, 512
x = torch.randn(batch_size, seq_len, hidden_dim).cuda()

layer_norm = TritonLayerNorm(hidden_dim).cuda()
output = layer_norm(x)
```
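Because TritonLayerNorm is positioned as a drop-in replacement for `torch.nn.LayerNorm`, a quick parity check against the PyTorch module makes a reasonable smoke test. This is a minimal sketch assuming TritonLayerNorm uses the same default epsilon and affine initialization (weight=1, bias=0) as `nn.LayerNorm`:

```python
import torch
from triformer import TritonLayerNorm

hidden_dim = 512
x = torch.randn(32, 64, hidden_dim).cuda()

triton_ln = TritonLayerNorm(hidden_dim).cuda()
torch_ln = torch.nn.LayerNorm(hidden_dim).cuda()

# With matching defaults, the two outputs should agree up to float tolerance.
print(torch.allclose(triton_ln(x), torch_ln(x), atol=1e-4))
```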
```python
import torch
from triformer import TritonSoftmax

# Raw attention logits; a (batch, heads, seq_len, seq_len) shape is assumed here for illustration
attention_scores = torch.randn(32, 8, 64, 64).cuda()

# Standard Softmax
softmax = TritonSoftmax(is_causal=False).cuda()
output = softmax(attention_scores)

# Causal Softmax (for decoder attention)
causal_softmax = TritonSoftmax(is_causal=True).cuda()
causal_output = causal_softmax(attention_scores)
```
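For reference, the causal variant masks out the positions a query is not allowed to attend to (entries above the main diagonal of the score matrix) before normalizing. A plain-PyTorch sketch of that behaviour, assuming the softmax runs over the last dimension of a square score matrix (the `causal_softmax_reference` helper is illustrative, not part of the library):

```python
import torch

def causal_softmax_reference(scores: torch.Tensor) -> torch.Tensor:
    # Mask future positions (column index > row index) with -inf,
    # then apply an ordinary softmax over the last dimension.
    seq_len = scores.size(-1)
    mask = torch.ones(seq_len, seq_len, device=scores.device).triu(1).bool()
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
```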
```python
import torch
from triformer import TritonDropout

batch_size, seq_len, hidden_dim = 32, 64, 512
x = torch.ones(batch_size, seq_len, hidden_dim).cuda()
output = TritonDropout.apply(x, dropout_prob=0.5, seed=42)  # reproducible mask via seed
```
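Assuming the seed argument fully determines the dropout mask (which the explicit `seed=42` above suggests), the same seed should reproduce the same mask while a different seed should not:

```python
import torch
from triformer import TritonDropout

x = torch.ones(32, 64, 512).cuda()
a = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
b = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
c = TritonDropout.apply(x, dropout_prob=0.5, seed=123)

print(torch.equal(a, b))  # same seed -> expected True
print(torch.equal(a, c))  # different seed -> expected False
```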
```python
from triformer import TritonCrossEntropyLoss

criterion = TritonCrossEntropyLoss(
    pad_token_id=0,
    reduction='mean',
    n_chunks=1
).cuda()

# logits: model outputs over the vocabulary, targets: integer token ids
loss = criterion(logits, targets)
```
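The `pad_token_id` argument suggests that padded target positions are excluded from the loss, much like `ignore_index` in `torch.nn.CrossEntropyLoss`. Under that assumption, and assuming flattened `(N, vocab_size)` logits with `(N,)` integer targets (both are assumptions made here for illustration), the result can be sanity-checked against the PyTorch reference:

```python
import torch
from triformer import TritonCrossEntropyLoss

vocab_size = 100
logits = torch.randn(8, vocab_size).cuda()
targets = torch.randint(1, vocab_size, (8,)).cuda()
targets[-2:] = 0  # treat the last two positions as padding

triton_ce = TritonCrossEntropyLoss(pad_token_id=0, reduction='mean', n_chunks=1).cuda()
torch_ce = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='mean')

# If pad_token_id behaves like ignore_index, these two values should match closely.
print(triton_ce(logits, targets).item())
print(torch_ce(logits, targets).item())
```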
You can now try out the GPT-2 architecture in the examples directory.
All benchmarks were conducted on NVIDIA L40s GPUs with float16.
*Benchmark tables (Forward | Backward | Combined) will be re-run and posted here.*
Our cross entropy implementation achieves significant memory reduction through:

- **In-Place Gradient Computation**
  - Reuses the logits tensor for gradient storage
  - ~2x memory reduction vs the PyTorch implementation
  - Optimal for large vocabulary sizes (30k-50k tokens)
- **Micro-batch Processing** (see the sketch below)
  - Configurable chunk size for a memory/compute trade-off
  - Enables larger batch sizes with limited GPU memory
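To make the micro-batch idea concrete, here is a minimal plain-PyTorch sketch of chunked cross entropy: the logits are consumed in slices so that only one slice's softmax statistics are materialized at a time. The `chunked_cross_entropy` helper and its handling of `n_chunks`/`pad_token_id` are illustrative only, not the Triton kernel itself.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(logits, targets, n_chunks=4, pad_token_id=0):
    # Process the (N, vocab) logits in n_chunks slices along the batch
    # dimension, accumulating the summed loss and the non-pad token count.
    total_loss = logits.new_zeros(())
    total_tokens = 0
    for logit_chunk, target_chunk in zip(logits.chunk(n_chunks), targets.chunk(n_chunks)):
        mask = target_chunk != pad_token_id
        if mask.any():
            total_loss = total_loss + F.cross_entropy(
                logit_chunk[mask], target_chunk[mask], reduction="sum"
            )
            total_tokens += int(mask.sum())
    return total_loss / max(total_tokens, 1)
```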
```bash
# Clone the repository
git clone https://github.com/dame-cell/Triformer.git
cd Triformer/tests

# Install dependencies
pip install -U triformer

# Run tests
pytest test_layernorm.py
pytest test_softmax.py
pytest test_dropout.py
pytest test_cross_entropy.py
```
- Language-only Transformer library
- Core Operations:
  - LayerNorm
  - Softmax
  - Dropout
  - Cross Entropy Loss
- Llama2 Transformer architecture:
  - RMSNorm
  - RoPE
  - SwiGLU (still working on a better, more efficient version)
This project is licensed under the MIT License - see the LICENSE file for details.