
Triformer


High-performance Transformer components accelerated with Triton CUDA kernels


Overview

Triformer is a high-performance deep learning library that implements transformer components using Triton kernels for efficient CUDA acceleration.

Note: It is not the best, but it works :D

Features

  • 🚀 Highly optimized CUDA kernels via Triton
  • 📊 Significant performance improvements over PyTorch implementations
  • 🧮 Memory-efficient operations
  • 🔧 Drop-in replacements for common PyTorch modules
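
For example, here is a minimal sketch of the drop-in idea (FeedForwardBlock is a made-up module for illustration; it assumes TritonLayerNorm takes the same constructor argument as torch.nn.LayerNorm, as shown in the Components section below):

import torch.nn as nn
from triformer import TritonLayerNorm

class FeedForwardBlock(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # was: self.norm = nn.LayerNorm(hidden_dim)
        self.norm = TritonLayerNorm(hidden_dim)  # drop-in swap
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        return self.proj(self.norm(x))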

Installation

pip install -U triformer

Components

Layer Normalization

import torch
from triformer import TritonLayerNorm

batch_size, seq_len, hidden_dim = 32, 64, 512
x = torch.randn(batch_size, seq_len, hidden_dim).cuda()

layer_norm = TritonLayerNorm(hidden_dim).cuda()
output = layer_norm(x)
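
A quick way to sanity-check the output (a hypothetical snippet, assuming TritonLayerNorm uses the standard weight=1, bias=0 initialization of torch.nn.LayerNorm):

# Compare against PyTorch's reference implementation (tolerances are arbitrary)
ref = torch.nn.LayerNorm(hidden_dim).cuda()
torch.testing.assert_close(output, ref(x), rtol=1e-3, atol=1e-3)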

Softmax

import torch
from triformer import TritonSoftmax

# Example attention scores, shape (batch, heads, seq_len, seq_len)
attention_scores = torch.randn(batch_size, 8, seq_len, seq_len).cuda()

# Standard Softmax
softmax = TritonSoftmax(is_causal=False).cuda()
output = softmax(attention_scores)

# Causal Softmax (for decoder attention)
causal_softmax = TritonSoftmax(is_causal=True).cuda()
causal_output = causal_softmax(attention_scores)
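
To see what the causal variant computes, one hedged reference is a manually masked torch.softmax (this assumes TritonSoftmax masks future positions over the last two dimensions):

# Build an upper-triangular mask of future positions and compare against torch.softmax
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device='cuda'), diagonal=1)
ref = torch.softmax(attention_scores.masked_fill(future, float('-inf')), dim=-1)
torch.testing.assert_close(causal_output, ref, rtol=1e-3, atol=1e-3)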

Dropout

from triformer import TritonDropout

x = torch.ones(batch_size, seq_len, hidden_dim).cuda()
output = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
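
Because the mask is driven by an explicit seed, the same seed should reproduce the same output (a hypothetical check, assuming the seed fully determines the dropout mask):

# Identical seeds should produce identical masks (assumption); different seeds produce different masks
out_a = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
out_b = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
assert torch.equal(out_a, out_b)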

Cross Entropy Loss

import torch
from triformer import TritonCrossEntropyLoss

criterion = TritonCrossEntropyLoss(
    pad_token_id=0,
    reduction='mean',
    n_chunks=1
).cuda()

# Example inputs: flattened (tokens, vocab) logits and matching integer targets (shapes assumed)
logits = torch.randn(batch_size * seq_len, 32000).cuda()
targets = torch.randint(0, 32000, (batch_size * seq_len,)).cuda()

loss = criterion(logits, targets)
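
If pad_token_id is treated like PyTorch's ignore_index (an assumption, not stated above), the loss should track the stock implementation:

import torch.nn.functional as F

# Hypothetical comparison; assumes pad_token_id behaves like ignore_index
ref_loss = F.cross_entropy(logits.float(), targets, ignore_index=0, reduction='mean')
print(loss.item(), ref_loss.item())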

Examples

You can now try out the GPT2 architecture in the examples directory.

Performance Benchmarks

All benchmarks were conducted on NVIDIA L40s GPUs with float16.

Layer Normalization Performance

[Plots: LayerNorm forward, backward, and combined benchmarks]

Softmax Performance

[Plots: Softmax forward, backward, and combined benchmarks]

RMS Normalization Performance

Benchmarks for this component will be rerun and published here.

SwiGLU Performance

Benchmarks for this component will be rerun and published here.

Memory Efficiency (Cross Entropy Loss)

Our cross entropy implementation achieves significant memory reduction through two techniques (sketched in code after this list):

  1. In-Place Gradient Computation

    • Reuses logits tensor for gradient storage
    • ~2x memory reduction vs. the PyTorch implementation
    • Optimal for large vocabulary sizes (30k-50k tokens)
  2. Micro-batch Processing

    • Configurable chunk size for memory-compute trade-off
    • Enables larger batch sizes with limited GPU memory
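
The sketch below illustrates both ideas in plain PyTorch: the loss and gradient are computed chunk by chunk, and the gradient is written back into the logits buffer. It is a conceptual illustration only, not the library's Triton kernel; the function name and chunking layout are made up for the example, and logits is assumed to be a 2D (tokens, vocab) tensor with a matching 1D targets tensor.

import torch

def chunked_cross_entropy_(logits, targets, pad_token_id=0, n_chunks=4):
    """Conceptual sketch only: mean CE loss, writing d(loss)/d(logits) back into `logits`."""
    total_loss = logits.new_zeros((), dtype=torch.float32)
    total_tokens = 0
    for logit_chunk, target_chunk in zip(logits.chunk(n_chunks), targets.chunk(n_chunks)):
        idx = torch.arange(target_chunk.numel(), device=logit_chunk.device)
        valid = target_chunk != pad_token_id
        # Softmax is materialized one chunk at a time, so peak memory scales with the chunk size
        probs = torch.softmax(logit_chunk.float(), dim=-1)
        token_loss = -probs[idx, target_chunk].clamp_min(1e-20).log()
        total_loss += (token_loss * valid).sum()
        total_tokens += int(valid.sum())
        # dL/dlogits = softmax(logits) - one_hot(targets); padding rows get zero gradient
        probs[idx, target_chunk] -= 1.0
        probs[~valid] = 0.0
        # .chunk() returns views, so this writes the gradient back into the original logits buffer
        logit_chunk.copy_(probs.to(logit_chunk.dtype))
    logits.div_(max(total_tokens, 1))  # scale stored gradients for the 'mean' reduction
    return total_loss / max(total_tokens, 1)

The real implementation presumably fuses these steps into Triton kernels; the point here is only that no full-batch softmax or separate gradient tensor is ever allocated.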

[Plot: Memory usage comparison]

Testing

# Clone the repository
git clone https://github.com/dame-cell/Triformer.git
cd Triformer/tests

# Install dependencies
pip install -U triformer pytest

# Run tests
pytest test_layernorm.py
pytest test_softmax.py
pytest test_dropout.py
pytest test_cross_entropy.py

Roadmap

  • Language-only Transformer library
  • Core Operations:
    • LayerNorm
    • Softmax
    • Dropout
    • Cross Entropy Loss
  • Llama2 Transformer architecture
    • RMSNorm
    • RoPE
    • SwiGLU (still working on a better, more efficient implementation)

License

This project is licensed under the MIT License - see the LICENSE file for details.