
Triformer


High-performance Transformer components accelerated with Triton CUDA kernels


Overview

Triformer is a high-performance deep learning library that implements transformer components using Triton kernels for efficient CUDA acceleration.

Note: It is not the best, but it works :D

Features

  • 🚀 Highly optimized CUDA kernels via Triton
  • 📊 Significant performance improvements over PyTorch implementations
  • 🧮 Memory-efficient operations
  • 🔧 Drop-in replacements for common PyTorch modules
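
For example, here is a minimal sketch of the drop-in idea (FeedForwardBlock is a made-up module for illustration; it assumes TritonLayerNorm takes the same constructor argument as torch.nn.LayerNorm, as shown in the Components section below):

import torch.nn as nn
from triformer import TritonLayerNorm

class FeedForwardBlock(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # was: self.norm = nn.LayerNorm(hidden_dim)
        self.norm = TritonLayerNorm(hidden_dim)  # drop-in swap
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        return self.proj(self.norm(x))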

Installation

pip install -U triformer

Components

Layer Normalization

import torch
from triformer import TritonLayerNorm

batch_size, seq_len, hidden_dim = 32, 64, 512
x = torch.randn(batch_size, seq_len, hidden_dim).cuda()

layer_norm = TritonLayerNorm(hidden_dim).cuda()
output = layer_norm(x)
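
A quick way to sanity-check the output (a hypothetical snippet, assuming TritonLayerNorm uses the standard weight=1, bias=0 initialization of torch.nn.LayerNorm):

# Compare against PyTorch's reference implementation (tolerances are arbitrary)
ref = torch.nn.LayerNorm(hidden_dim).cuda()
torch.testing.assert_close(output, ref(x), rtol=1e-3, atol=1e-3)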

Softmax

import torch
from triformer import TritonSoftmax

# Example attention scores, shape (batch, heads, seq_len, seq_len)
attention_scores = torch.randn(batch_size, 8, seq_len, seq_len).cuda()

# Standard Softmax
softmax = TritonSoftmax(is_causal=False).cuda()
output = softmax(attention_scores)

# Causal Softmax (for decoder attention)
causal_softmax = TritonSoftmax(is_causal=True).cuda()
causal_output = causal_softmax(attention_scores)
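
To see what the causal variant computes, one hedged reference is a manually masked torch.softmax (this assumes TritonSoftmax masks future positions over the last two dimensions):

# Build an upper-triangular mask of future positions and compare against torch.softmax
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device='cuda'), diagonal=1)
ref = torch.softmax(attention_scores.masked_fill(future, float('-inf')), dim=-1)
torch.testing.assert_close(causal_output, ref, rtol=1e-3, atol=1e-3)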

Dropout

from triformer import TritonDropout

x = torch.ones(batch_size, seq_len, hidden_dim).cuda()
output = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
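
Because the mask is driven by an explicit seed, the same seed should reproduce the same output (a hypothetical check, assuming the seed fully determines the dropout mask):

# Identical seeds should produce identical masks (assumption); different seeds produce different masks
out_a = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
out_b = TritonDropout.apply(x, dropout_prob=0.5, seed=42)
assert torch.equal(out_a, out_b)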

Cross Entropy Loss

import torch
from triformer import TritonCrossEntropyLoss

criterion = TritonCrossEntropyLoss(
    pad_token_id=0,
    reduction='mean',
    n_chunks=1
).cuda()

# Example inputs: flattened (tokens, vocab) logits and matching integer targets (shapes assumed)
logits = torch.randn(batch_size * seq_len, 32000).cuda()
targets = torch.randint(0, 32000, (batch_size * seq_len,)).cuda()

loss = criterion(logits, targets)
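
If pad_token_id is treated like PyTorch's ignore_index (an assumption, not stated above), the loss should track the stock implementation:

import torch.nn.functional as F

# Hypothetical comparison; assumes pad_token_id behaves like ignore_index
ref_loss = F.cross_entropy(logits.float(), targets, ignore_index=0, reduction='mean')
print(loss.item(), ref_loss.item())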

Examples

You can now try out the GPT2 architecture in the examples directory.

Performance Benchmarks

All benchmarks were conducted on NVIDIA L40s GPUs with float16.

Layer Normalization Performance

[Plots: LayerNorm forward, backward, and combined benchmarks]

Softmax Performance

[Plots: Softmax forward, backward, and combined benchmarks]

RMS Normalization Performance

Benchmarks for this component will be rerun and published here.

SwiGLU Performance

Benchmarks for this component will be rerun and published here.

Memory Efficiency (Cross Entropy Loss)

Our cross entropy implementation achieves significant memory reduction through two techniques (sketched in code after this list):

  1. In-Place Gradient Computation

    • Reuses logits tensor for gradient storage
    • ~2x memory reduction vs. the PyTorch implementation
    • Optimal for large vocabulary sizes (30k-50k tokens)
  2. Micro-batch Processing

    • Configurable chunk size for memory-compute trade-off
    • Enables larger batch sizes with limited GPU memory
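
The sketch below illustrates both ideas in plain PyTorch: the loss and gradient are computed chunk by chunk, and the gradient is written back into the logits buffer. It is a conceptual illustration only, not the library's Triton kernel; the function name and chunking layout are made up for the example, and logits is assumed to be a 2D (tokens, vocab) tensor with a matching 1D targets tensor.

import torch

def chunked_cross_entropy_(logits, targets, pad_token_id=0, n_chunks=4):
    """Conceptual sketch only: mean CE loss, writing d(loss)/d(logits) back into `logits`."""
    total_loss = logits.new_zeros((), dtype=torch.float32)
    total_tokens = 0
    for logit_chunk, target_chunk in zip(logits.chunk(n_chunks), targets.chunk(n_chunks)):
        idx = torch.arange(target_chunk.numel(), device=logit_chunk.device)
        valid = target_chunk != pad_token_id
        # Softmax is materialized one chunk at a time, so peak memory scales with the chunk size
        probs = torch.softmax(logit_chunk.float(), dim=-1)
        token_loss = -probs[idx, target_chunk].clamp_min(1e-20).log()
        total_loss += (token_loss * valid).sum()
        total_tokens += int(valid.sum())
        # dL/dlogits = softmax(logits) - one_hot(targets); padding rows get zero gradient
        probs[idx, target_chunk] -= 1.0
        probs[~valid] = 0.0
        # .chunk() returns views, so this writes the gradient back into the original logits buffer
        logit_chunk.copy_(probs.to(logit_chunk.dtype))
    logits.div_(max(total_tokens, 1))  # scale stored gradients for the 'mean' reduction
    return total_loss / max(total_tokens, 1)

The real implementation presumably fuses these steps into Triton kernels; the point here is only that no full-batch softmax or separate gradient tensor is ever allocated.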

[Plot: Memory usage comparison]

Testing

# Clone the repository
git clone https://github.com/dame-cell/Triformer.git
cd Triformer/tests

# Install dependencies
pip install -U triformer pytest

# Run tests
pytest test_layernorm.py
pytest test_softmax.py
pytest test_dropout.py
pytest test_cross_entropy.py

Roadmap

  • Language-only Transformer library
  • Core Operations:
    • LayerNorm
    • Softmax
    • Dropout
    • Cross Entropy Loss
  • Llama2 Transformer architecture
    • RMSNorm
    • RoPE
    • SwiGLU (still working on a better, more efficient implementation)

License

This project is licensed under the MIT License - see the LICENSE file for details.