Releases: dame-cell/Triformer
v3.0.2
Triformer
pip install -U triformer
TritonCrossEntropyLoss (New! 🎉)
The Triton implementation of Cross Entropy Loss is optimized for both performance and memory efficiency. It fuses the forward and backward passes into a single kernel, reducing memory overhead. A key feature is in-place gradient computation: the logits tensor is reused instead of allocating new memory, giving roughly 2x memory savings over standard implementations. Chunked processing lets it handle large batches by splitting them into smaller pieces, the log-sum-exp trick keeps the computation numerically stable, and padding tokens are handled through an ignore_index parameter. This makes it particularly efficient for the large vocabulary sizes (30k-50k tokens) common in language models.
from triformer import TritonCrossEntropyLoss
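A minimal usage sketch, assuming the loss follows the nn.CrossEntropyLoss calling convention (logits of shape [tokens, vocab] and integer targets) and exposes the ignore_index parameter mentioned above; the exact constructor arguments may differ:

import torch
from triformer import TritonCrossEntropyLoss

vocab_size = 32000
# ignore_index and the call signature are assumed to mirror nn.CrossEntropyLoss
criterion = TritonCrossEntropyLoss(ignore_index=-100)
logits = torch.randn(8 * 128, vocab_size, device="cuda", requires_grad=True)
targets = torch.randint(0, vocab_size, (8 * 128,), device="cuda")
loss = criterion(logits, targets)
loss.backward()  # per the notes above, gradients reuse the logits buffer in place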
v3.0.1
Triformer
pip install -U triformer
TritonDropout (New! 🎉)
A fast and memory-efficient dropout implementation using Triton's parallel processing capabilities. Features deterministic dropout patterns with seed control and optimized memory usage through block processing. Benchmarks show comparable or better training convergence compared to PyTorch's native dropout implementation.
Example usage:
import torch
from triformer import TritonDropout

x = torch.randn(32, 512, device="cuda")  # example input tensor
# Basic usage
output = TritonDropout.apply(x, p=0.5)
# With a deterministic seed for reproducible dropout masks
output = TritonDropout.apply(x, p=0.5, seed=42)
v3.0.0
TritonLayerNorm:
Implemented Layer Normalization in Triton, designed to improve the stability and performance of transformer models. The GPU-optimized kernel enables faster training and inference.
TritonSoftmax:
Introduced an efficient Softmax implementation in Triton. This addition enables more effective processing of output layers in neural networks, particularly in tasks requiring probabilistic outputs.
Usage
from triformer import TritonLayerNorm
from triformer import TritonSoftmax
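A minimal usage sketch, assuming both classes are nn.Module-style drop-ins for nn.LayerNorm and nn.Softmax; if they are instead exposed as autograd Functions (like TritonDropout.apply above), the calls would go through .apply, and the constructor arguments may differ:

import torch
from triformer import TritonLayerNorm, TritonSoftmax

hidden = torch.randn(16, 128, 768, device="cuda")
# constructor arguments assumed to mirror nn.LayerNorm / nn.Softmax
layer_norm = TritonLayerNorm(768)
softmax = TritonSoftmax(dim=-1)
normed = layer_norm(hidden)
probs = softmax(normed)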
1.3.4
1.1.0
This release completes the implementation of our custom linear layer by adding a fully-functional backward pass, enabling end-to-end training.
What's New
- Complete Backward Pass Implementation: Added three specialized Triton kernels for efficient backpropagation (see the reference sketch after this list):
  - backward_input_kernel: Computes gradients with respect to the input
  - backward_weight_kernel: Computes gradients with respect to the weights
  - fused_relu_bias_backward_kernel: Fused computation of bias gradients and the ReLU backward pass
- Performance Optimizations:
  - Kernel fusion to minimize memory operations
  - Autotuned configurations for optimal performance
  - Block-based computation patterns for efficient GPU utilization
  - Mixed precision (float32 for computation, float16 for storage)
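For reference, the gradients these three kernels compute correspond to the standard formulas for a fused linear + ReLU layer y = relu(x @ W.T + b). The sketch below is plain PyTorch (not the Triton kernels themselves), with illustrative shapes; the actual kernels fuse and tile this work on the GPU:

import torch

x = torch.randn(32, 512)          # input
W = torch.randn(256, 512)         # weight (out_features x in_features)
b = torch.randn(256)              # bias
grad_out = torch.randn(32, 256)   # upstream gradient dL/dy

pre_act = x @ W.t() + b
relu_mask = (pre_act > 0).to(grad_out.dtype)
grad_pre = grad_out * relu_mask      # fused_relu_bias_backward_kernel: ReLU backward...
grad_bias = grad_pre.sum(dim=0)      # ...and bias gradient
grad_input = grad_pre @ W            # backward_input_kernel
grad_weight = grad_pre.t() @ x       # backward_weight_kernel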
Previous Release
- Forward pass implementation with fused linear transformation and ReLU activation
Technical Details
The backward pass maintains the same performance philosophy as the forward pass:
- Leverages Triton for GPU acceleration
- Uses autotuning to optimize kernel configurations
- Implements efficient memory access patterns
- Maintains numerical stability through careful handling of data types
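As an illustration of the float32-compute / float16-storage pattern listed above, a minimal standalone Triton kernel (not taken from the library) could look like this: inputs are loaded as float16, upcast for the arithmetic, and downcast only when written back to memory.

import torch
import triton
import triton.language as tl

@triton.jit
def scale_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # Load float16 inputs and upcast to float32 for the arithmetic
    x = tl.load(x_ptr + offsets, mask=mask).to(tl.float32)
    y = tl.load(y_ptr + offsets, mask=mask).to(tl.float32)
    result = x * 2.0 + y  # computation happens in float32
    # Downcast back to float16 only when writing to memory
    tl.store(out_ptr + offsets, result.to(tl.float16), mask=mask)

x = torch.randn(4096, device="cuda", dtype=torch.float16)
y = torch.randn(4096, device="cuda", dtype=torch.float16)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
scale_add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)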
Usage
The layer can now be used as a drop-in replacement for nn.Linear in training scenarios:
import torch
import torch.nn as nn
from triformer import TritonLinear  # import path assumed to mirror the other examples

layer = TritonLinear(in_features=512, out_features=256).cuda()
optimizer = torch.optim.Adam(layer.parameters())
criterion = nn.MSELoss()  # any loss works; MSE chosen for illustration
input_tensor = torch.randn(32, 512, device="cuda")
target = torch.randn(32, 256, device="cuda")
# Forward pass
output = layer(input_tensor)
loss = criterion(output, target)
# Backward pass (now supported!)
loss.backward()
optimizer.step()