Releases: dame-cell/Triformer

v3.0.2

02 Nov 14:44

Triformer

pip install -U triformer

TritonCrossEntropyLoss (New! 🎉)
The Triton implementation of Cross Entropy Loss is optimized for both performance and memory efficiency. It combines the forward and backward passes into a single CUDA kernel, which reduces memory overhead. A key feature is its in-place gradient computation, reusing the logits tensor instead of allocating new memory, resulting in about 2x memory savings compared to standard implementations. The code also supports chunked processing, allowing it to handle large batches by processing data in smaller pieces. For numerical stability, it implements the log-sum-exp trick, and it properly handles padding tokens through an ignore_index parameter. This makes it particularly efficient for large vocabulary sizes (30k-50k tokens) commonly found in language models.

from triformer import TritonCrossEntropyLoss
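
A minimal usage sketch (an assumption, not the library's documented API) is shown below; it treats TritonCrossEntropyLoss as an nn.CrossEntropyLoss-style criterion and assumes ignore_index is passed to the constructor:

import torch
from triformer import TritonCrossEntropyLoss

# Assumed criterion-style interface; constructor arguments may differ.
criterion = TritonCrossEntropyLoss(ignore_index=-100)  # tokens labeled -100 are treated as padding

vocab_size = 32000  # large vocabularies (30k-50k tokens) are the target use case
logits = torch.randn(4096, vocab_size, device="cuda", requires_grad=True)
targets = torch.randint(0, vocab_size, (4096,), device="cuda")

loss = criterion(logits, targets)
loss.backward()  # per the release notes, gradients reuse the logits buffer in place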

v3.0.1

01 Nov 14:53

Triformer

pip install -U triformer

TritonDropout (New! 🎉)

A fast, memory-efficient dropout implementation built on Triton's parallel processing. It produces deterministic dropout patterns via seed control and keeps memory usage low through block-wise processing. Benchmarks show training convergence comparable to or better than PyTorch's native dropout.

Example usage:

from triformer import TritonDropout

# Basic usage
output = TritonDropout.apply(x, p=0.5)

# With deterministic seed
output = TritonDropout.apply(x, p=0.5, seed=42)
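
Because the dropout mask is derived from the seed, two calls with the same seed should drop the same elements. A small sanity check, reusing the calling convention shown above (a sketch, not part of the official examples):

import torch
from triformer import TritonDropout

x = torch.randn(1024, 1024, device="cuda")

# Same seed -> same dropout pattern, so the two outputs match exactly
out_a = TritonDropout.apply(x, p=0.5, seed=42)
out_b = TritonDropout.apply(x, p=0.5, seed=42)
assert torch.equal(out_a, out_b)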

v3.0.0

28 Oct 14:29

TritonLayerNorm:
Layer Normalization implemented in Triton, designed to improve the stability and performance of transformer models. The optimized kernel speeds up both training and inference.

TritonSoftmax:
An efficient Softmax implementation in Triton that speeds up the computation of probabilistic outputs, such as network output layers.

Usage

from triformer import TritonLayerNorm
from triformer import TritonSoftmax 
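
A short usage sketch follows. It assumes both classes expose nn.Module-style interfaces (TritonLayerNorm taking the normalized feature size, TritonSoftmax operating over the last dimension); the actual constructor arguments may differ:

import torch
from triformer import TritonLayerNorm, TritonSoftmax

hidden_size = 768
x = torch.randn(16, 128, hidden_size, device="cuda")

# Assumed nn.Module-style usage
layer_norm = TritonLayerNorm(hidden_size).cuda()
softmax = TritonSoftmax()

normed = layer_norm(x)   # normalize over the hidden dimension
probs = softmax(normed)  # probabilities over the last dimension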

1.3.4

26 Oct 09:50

Changed the data type to float32.

1.1.0

26 Oct 10:31

This release completes the implementation of our custom linear layer by adding a fully-functional backward pass, enabling end-to-end training.

What's New

  • Complete Backward Pass Implementation: Added three specialized Triton kernels for efficient backpropagation (their reference semantics are sketched after this list):

    • backward_input_kernel: Computes gradients with respect to input
    • backward_weight_kernel: Computes gradients with respect to weights
    • fused_relu_bias_backward_kernel: Fused computation of bias gradients and ReLU backward pass
  • Performance Optimizations:

    • Kernel fusion to minimize memory operations
    • Autotuned configurations for optimal performance
    • Block-based computation patterns for efficient GPU utilization
    • Mixed precision (float32 for computation, float16 for storage)
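
For reference, the plain-PyTorch sketch below spells out the gradients these kernels compute for a fused linear + ReLU layer (forward: y = relu(x @ weight.T + bias)). It is illustrative only and is not the Triton code itself:

import torch

def reference_backward(x, weight, pre_act, grad_output):
    # ReLU backward: zero gradients where the pre-activation was <= 0
    grad_pre = grad_output * (pre_act > 0).to(grad_output.dtype)  # fused_relu_bias_backward_kernel (ReLU part)
    grad_bias = grad_pre.sum(dim=0)                               # fused_relu_bias_backward_kernel (bias part)
    grad_input = grad_pre @ weight                                # backward_input_kernel
    grad_weight = grad_pre.t() @ x                                # backward_weight_kernel
    return grad_input, grad_weight, grad_bias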

Previous Release

  • Forward pass implementation with fused linear transformation and ReLU activation

Technical Details

The backward pass maintains the same performance philosophy as the forward pass:

  • Leverages Triton for GPU acceleration
  • Uses autotuning to optimize kernel configurations
  • Implements efficient memory access patterns
  • Maintains numerical stability through careful handling of data types

Usage

The layer can now be used as a drop-in replacement for nn.Linear in training scenarios:

import torch
from triformer import TritonLinear

layer = TritonLinear(in_features=512, out_features=256)
optimizer = torch.optim.Adam(layer.parameters())
criterion = torch.nn.CrossEntropyLoss()  # any differentiable loss works here

# Forward pass (input_tensor and target are your CUDA training batch)
output = layer(input_tensor)
loss = criterion(output, target)

# Backward pass (now supported!)
loss.backward()
optimizer.step()