v3.0.2

@dame-cell dame-cell released this 02 Nov 14:44
· 53 commits to main since this release

Triformer

pip install -U triformer

TritonCrossEntropyLoss (New! 🎉)
TritonCrossEntropyLoss is a Triton implementation of cross-entropy loss optimized for both speed and memory use. It fuses the forward and backward passes into a single GPU kernel, which reduces memory overhead. The gradient is computed in place by overwriting the logits tensor instead of allocating a new one, giving roughly 2x memory savings compared to standard implementations. It also supports chunked processing, so large batches are handled in smaller pieces. For numerical stability it uses the log-sum-exp trick, and it skips padding tokens through an ignore_index parameter. Together these make it particularly efficient for the large vocabularies (30k-50k tokens) common in language models.
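To make the ideas above concrete, here is a minimal pure-Python sketch of the same techniques: the log-sum-exp trick, in-place gradients written over the logits, chunked iteration, and an ignore index. It is not Triformer's kernel. The function name `cross_entropy_inplace`, the `chunk_size` argument, the default `ignore_index=-100`, and the choice to average over non-ignored rows are all assumptions made for illustration.

```python
import math

def cross_entropy_inplace(logits, targets, ignore_index=-100, chunk_size=2):
    """Hypothetical reference: returns the mean loss over non-ignored rows
    and overwrites `logits` with d(loss)/d(logits)."""
    n_valid = sum(1 for t in targets if t != ignore_index)
    total = 0.0
    # Walk the batch in chunks, as a kernel launcher might for large batches.
    for start in range(0, len(logits), chunk_size):
        for i in range(start, min(start + chunk_size, len(logits))):
            row, t = logits[i], targets[i]
            if t == ignore_index:
                # Padding rows add no loss and receive a zero gradient.
                row[:] = [0.0] * len(row)
                continue
            # Log-sum-exp trick: subtract the row max before exponentiating.
            m = max(row)
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[t]
            # Overwrite the logits with (softmax - one_hot) / n_valid.
            for j in range(len(row)):
                row[j] = math.exp(row[j] - lse) / n_valid
            row[t] -= 1.0 / n_valid
    return total / n_valid if n_valid else 0.0

# Logits of 1000 would overflow a naive exp(); the shifted version is fine.
logits = [[1000.0, 1000.0], [0.0, 0.0], [5.0, 1.0]]
targets = [0, 1, -100]  # the last row is padding
loss = cross_entropy_inplace(logits, targets)
print(round(loss, 4))  # 0.6931, i.e. log(2)
print(logits)          # now holds the gradients, not the original logits
```

Because the gradient replaces the logits, no second tensor of shape (batch, vocab) is needed, which is where the memory saving described above would come from. The trade-off is that the original logits are gone after the call.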

from triformer import TritonCrossEntropyLoss