Fix triton rotary kernel for training on context lengths > 65k #386

akshaykalkunte · 2025-11-13T03:37:30Z

✨ Description

Long-context training with seq_len > 65k fails in the Triton RoPE implementation for the following reason:

CUDA allows up to 2^31 - 1 blocks along gridDim.x, but only 65535 (16 bits) along gridDim.y and gridDim.z.
The kernel receives input of shape [bsz, seq_len, head_size] and maps seq_len to the y dimension of the launch grid, so the launch goes out of bounds once seq_len > 65535.

The fix is to reshape [bsz, seq_len, head_size] -> [bsz * seq_len, head_size] before passing the input to the Triton kernel, which moves the large dimension onto gridDim.x.
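
To make the limit concrete, here is a minimal pure-Python sketch of the grid arithmetic. The per-axis limits are the documented CUDA maxima; the two layout helpers (`grid_unflattened`, `grid_flat`) are hypothetical names used only for illustration and assume one program per row, which may not match the kernel's actual launch configuration.

```python
# CUDA launch-grid limits per axis (x, y, z) for compute capability 3.0 and later.
MAX_GRID = (2**31 - 1, 65535, 65535)


def fits_cuda_grid(grid):
    """Return True if every grid axis is within the CUDA per-axis limit."""
    padded = tuple(grid) + (1,) * (3 - len(grid))
    return all(1 <= size <= limit for size, limit in zip(padded, MAX_GRID))


def grid_unflattened(bsz, seq_len):
    """Hypothetical pre-fix layout: batch on x, sequence position on y."""
    return (bsz, seq_len)


def grid_flat(bsz, seq_len):
    """Hypothetical post-fix layout: one program per flattened row, all on x."""
    return (bsz * seq_len,)


if __name__ == "__main__":
    for seq_len in (65535, 65536, 131072):
        print(
            seq_len,
            fits_cuda_grid(grid_unflattened(1, seq_len)),
            fits_cuda_grid(grid_flat(1, seq_len)),
        )
```

Under these assumptions the unflattened layout stops fitting at seq_len = 65536, while the flattened layout has headroom up to 2^31 - 1 rows in total.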

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

Fix the Triton rotary kernel.
Update the unit test to cover seq_len > 65k.
Run a short training job with seq_len < 65k to confirm that nothing changed, and another with seq_len > 65k to confirm that the problem is fixed.
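
A check along these lines could confirm that flattening does not change the result. This is a pure-Python sketch, not the PR's actual test: the pairing convention (adjacent elements rotated together), the `base` value, and every function name here are assumptions made for illustration, and the real kernel may differ on each of them.

```python
import math


def rotate_row(row, position, base=10000.0):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by angle position * base**(-2i/d)."""
    d = len(row)
    out = []
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = row[2 * i], row[2 * i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def rotary_3d(x):
    """Reference path: x is a nested list of shape [bsz][seq_len][head_size]."""
    return [[rotate_row(row, pos) for pos, row in enumerate(seq)] for seq in x]


def flatten(x):
    """Reshape [bsz][seq_len][head_size] -> [bsz * seq_len][head_size]."""
    return [row for seq in x for row in seq]


def rotary_flat(rows, seq_len):
    """Flattened path: the sequence position of row r is r % seq_len."""
    return [rotate_row(row, r % seq_len) for r, row in enumerate(rows)]


def make_input(bsz, seq_len, head_size):
    """Deterministic dummy input so the comparison is reproducible."""
    return [
        [[float(b + s + h) for h in range(head_size)] for s in range(seq_len)]
        for b in range(bsz)
    ]


if __name__ == "__main__":
    seq_len = 65537  # just past the 65535 grid limit
    x = make_input(1, seq_len, 4)
    assert rotary_flat(flatten(x), seq_len) == flatten(rotary_3d(x))
    print("flattened and unflattened paths agree")
```

The key detail is that the flattened path has to recover each row's sequence position (here via `r % seq_len`), since the rotation angle depends on the position rather than on the flattened row index.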

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

📜 I have read and followed the contributing guidelines.
🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
🎉 The functionality is complete, and I have tested the changes.
📝 I have updated the documentation if needed.
⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

🐋 I have updated the Docker configuration or dependencies, if applicable.
🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

🧪 I have added or updated tests to cover my changes.
✔️ New and existing tests pass locally with my changes.
🚦 I have tested these changes on GPUs and verified training stability.
🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

📊 I have run benchmarks where applicable to evaluate the performance impact.
✅ The benchmarks show no performance regression.
🚀 The benchmarks indicate a potential performance improvement.
⚠️ The benchmarks indicate a potential performance degradation.
📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

NA

🗒️ Additional Notes

NA

Fix triton rotary kernel

980117a

tscholak requested a review from jlamypoirier November 13, 2025 15:27


2 participants
