Skip to content

Fix grid dim overflow in DSA backward convert kernel on SM100#331

Open
szluyu99 wants to merge 1 commit into
NVIDIA:developfrom
szluyu99:fix/dsa-bwd-convert-grid-overflow
Open

Fix grid dim overflow in DSA backward convert kernel on SM100#331
szluyu99 wants to merge 1 commit into
NVIDIA:developfrom
szluyu99:fix/dsa-bwd-convert-grid-overflow

Conversation

@szluyu99

@szluyu99 szluyu99 commented Jun 30, 2026

Copy link
Copy Markdown

The convert kernel grid was configured as [1, convert_grid_x, 1], placing the seq-block dimension on grid.y. CUDA caps grid.y/z at 65535, so large mKV.shape[0] / block_seq values trigger invalid configuration argument. grid.x supports up to 2^31-1, so move convert_grid_x to grid.x and update the corresponding block_idx() unpacking in the kernel accordingly. No behavior change for in-range sizes.

Summary by CodeRabbit

  • Bug Fixes
    • Corrected how sparse attention backward work is distributed across compute tiles, improving consistency and correctness during execution.

The convert kernel grid was configured as [1, convert_grid_x, 1],
placing the seq-block dimension on grid.y. CUDA caps grid.y/z at
65535, so large mKV.shape[0] / block_seq values trigger
`invalid configuration argument`. grid.x supports up to 2^31-1, so
move convert_grid_x to grid.x and update the corresponding
block_idx() unpacking in the kernel accordingly. No behavior change
for in-range sizes.
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

In the convert() kernel, the launch grid's first two dimensions are swapped so convert_grid_x occupies the first slot, and the cute.arch.block_idx() unpacking is updated to extract seq_block_idx from the first component instead of the second.

Changes

convert() kernel grid index fix

Layer / File(s) Summary
Grid layout and block_idx swap
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py
convert_grid reordered to place convert_grid_x first, and seq_block_idx now unpacked from the first block_idx() component to match.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

A grid was flipped, a swap so neat,
First becomes first, the indices meet,
seq_block_idx hops to position one,
The convert kernel fix is done!
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main fix: moving the SM100 DSA backward convert kernel grid to avoid dimension overflow.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py (1)

573-577: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the grid/block index coupling.

This fix depends on convert_grid_x staying in grid.x and seq_block_idx continuing to come from block_idx.x. A short inline note here would help prevent a future cleanup from moving the sequence dimension back into grid.y and reintroducing the overflow.

Suggested diff
         convert_grid_x = (mKV.shape[0] + self.block_seq - 1) // self.block_seq
+        # Keep the sequence-block dimension in grid.x: CUDA grid.y/grid.z are capped at 65535.
+        # convert() must therefore read seq_block_idx from block_idx.x.
         convert_grid = [
             convert_grid_x,
             1,
             1,
         ]
@@
         (
+            # grid.x carries the sequence-block index to match convert_grid above.
             seq_block_idx,
             _,
             batch_idx,
         ) = cute.arch.block_idx()

As per path instructions, "Focus on documentation."

Also applies to: 619-622

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py`
around lines 573 - 577, Add a brief inline comment near the convert_grid/launch
setup in dsa_bwd_sm100.py documenting that convert_grid_x must remain in grid.x
and seq_block_idx must continue to be derived from block_idx.x; this coupling is
relied on by the backward kernel indexing, so avoid any future cleanup that
moves the sequence dimension to grid.y. Apply the same note to the related
launch section referenced in the later block as well, keeping the guidance tied
to the existing convert_grid_x and seq_block_idx symbols.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py`:
- Around line 573-577: Add a brief inline comment near the convert_grid/launch
setup in dsa_bwd_sm100.py documenting that convert_grid_x must remain in grid.x
and seq_block_idx must continue to be derived from block_idx.x; this coupling is
relied on by the backward kernel indexing, so avoid any future cleanup that
moves the sequence dimension to grid.y. Apply the same note to the related
launch section referenced in the later block as well, keeping the guidance tied
to the existing convert_grid_x and seq_block_idx symbols.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 53b7a4d0-3b2f-4852-819b-e2c7af70869a

📥 Commits

Reviewing files that changed from the base of the PR and between 8612893 and 55486fb.

📒 Files selected for processing (1)
  • python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant