Fix grid dim overflow in DSA backward convert kernel on SM100#331
Fix grid dim overflow in DSA backward convert kernel on SM100#331szluyu99 wants to merge 1 commit into
Conversation
The convert kernel grid was configured as [1, convert_grid_x, 1], placing the seq-block dimension on grid.y. CUDA caps grid.y/z at 65535, so large mKV.shape[0] / block_seq values trigger `invalid configuration argument`. grid.x supports up to 2^31-1, so move convert_grid_x to grid.x and update the corresponding block_idx() unpacking in the kernel accordingly. No behavior change for in-range sizes.
📝 WalkthroughWalkthroughIn the Changesconvert() kernel grid index fix
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py (1)
573-577: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDocument the grid/block index coupling.
This fix depends on
convert_grid_xstaying ingrid.xandseq_block_idxcontinuing to come fromblock_idx.x. A short inline note here would help prevent a future cleanup from moving the sequence dimension back intogrid.yand reintroducing the overflow.Suggested diff
convert_grid_x = (mKV.shape[0] + self.block_seq - 1) // self.block_seq + # Keep the sequence-block dimension in grid.x: CUDA grid.y/grid.z are capped at 65535. + # convert() must therefore read seq_block_idx from block_idx.x. convert_grid = [ convert_grid_x, 1, 1, ] @@ ( + # grid.x carries the sequence-block index to match convert_grid above. seq_block_idx, _, batch_idx, ) = cute.arch.block_idx()As per path instructions, "Focus on documentation."
Also applies to: 619-622
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py` around lines 573 - 577, Add a brief inline comment near the convert_grid/launch setup in dsa_bwd_sm100.py documenting that convert_grid_x must remain in grid.x and seq_block_idx must continue to be derived from block_idx.x; this coupling is relied on by the backward kernel indexing, so avoid any future cleanup that moves the sequence dimension to grid.y. Apply the same note to the related launch section referenced in the later block as well, keeping the guidance tied to the existing convert_grid_x and seq_block_idx symbols.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py`:
- Around line 573-577: Add a brief inline comment near the convert_grid/launch
setup in dsa_bwd_sm100.py documenting that convert_grid_x must remain in grid.x
and seq_block_idx must continue to be derived from block_idx.x; this coupling is
relied on by the backward kernel indexing, so avoid any future cleanup that
moves the sequence dimension to grid.y. Apply the same note to the related
launch section referenced in the later block as well, keeping the guidance tied
to the existing convert_grid_x and seq_block_idx symbols.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 53b7a4d0-3b2f-4852-819b-e2c7af70869a
📒 Files selected for processing (1)
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py
The convert kernel grid was configured as [1, convert_grid_x, 1], placing the seq-block dimension on grid.y. CUDA caps grid.y/z at 65535, so large mKV.shape[0] / block_seq values trigger
invalid configuration argument. grid.x supports up to 2^31-1, so move convert_grid_x to grid.x and update the corresponding block_idx() unpacking in the kernel accordingly. No behavior change for in-range sizes.Summary by CodeRabbit