[docs/examples] Blackwell tutorial 04 (2SM MMA): fix stale B-operand print annotations (the 2x1SM atom splits B N/2-per-CTA) and document the expect-tx pitfalls that deadlock #3314
Open
cfregly wants to merge 1 commit into
…atom splits B N/2-per-CTA) and warn about the expect-tx deadlock pitfalls The SM100_MMA_F16BF16_2x1SM_SS atom splits the B operand N/2 per CTA across the pair; the inline '// printed:' annotations still showed full-N shapes (((_256,_16),...) where partition_shape_B actually returns ((_128,_16),...) with the tutorial's own TiledMMA, verified on CUTLASS main and 4.2.0). Also fixes the NumMma_M -> NumMma_N comment typos on the B-tensor modes, and adds a warning above the tma_transaction_bytes computation: the tma_partition slice is an offset view (do not multiply by multicast participants) and B is the N/2 slice (do not size from the full tile) -- both mistakes over-expect the barrier, which then never fires. Annotation/comment-only change.
What
Two documentation problems in
examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu, both of which cost us real debugging time on GB300 (one of them as a 25-minute
GPU-pegged mbarrier deadlock, hit independently by two separate kernels):
1. Stale print annotations claim a full-N per-CTA B tile
The SM100_MMA_F16BF16_2x1SM_SS atom splits the B operand N/2-per-CTA across the CTA pair, but the inline // printed: annotations in the tutorial still show full-N shapes. With the tutorial's own TiledMMA (2x1SM 256x256 atom, MmaTiler 256x256x64), this is what partition_shape_B / tile_to_mma_shape actually return today (host-only check, CUTLASS main and 4.2.0 alike; snippet.cu attached):

| Tensor | Current annotation | Actual output |
| --- | --- | --- |
| mma_shape_B | ((_256,_16),_1,_4) | ((_128,_16),_1,_4) |
| sB_layout | Sw<3,4,3> o ... ((_256,_16),_1,_4):((_64,_1),_0,_16) | ... ((_128,_16),_1,_4):((_64,_1),_0,_16) |
| tCsB / tCgB | ((_256,_16),...) | ((_128,_16),...) |
| tBsB | ... o ((_16384,_1)):((_1,_0)) | ((_8192,_1)) |

(Also: the shape comments on tCsB/tCrB read (MmaB, NumMma_M, NumMma_K, Tiles_K); the second mode should be NumMma_N.)

The hardware behavior matches the corrected prints, not the annotations.
On GB300 (sm_103) we confirmed with ncu on a kernel built from this
tutorial: per-CTA smem footprint is A-half 16 KiB + B-half 16 KiB per
stage at a 256-wide N tile (e.g. 64 KiB/CTA at 2 stages; a full-N B would
be 96 KiB and would change the reported Block Limit Shared Mem from 3 to
2 at the 128-col config), and the mbarrier transaction-byte balance only
closes with the halved B slice. Readers who trust the current annotations
over-budget smem 1.5x and mis-derive the expect-tx count (see below).
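
The footprint figures above can be re-derived with a few lines of host-side arithmetic. This is a minimal sketch, not code from the tutorial: the helper `smem_bytes_per_cta` and the constant names are hypothetical, and the 2-byte element size is an assumption based on the F16/BF16 atom.

```cpp
#include <cstddef>

// Hypothetical helper: bytes of shared memory one CTA holds for `stages`
// pipeline stages, given the A and B slice extents that CTA actually stores.
constexpr std::size_t smem_bytes_per_cta(std::size_t a_rows, std::size_t b_rows,
                                         std::size_t tile_k, std::size_t elem_bytes,
                                         std::size_t stages) {
  return stages * (a_rows + b_rows) * tile_k * elem_bytes;
}

constexpr std::size_t KiB = 1024;
constexpr std::size_t kTileM = 256, kTileN = 256, kTileK = 64;  // MmaTiler 256x256x64
constexpr std::size_t kElemBytes = 2;  // assumption: 16-bit A/B elements
constexpr std::size_t kStages = 2;     // the 2-stage example from the text

// 2x1SM split: each CTA keeps M/2 rows of A and N/2 rows of B.
constexpr std::size_t kActual =
    smem_bytes_per_cta(kTileM / 2, kTileN / 2, kTileK, kElemBytes, kStages);

// What the stale annotations imply: M/2 rows of A but all N rows of B.
constexpr std::size_t kStale =
    smem_bytes_per_cta(kTileM / 2, kTileN, kTileK, kElemBytes, kStages);

static_assert(kActual == 64 * KiB, "halved B slice: 64 KiB per CTA at 2 stages");
static_assert(kStale == 96 * KiB, "full-N B: 96 KiB per CTA at 2 stages");
static_assert(2 * kStale == 3 * kActual, "stale figure is 1.5x the real footprint");

// Element counts of the per-CTA B slice, matching the tBsB row of the table.
static_assert((kTileN / 2) * kTileK == 8192, "actual tBsB size");
static_assert(kTileN * kTileK == 16384, "annotated (stale) tBsB size");
```

The static_asserts make the check compile-time only, so the snippet needs neither a GPU nor CUTLASS headers.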
2. The expect-tx formula's two silent-deadlock pitfalls deserve a warning comment
The tutorial's formula is correct:
but nothing warns about the two natural-looking modifications that each
produce a barrier that NEVER fires — i.e. a deterministic kernel hang with
no error, the worst failure mode to debug:
- tma_partition's multicast result is an offset view into the stage buffer, not a shrunken tensor: sizeof(make_tensor_like(slice)) is already the full byte count delivered into this CTA for that operand. Scaling it by the number of multicast participants (e.g. kClusterN when A is multicast across the cluster N-mode) over-expects the barrier. We hit exactly this twice independently: once on a single-SM cluster-2 B-multicast kernel (expect-tx armed 80 KiB vs 48 KiB delivered -> hang, localized via cuda-gdb attach) and once on a (2,2,1) A-multicast variant of this very tutorial pattern (tma_bytes = 2 * (kClusterN * sizeof(tAsA_0) + ...) -> 25-minute GPU-pegged hang; deleting the kClusterN factor fixed it, rel_err 0.0 at every size after).
- B is the N/2 slice per CTA: sizing its bytes from the full tile (as the stale annotations of problem 1 suggest) over-expects by the same mechanism.
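
To make the byte accounting behind both pitfalls concrete, here is a host-only sketch. It is illustrative rather than the tutorial's code: `expect_tx_bytes` and the constants are hypothetical names, the slice sizes assume the 256x256x64 tile with 16-bit elements, and the factor of 2 for the CTA pair follows the formula quoted in the list above.

```cpp
#include <cstddef>

constexpr std::size_t KiB = 1024;

// Hypothetical helper: bytes the leader barrier is armed with per stage when
// each of `ctas_in_pair` CTAs is credited with one A slice and one B slice.
constexpr std::size_t expect_tx_bytes(std::size_t ctas_in_pair,
                                      std::size_t a_slice_bytes,
                                      std::size_t b_slice_bytes) {
  return ctas_in_pair * (a_slice_bytes + b_slice_bytes);
}

// Assumed per-CTA slice sizes: 128 rows x 64 columns x 2 bytes.
constexpr std::size_t kASlice = 16 * KiB;   // M/2 half of A
constexpr std::size_t kBSlice = 16 * KiB;   // N/2 half of B
constexpr std::size_t kBFull  = 32 * KiB;   // full-N B, as the stale prints suggest
constexpr std::size_t kClusterN = 2;        // assumed multicast participants

// What TMA actually delivers to the pair in one stage.
constexpr std::size_t kDelivered = expect_tx_bytes(2, kASlice, kBSlice);

// Pitfall 1: scaling the A slice by the multicast participant count.
constexpr std::size_t kArmedMulticast = expect_tx_bytes(2, kClusterN * kASlice, kBSlice);

// Pitfall 2: sizing B from the full tile instead of the N/2 slice.
constexpr std::size_t kArmedFullB = expect_tx_bytes(2, kASlice, kBFull);

static_assert(kDelivered == 64 * KiB, "correct expect-tx for the pair");
static_assert(kArmedMulticast == 96 * KiB, "over-expects by 32 KiB");
static_assert(kArmedFullB == 96 * KiB, "over-expects by 32 KiB");
static_assert(kArmedMulticast > kDelivered && kArmedFullB > kDelivered,
              "the barrier waits for bytes that never arrive");
```

In both variants the armed count exceeds what is ever delivered, so the remaining 32 KiB is never credited and the wait cannot finish.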
Two adjacent facts worth stating in the same comment (both verified on
sm_103): with SM100_TMA_2SM_LOAD both CTAs of the pair issue loads
against their own barrier handle and the hardware redirects every arrival
to the even (leader) CTA's barrier, so only the leader calls
set_barrier_transaction_bytes; and a peer-CTA TMA completing before
the leader's expect-tx is legal (the mbarrier transaction count may go
transiently negative).
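
The ordering claim can be illustrated with a toy host-side model of one barrier phase. This is a deliberately simplified sketch and not the hardware implementation: `PhaseModel` and its members are hypothetical names, and it assumes only that a phase completes once the pending arrival count and the transaction count are both zero.

```cpp
#include <cstdint>

// Hypothetical, simplified model of a single mbarrier phase.
struct PhaseModel {
  std::int64_t pending_arrivals;  // arrivals still required in this phase
  std::int64_t tx_count = 0;      // outstanding transaction bytes (may be < 0)

  void expect_tx(std::int64_t bytes) { tx_count += bytes; }    // leader arms
  void complete_tx(std::int64_t bytes) { tx_count -= bytes; }  // TMA bytes land
  void arrive() { --pending_arrivals; }
  bool phase_complete() const { return pending_arrivals == 0 && tx_count == 0; }
};

// Peer CTA's loads land on the leader barrier before the leader arms it.
// Byte counts assume 32 KiB per CTA per stage (16 KiB A + 16 KiB B).
inline bool peer_first_sequence_completes() {
  constexpr std::int64_t kPerCta = 32 * 1024;
  PhaseModel bar{1};                  // one arrival expected, from the leader

  bar.complete_tx(kPerCta);           // peer's bytes arrive first
  const bool went_negative = bar.tx_count < 0;

  bar.expect_tx(2 * kPerCta);         // leader arms with the pair's total
  bar.arrive();
  bar.complete_tx(kPerCta);           // leader's own bytes arrive

  return went_negative && bar.phase_complete();
}
```

In this model the count dips to -32 KiB and still returns to zero, which is why only the totals have to match, not the order of the operations.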
Proposed change
- Refresh the stale print annotations in 04_mma_tma_2sm_sm100.cu (table above) and fix the NumMma_M/NumMma_N comment typos.
- Add a short warning comment above the tma_transaction_bytes computation.
- Optionally mirror one sentence on 2x1SM operand residency (A: M-half per CTA, B: N/2 per CTA, C/TMEM: own 128 rows x N) into the Blackwell functionality docs.
No code behavior changes — annotations and comments only. Happy to split
1 and 2 into separate commits if preferred.
Evidence / environment
snippet.cu, attached): builds with nvcc -std=c++20 -arch=sm_103a -I$CUTLASS_DIR/include, no GPU needed; output above reproduced against both CUTLASS 4.2.0 and current main
headers (2026-06).
driver 580.159.03 — ncu smem-footprint/occupancy readings and mbarrier
tx-byte balance on a warp-specialized 2SM GEMM built from this tutorial
pattern; both deadlock incidents reproduced and fixed by the one-line
formula corrections described above.
snippet.cu