[Tutorial] Fix race condition in 2SM TMEM alloc for Cute Blackwell Tutorial 04/05 by pchen7e2 · Pull Request #3316 · NVIDIA/cutlass

pchen7e2 · 2026-06-14T18:27:26Z

Problem

There's a race condition between two pair CTAs when doing 2cta mode Tensor Memory allocation in the two tutorials.

PTX ISA specifies:

When .cta_group::2 is specified, the issuing warp must make sure that peer CTA is launched and its warps eventually participate in collective operations.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-instructions-tcgen05-alloc-dealloc-relinquish-alloc-permit

To demonstrate the issue, run the kernel with compute-sanitizer from CUDA 13.1:

$ /usr/local/cuda-13.1/bin/compute-sanitizer  --tool racecheck --racecheck-report all --print-limit 1 ./examples/cute/tutorial/blackwell/cute_tutorial_04_mma_tma_2sm_sm100

========= COMPUTE-SANITIZER
...
========= Error: Potential RAW hazard detected (CUDA barrier operation) at __shared__ 0xf000048 in block (0,0,0) :
=========     Write Thread (31,0,0) (block rank 15) at void gemm_device<SharedStorage<cutlass::half_t, cutlass::half_t ... float>(T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12)+0xffffffffffffff10
=========     Read Thread (0,0,0) (block rank 14) at cute::TMEM::Allocator2Sm::allocate(int, unsigned int *)+0xdf50 in tmem_allocator_sm100.hpp:141
=========         Device Frame: void gemm_device<SharedStorage<cutlass::half_t, c ... float>(T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12)+0xdcb0 in 04_mma_tma_2sm_sm100.cu:247
...
========= RACECHECK SUMMARY: 1 hazard displayed (65 errors, 0 warnings)

Reading through SASS and this report my rough guess would be that internally the two CTAs use an internal mbarrier to notify each other when doing various read/write for the TMEM alloc. The initialization of that mbarrier is outside of the kernel code. If one CTA arrives peer's mbarrier before it's properly initialized, it's UB and may cause a hang.

Fix

We can simply fix it by having an additional cluster level sync (cluster bar arrive + wait) before the tmem alloc to make sure both CTAs are ready.

With the fix compute-sanitizer no longer complains:

$ /usr/local/cuda-13.1/bin/compute-sanitizer  --tool racecheck --racecheck-report all --print-limit 1 ./examples/cute/tutorial/blackwell/cute_tutorial_04_mma_tma_2sm_sm100
...
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)

Add cluster sync before 2SM tmem alloc

94fb8c5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Tutorial] Fix race condition in 2SM TMEM alloc for Cute Blackwell Tutorial 04/05 #3316

[Tutorial] Fix race condition in 2SM TMEM alloc for Cute Blackwell Tutorial 04/05 #3316
pchen7e2 wants to merge 1 commit into
NVIDIA:mainfrom
pchen7e2:fix-2sm-tutorial

pchen7e2 commented Jun 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pchen7e2 commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pchen7e2 commented Jun 14, 2026 •

edited

Loading