Skip to content

[Tutorial] Fix race condition in 2SM TMEM alloc for Cute Blackwell Tutorial 04/05 #3316

Open
pchen7e2 wants to merge 1 commit into
NVIDIA:mainfrom
pchen7e2:fix-2sm-tutorial
Open

[Tutorial] Fix race condition in 2SM TMEM alloc for Cute Blackwell Tutorial 04/05 #3316
pchen7e2 wants to merge 1 commit into
NVIDIA:mainfrom
pchen7e2:fix-2sm-tutorial

Conversation

@pchen7e2

@pchen7e2 pchen7e2 commented Jun 14, 2026

Copy link
Copy Markdown

Problem

There's a race condition between two pair CTAs when doing 2cta mode Tensor Memory allocation in the two tutorials.

PTX ISA specifies:

When .cta_group::2 is specified, the issuing warp must make sure that peer CTA is launched and its warps eventually participate in collective operations.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-instructions-tcgen05-alloc-dealloc-relinquish-alloc-permit

To demonstrate the issue, run the kernel with compute-sanitizer from CUDA 13.1:

$ /usr/local/cuda-13.1/bin/compute-sanitizer  --tool racecheck --racecheck-report all --print-limit 1 ./examples/cute/tutorial/blackwell/cute_tutorial_04_mma_tma_2sm_sm100

========= COMPUTE-SANITIZER
...
========= Error: Potential RAW hazard detected (CUDA barrier operation) at __shared__ 0xf000048 in block (0,0,0) :
=========     Write Thread (31,0,0) (block rank 15) at void gemm_device<SharedStorage<cutlass::half_t, cutlass::half_t ... float>(T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12)+0xffffffffffffff10
=========     Read Thread (0,0,0) (block rank 14) at cute::TMEM::Allocator2Sm::allocate(int, unsigned int *)+0xdf50 in tmem_allocator_sm100.hpp:141
=========         Device Frame: void gemm_device<SharedStorage<cutlass::half_t, c ... float>(T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12)+0xdcb0 in 04_mma_tma_2sm_sm100.cu:247
...
========= RACECHECK SUMMARY: 1 hazard displayed (65 errors, 0 warnings)

Reading through SASS and this report my rough guess would be that internally the two CTAs use an internal mbarrier to notify each other when doing various read/write for the TMEM alloc. The initialization of that mbarrier is outside of the kernel code. If one CTA arrives peer's mbarrier before it's properly initialized, it's UB and may cause a hang.

Fix

We can simply fix it by having an additional cluster level sync (cluster bar arrive + wait) before the tmem alloc to make sure both CTAs are ready.

With the fix compute-sanitizer no longer complains:

$ /usr/local/cuda-13.1/bin/compute-sanitizer  --tool racecheck --racecheck-report all --print-limit 1 ./examples/cute/tutorial/blackwell/cute_tutorial_04_mma_tma_2sm_sm100
...
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant