[Tutorial] Fix race condition in 2SM TMEM alloc for Cute Blackwell Tutorial 04/05 #3316
Open
pchen7e2 wants to merge 1 commit into
Open
[Tutorial] Fix race condition in 2SM TMEM alloc for Cute Blackwell Tutorial 04/05 #3316pchen7e2 wants to merge 1 commit into
pchen7e2 wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
There's a race condition between two pair CTAs when doing 2cta mode Tensor Memory allocation in the two tutorials.
PTX ISA specifies:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-instructions-tcgen05-alloc-dealloc-relinquish-alloc-permit
To demonstrate the issue, run the kernel with
compute-sanitizerfrom CUDA 13.1:Reading through SASS and this report my rough guess would be that internally the two CTAs use an internal mbarrier to notify each other when doing various read/write for the TMEM alloc. The initialization of that mbarrier is outside of the kernel code. If one CTA arrives peer's mbarrier before it's properly initialized, it's UB and may cause a hang.
Fix
We can simply fix it by having an additional cluster level sync (cluster bar arrive + wait) before the tmem alloc to make sure both CTAs are ready.
With the fix
compute-sanitizerno longer complains: