[CuTeDSL] Distributed example, using TMALDG to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMASTG #2970
+691 −0
Add TMA-based distributed all-reduce example (all_reduce_tma.py)
A tutorial example demonstrating TMA usage for distributed all-reduce operations across multiple GPUs.
Key features:
- TMA loads (TMALDG) to pull each remote rank's buffer, rank by rank, into shared memory
- In-CTA reduction of the per-rank tiles
- Broadcast of the reduced result to all ranks via a single multimem TMA store (TMASTG)
Note: This example prioritizes clarity over performance optimization, serving as a learning resource for TMA-based distributed operations.
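For readers who want the shape of the algorithm without the CuTeDSL specifics, the sketch below mirrors the same three-phase structure (visit each rank's buffer, reduce in the CTA, broadcast the result). It is a minimal illustration, not the PR's implementation: the kernel name, the pointer-array signature, and the use of plain global-memory loads and stores in place of TMALDG and multimem TMASTG are all assumptions made for clarity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hedged sketch: same rank-by-rank load / reduce / broadcast flow as the
// CuTeDSL example, but with ordinary peer loads/stores instead of TMA bulk
// copies and multimem stores. Pointer exchange (CUDA IPC or symmetric memory)
// is assumed to have been done by the caller, and both pointer arrays must
// live in device-accessible memory.
__global__ void all_reduce_naive(float* const* in_bufs,   // in_bufs[r]: rank r's input buffer
                                 float* const* out_bufs,  // out_bufs[r]: rank r's output buffer
                                 int num_ranks,
                                 size_t num_elems) {
    size_t idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= num_elems) return;

    // Phase 1: visit each rank's buffer in turn and accumulate.
    // (The CuTeDSL example instead TMA-loads a tile per rank into shared memory.)
    float acc = 0.0f;
    for (int r = 0; r < num_ranks; ++r) {
        acc += in_bufs[r][idx];
    }

    // Phase 2: write the reduced value to every rank's output buffer.
    // (The CuTeDSL example replaces this loop with a single multimem TMA store
    //  to a multicast address, broadcasting each tile in one operation.)
    for (int r = 0; r < num_ranks; ++r) {
        out_bufs[r][idx] = acc;
    }
}
```

A launcher would gather one peer-visible pointer per rank, copy the two pointer arrays to device memory, and make sure every rank has finished writing its input before the kernel runs; using separate output buffers sidesteps the read/write race that an in-place broadcast would otherwise need a cross-rank barrier to avoid.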