[CuTeDSL] Distributed example, using TMALDG to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMASTG #2970
+691 −0
Add TMA-based distributed all-reduce example (all_reduce_tma.py)
A tutorial example demonstrating TMA usage for distributed all-reduce operations across multiple GPUs.
Key features:
- TMA loads (TMALDG) to pull each remote rank's buffer, rank by rank, into shared memory
- In-CTA reduction of the per-rank tiles
- Broadcast of the reduced result to all ranks via a single multimem TMA store (TMASTG)
Note: This example prioritizes clarity over performance optimization, serving as a learning resource for TMA-based distributed operations.
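For readers who want the shape of the algorithm without the CuTeDSL specifics, the sketch below mirrors the same three-phase structure (visit each rank's buffer, reduce in the CTA, broadcast the result). It is a minimal illustration, not the PR's implementation: the kernel name, the pointer-array signature, and the use of plain global-memory loads and stores in place of TMALDG and multimem TMASTG are all assumptions made for clarity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hedged sketch: same rank-by-rank load / reduce / broadcast flow as the
// CuTeDSL example, but with ordinary peer loads/stores instead of TMA bulk
// copies and multimem stores. Pointer exchange (CUDA IPC or symmetric memory)
// is assumed to have been done by the caller, and both pointer arrays must
// live in device-accessible memory.
__global__ void all_reduce_naive(float* const* in_bufs,   // in_bufs[r]: rank r's input buffer
                                 float* const* out_bufs,  // out_bufs[r]: rank r's output buffer
                                 int num_ranks,
                                 size_t num_elems) {
    size_t idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= num_elems) return;

    // Phase 1: visit each rank's buffer in turn and accumulate.
    // (The CuTeDSL example instead TMA-loads a tile per rank into shared memory.)
    float acc = 0.0f;
    for (int r = 0; r < num_ranks; ++r) {
        acc += in_bufs[r][idx];
    }

    // Phase 2: write the reduced value to every rank's output buffer.
    // (The CuTeDSL example replaces this loop with a single multimem TMA store
    //  to a multicast address, broadcasting each tile in one operation.)
    for (int r = 0; r < num_ranks; ++r) {
        out_bufs[r][idx] = acc;
    }
}
```

A launcher would gather one peer-visible pointer per rank, copy the two pointer arrays to device memory, and make sure every rank has finished writing its input before the kernel runs; using separate output buffers sidesteps the read/write race that an in-place broadcast would otherwise need a cross-rank barrier to avoid.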