Conversation

@aragorn-guan

Add TMA-based distributed all-reduce example (all_reduce_tma.py)

A tutorial example demonstrating TMA usage for distributed all-reduce operations across multiple GPUs.

Key features:

  • Uses TMALDG.1D to load from remote GPU memory via NVSHMEM addresses
  • Uses TMASTG.1D to store to a multicast address for broadcasting
  • Supports any input shape by flattening it to 1D and tiling linearly
  • A two-stage pipeline overlaps TMA loads across ranks (see the sketch below)

Note: This example prioritizes clarity over performance optimization, serving as a learning resource for TMA-based distributed operations.
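
To make the tiling and scheduling concrete, here is a minimal host-side Python sketch (not the PR's code) of the scheme described above: the input is flattened to 1D, split into fixed-size tiles assigned linearly to ranks, and each tile is reduced by pulling it from every peer. The tile size, rank count, and simulated per-rank buffers are illustrative assumptions; the actual kernel issues TMALDG.1D loads from NVSHMEM peer addresses and a TMASTG.1D store to a multicast address.

```python
# Illustrative host-side sketch of the tiling/reduction scheme; values are assumptions.
TILE_ELEMS = 1024          # assumed tile size in elements
NUM_RANKS = 4              # assumed number of GPUs / NVSHMEM PEs


def tiles_for_rank(total_elems: int, rank: int):
    """Flatten to 1D, tile linearly, and assign tiles round-robin to ranks."""
    num_tiles = (total_elems + TILE_ELEMS - 1) // TILE_ELEMS
    return [t for t in range(num_tiles) if t % NUM_RANKS == rank]


def all_reduce_tile(buffers, tile):
    """Reduce one tile by reading it from every peer.

    Each slice read stands in for a TMALDG.1D pull from a peer's NVSHMEM
    address; in the real kernel a two-stage pipeline prefetches the tile from
    peer r+1 while the data already loaded from peer r is being accumulated.
    """
    lo, hi = tile * TILE_ELEMS, (tile + 1) * TILE_ELEMS
    acc = [0.0] * (hi - lo)
    for peer in range(NUM_RANKS):
        chunk = buffers[peer][lo:hi]      # stand-in for a remote TMA load
        for i, v in enumerate(chunk):
            acc[i] += v
    return acc                            # would be a TMASTG.1D multicast store


if __name__ == "__main__":
    total = 4096
    buffers = [[float(r)] * total for r in range(NUM_RANKS)]  # fake per-rank data
    rank = 0
    for t in tiles_for_rank(total, rank):
        reduced = all_reduce_tile(buffers, t)
        assert reduced[0] == sum(range(NUM_RANKS))  # each element sums to 0+1+2+3
```

The round-robin tile assignment is only one plausible partitioning; the key point is that the linear 1D tiling makes the scheme shape-agnostic, and the per-peer load loop is where the two-stage pipeline hides load latency.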

@shubaoyu2
Contributor

LGTM, and also cc @IonThruster @brandon-yujie-sun @fengxie @hwu36 for review and approval.
