
[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store (#2970)

Merged
Junkai-Wu merged 1 commit into NVIDIA:main from aragorn-guan:tma_distribute_example on Feb 11, 2026

Conversation

@aragorn-guan
Contributor

@aragorn-guan aragorn-guan commented Jan 21, 2026

Add TMA-based distributed all-reduce example (all_reduce_tma.py)

A tutorial example demonstrating TMA usage for distributed all-reduce operations across multiple GPUs.

Key features:

  • Uses a 1D TMA load to read from remote GPU memory via NVSHMEM addresses
  • Uses a 1D TMA store to a multicast address to broadcast the reduced result to all ranks
  • Supports any input shape by flattening it to 1D and tiling linearly
  • Uses a two-stage pipeline to overlap TMA loads across ranks
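The flatten-and-tile scheme in the third bullet can be sketched in plain Python. This is an illustrative sketch only; the function name and tile size are assumptions, not taken from the example:

```python
import numpy as np

def linear_tiles(shape, tile_elems):
    """Flatten an arbitrary tensor shape to 1D and yield (start, stop)
    element ranges, one per tile; the last tile may be partial."""
    total = int(np.prod(shape))
    for start in range(0, total, tile_elems):
        yield start, min(start + tile_elems, total)

# Any input shape works once flattened: (4, 3, 5) has 60 elements,
# which a tile size of 16 splits into three full tiles and one partial.
tiles = list(linear_tiles((4, 3, 5), tile_elems=16))
```

Because the kernel only ever sees 1D tiles, the same 1D TMA descriptor covers every input shape.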

Note: This example prioritizes clarity over performance optimization, serving as a learning resource for TMA-based distributed operations.
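The overall communication pattern can be simulated on the host in NumPy. This is a hedged sketch of the algorithm's semantics only, not the CuTeDSL implementation: the per-rank loop stands in for the rank-by-rank TMA loads, the accumulation for the in-CTA reduction, and the final write-to-all for the multimem TMA store broadcast:

```python
import numpy as np

def all_reduce_sim(rank_buffers):
    """Host-side simulation of the kernel's pattern: each rank's buffer
    is read one rank at a time (the per-rank remote TMA loads),
    accumulated (the reduction in the CTA), and the sum is then written
    back to every rank (the multicast broadcast store)."""
    acc = np.zeros_like(rank_buffers[0])
    for buf in rank_buffers:          # rank-by-rank remote loads
        acc = acc + buf               # reduce in the CTA
    for i in range(len(rank_buffers)):
        rank_buffers[i][:] = acc      # multicast store: every rank gets the sum
    return rank_buffers

# Four ranks, each holding its own rank id; after the all-reduce every
# rank holds 0 + 1 + 2 + 3.
bufs = [np.full(8, r, dtype=np.float32) for r in range(4)]
all_reduce_sim(bufs)
```

In the real kernel the two loops are fused into a two-stage pipeline, so the TMA load from rank r+1 is in flight while the tile from rank r is being reduced.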

@shubaoyu2
Contributor

LGTM, and also cc @IonThruster @brandon-yujie-sun @fengxie @hwu36 for review and approval.

@Junkai-Wu Junkai-Wu changed the title [CuTeDSL] Distributed example, using TMALDG to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMASTG [CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMASTG Feb 11, 2026
@Junkai-Wu Junkai-Wu changed the title [CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMASTG [CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store Feb 11, 2026
@Junkai-Wu Junkai-Wu merged commit 8dbce01 into NVIDIA:main Feb 11, 2026


4 participants