Fused all-gather+GEMM HBM-buffer kernel for iris.ops by neoblizz · Pull Request #346 · ROCm/iris

neoblizz · 2026-02-03T17:39:22Z

Adds all_gather_matmul_hbm_buffer: a fused kernel that pipelines all-gather and GEMM by splitting workgroups (WGs) into dedicated fetchers and GEMM workers. Fetchers pull remote A tiles into a local HBM staging buffer and set per-tile ready flags; GEMM WGs spin on those flags and compute as tiles arrive, eliminating the full all-gather barrier. In the results below, latency versus the barrier-based baseline on 8× MI325X is 2.1× lower at M=1024 and 2.7–3.4× lower at M=16384.
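The fetcher/GEMM split can be pictured with a CPU-only analogy. The sketch below is not the Triton kernel from this PR: it assumes Python threads in place of workgroups, a plain list in place of the HBM staging buffer, and threading.Event objects in place of the per-tile ready flags. All names (pipelined_matmul, fetcher, gemm_worker) are hypothetical.

```python
import threading


def pipelined_matmul(a_tiles, b_tiles, num_fetchers=2):
    """Compute sum_t a_tiles[t] @ b_tiles[t], consuming tiles as they arrive."""
    num_tiles = len(a_tiles)
    staging = [None] * num_tiles  # stand-in for the local HBM staging buffer
    ready = [threading.Event() for _ in range(num_tiles)]  # per-tile ready flags

    def fetcher(worker_id):
        # Each fetcher handles a strided subset of the tiles.
        for t in range(worker_id, num_tiles, num_fetchers):
            staging[t] = [row[:] for row in a_tiles[t]]  # simulated remote read
            ready[t].set()  # publish the tile only after the copy is complete

    rows, cols = len(a_tiles[0]), len(b_tiles[0][0])
    acc = [[0.0] * cols for _ in range(rows)]

    def gemm_worker():
        for t in range(num_tiles):
            ready[t].wait()  # wait on this tile's flag, not on a global barrier
            a, b = staging[t], b_tiles[t]
            for i in range(rows):
                for j in range(cols):
                    acc[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))

    threads = [threading.Thread(target=fetcher, args=(w,)) for w in range(num_fetchers)]
    threads.append(threading.Thread(target=gemm_worker))
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return acc


# A is 2x4, split along K into two 2x2 tiles; B is 4x2, split into two 2x2 tiles.
A_TILES = [[[1.0, 2.0], [5.0, 6.0]], [[3.0, 4.0], [7.0, 8.0]]]
B_TILES = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 2.0]]]
RESULT = pipelined_matmul(A_TILES, B_TILES)
```

The consumer starts on tile 0 as soon as its flag is set, even if later tiles are still in flight. That overlap is what the per-tile flags buy over a single barrier after the whole gather.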

New kernel

iris/ops/all_gather_matmul_hbm_buffer.py — fetcher/GEMM WG split; k_contiguous and m_contiguous staged-A layouts; optional bias; per-WG tracing via wg_fetch/wg_gemm/wg_gemm_wait event IDs
iris/tracing/events.py — trace event IDs for per-workgroup profiling

API / config changes

iris/x/gather.py — hint vectorization parameter forwarded to _translate()
iris/ops/__init__.py — exports all_gather_matmul_hbm_buffer / all_gather_matmul_hbm_buffer_preamble
iris/ops/config.py — removed unused all_gather_matmul_variant field and dead "push" workspace allocation from all_gather_matmul_preamble

Benchmark & tests

benchmark/ops/bench_all_gather_matmul.py — merged baseline and HBM-buffer variants under @bench.axis("algorithm", ["baseline", "hbm_buffer"]); bench_all_gather_matmul_hbm_buffer.py deleted
tests/ops/test_all_gather_matmul.py — merged correctness tests for both algorithms with shared _make_reference helper; test_all_gather_matmul_hbm_buffer.py deleted

Results (8× AMD MI325X, float16, N=3584, K=8192)

Ranks	M×N×K	Baseline (ms)	HBM Buffer (ms)	Speedup	HBM Buffer TFLOPS
2	1024×3584×8192	1.67	0.78	2.1×	77
2	16384×3584×8192	27.8	8.2	3.4×	117
4	16384×3584×8192	27.3	8.6	3.2×	112
8	16384×3584×8192	24.4	8.9	2.7×	108
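The Speedup and TFLOPS columns can be re-derived from the two latency columns. The sketch below assumes speedup is baseline time divided by HBM-buffer time, and that TFLOPS counts 2·M·N·K floating-point operations over the HBM-buffer time. The PR does not state these formulas, but they reproduce the table values after rounding.

```python
N, K = 3584, 8192

# (ranks, M, baseline_ms, hbm_buffer_ms), copied from the results table
ROWS = [
    (2, 1024, 1.67, 0.78),
    (2, 16384, 27.8, 8.2),
    (4, 16384, 27.3, 8.6),
    (8, 16384, 24.4, 8.9),
]


def speedup(baseline_ms, fused_ms):
    """Ratio of baseline latency to fused-kernel latency."""
    return baseline_ms / fused_ms


def tflops(m, n, k, time_ms):
    """GEMM throughput, counting one multiply and one add per (m, n, k) triple."""
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


SUMMARY = [
    (ranks, m, round(speedup(base, hbm), 1), round(tflops(m, N, K, hbm)))
    for ranks, m, base, hbm in ROWS
]
```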

Copilot · 2026-04-22T20:06:42Z

@copilot can you please address Ryan's comment.

Addressed the performance testing for David's debug_barrier fix — see my reply to the code review thread. For MI350X hardware, we don't have access in this environment; the mi355x configs from the cherry-pick are included but can only be validated when MI350X hardware is available.

mawad-amd · 2026-04-22T20:43:38Z

@copilot remove the charts you uploaded and get this PR ready for final review.

ryanswann-amd · 2026-05-04T18:27:20Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-05-04T18:48:47Z

@copilot remove the charts you uploaded and get this PR ready for final review.

Done in commit 801adb7. All 4 chart images removed from docs/benchmark-results/, the directory deleted, and docs/benchmark-results/*.png added to .gitignore. The branch is also synced with main (merge resolved).

@copilot resolve the merge conflicts in this pull request

Merge conflicts resolved in commit 801adb7. Changes made:

Merged main (iris restructured: iris/x/ → iris.mem.triton.context, iris/tracing/ → iris/host/tracing/, etc.)
Updated all_gather_matmul_hbm_buffer.py to use new API: iris.make_tensor_view, iris.TileView, ctx.gather
Added hint parameter to ctx.gather() in iris/mem/triton/context.py
Updated iris.host.distributed.fd_passing to use new logging imports
All conflicts resolved, ruff checks passing

ryanswann-amd · 2026-05-04T18:50:09Z

@copilot can you expand test coverage for the kernels added in this PR?

Copilot · 2026-05-04T19:22:46Z

@copilot can you expand test coverage for the kernels added in this PR?

Expanded in commit 63d84c4. Added 16 new tests across two categories:

Distributed kernel tests:

bfloat16 dtype added to all parametrized correctness tests (baseline, HBM buffer, bias)
M=512 added to HBM buffer shape parametrize
test_all_gather_matmul_hbm_buffer_auto_workspace — verifies workspace=None triggers automatic preamble
test_all_gather_matmul_hbm_buffer_workspace_reuse — verifies two consecutive calls with the same workspace produce identical results
test_all_gather_matmul_hbm_buffer_trace — verifies trace_data is not populated when trace=False

Pure-Python _auto_config unit tests (no distributed required):

test_auto_config_heuristic_validity (5 shapes) — k_per_flag divides K//block_k
test_auto_config_champion_shapes — champion config values returned exactly
test_auto_config_large_m_uses_block_256 / test_auto_config_small_m_uses_block_128
test_auto_config_block_n_always_256 / test_auto_config_block_k_always_64
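The properties these unit tests check can be illustrated with a stand-in heuristic. The sketch below is not the PR's _auto_config: the function name, the M threshold, and the preferred k_per_flag are assumptions chosen for illustration, and the champion-shape lookup is omitted. Only the tested invariants come from the list above: block_n is 256, block_k is 64, block_m is 256 for large M and 128 for small M, and k_per_flag divides K // block_k.

```python
def sketch_auto_config(m, k, large_m_threshold=4096, preferred_k_per_flag=8):
    """Hypothetical heuristic that satisfies the invariants listed above."""
    block_n, block_k = 256, 64
    if k < block_k or k % block_k != 0:
        raise ValueError("this sketch assumes K is a positive multiple of block_k")

    # Assumed cutoff: the real threshold is not given in the PR text.
    block_m = 256 if m >= large_m_threshold else 128

    # Pick the largest k_per_flag <= preferred value that divides the K-block count.
    num_k_blocks = k // block_k
    k_per_flag = min(preferred_k_per_flag, num_k_blocks)
    while num_k_blocks % k_per_flag != 0:
        k_per_flag -= 1

    return {
        "block_m": block_m,
        "block_n": block_n,
        "block_k": block_k,
        "k_per_flag": k_per_flag,
    }


LARGE = sketch_auto_config(16384, 8192)
SMALL = sketch_auto_config(512, 8192)
ODD = sketch_auto_config(512, 640)  # 10 K-blocks, so 8 does not divide evenly
```

A validity test in the style of test_auto_config_heuristic_validity would then assert `(K // cfg["block_k"]) % cfg["k_per_flag"] == 0` for each shape.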

All 31 tests pass.

ryanswann-amd · 2026-05-06T15:29:53Z

@copilot I identified that the linked diff fixes the CI changes. Can you cherry pick minimal changes to fix the CI error: ryanswann-amd/iris@39c213d...5f30678

…uction Cherry-picked from ryanswann-amd/iris@5f30678. Two bugs caused deadlocks when iris.iris() is constructed/destroyed repeatedly (parametrized tests):

1. NCCL collective ordering: replaced _allgather_paths_tensor (which still uses dist.all_gather NCCL collective) with _allgather_paths_store using the dist.Store key-value API (pure TCP, zero NCCL).
2. Stale socket reuse: added per-process monotonic instance_id to socket paths and eager socket cleanup via SymmetricHeap.close_fd_conns().

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/4824970a-ba41-4384-ab0e-650090f96f71 Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>
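The stale-socket fix relies on socket paths that differ between successive instances created in the same process. A minimal sketch of that idea, assuming a module-level counter; the helper names and the file-name pattern are hypothetical and are not taken from iris/host/distributed/fd_passing.py:

```python
import itertools
import os

# Per-process monotonic counter: each new instance takes the next value.
_INSTANCE_IDS = itertools.count()


def next_instance_id():
    """Return an id that is never reused within this process."""
    return next(_INSTANCE_IDS)


def example_sock_path(rank, instance_id, base_dir="/tmp"):
    """Build a socket path that is unique per (process, instance, rank)."""
    name = f"iris-fd-{os.getpid()}-{instance_id}-r{rank}.sock"
    return os.path.join(base_dir, name)


FIRST = example_sock_path(0, next_instance_id())
SECOND = example_sock_path(0, next_instance_id())
```

Because the second instance never computes the first instance's path, it cannot connect to a socket file left behind by an earlier, partially torn-down instance.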

Copilot · 2026-05-06T15:46:56Z

@copilot I identified that the linked diff fixes the CI changes. Can you cherry pick minimal changes to fix the CI error: ryanswann-amd/iris@39c213d...5...

Cherry-picked in commit 677cb6f. Applied the minimal changes from ryanswann-amd/iris@5f30678 to the restructured paths:

iris/host/distributed/fd_passing.py: Replaced _allgather_paths_tensor (NCCL dist.all_gather) with _allgather_paths_store (pure TCP dist.Store API); added instance_id to make_rank_sock_path and setup_fd_infrastructure
iris/host/memory/symmetric_heap.py: Added close_fd_conns() for eager socket cleanup
iris/host/iris.py: Updated __del__ to call heap.close_fd_conns() before allocator cleanup
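Gathering per-rank socket paths through a key-value store rather than a collective can be sketched as follows. This is an illustration under assumptions, not the PR's _allgather_paths_store: the key format and function names are hypothetical, and InMemoryStore is a local stand-in so the example runs without torch. It mirrors only the set/get shape of a torch.distributed store, where get returns bytes and blocks until the key has been set.

```python
class InMemoryStore:
    """Single-process stand-in for a key-value store with set/get (hypothetical)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Store values as bytes, as a real distributed store would return them.
        self._data[key] = value.encode() if isinstance(value, str) else value

    def get(self, key):
        # A real store would block here until the key is available.
        return self._data[key]


def publish_path(store, prefix, rank, path):
    """Each rank writes its own socket path under a rank-specific key."""
    store.set(f"{prefix}/rank{rank}", path)


def collect_paths(store, prefix, world_size):
    """Each rank reads every rank's path; no collective call is involved."""
    return [store.get(f"{prefix}/rank{r}").decode() for r in range(world_size)]


store = InMemoryStore()
publish_path(store, "fd/0", 0, "/tmp/a.sock")
publish_path(store, "fd/0", 1, "/tmp/b.sock")
PATHS = collect_paths(store, "fd/0", world_size=2)
```

Including an instance-specific component in the key prefix (here the made-up "fd/0") keeps entries from different instances from colliding in the same store.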

Add benchmark capabilities for ops.

595423d

github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Feb 3, 2026

neoblizz and others added 27 commits February 7, 2026 11:03

Merge branch 'main' into neoblizz/iris-xops-perf

8c965a1

Merge conflicts.

ef227b0

Up the tritonBLAS commit.

f132ceb

...

1628a61

Apply Ruff auto-fixes

c26e872

Fix load vectorization and transpose config

3d4c7d7

Apply Ruff auto-fixes

5b02211

Add HBM buffered version

4c3b3f4

Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…

a301392

…yaswann/iris_xops_perf

Apply Ruff auto-fixes

1f3b9ef

Use workgroup specialized variant

45288ff

Apply Ruff auto-fixes

b2aadcd

Update hbm buffered all gather matmul

7b2321e

Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…

a4d845f

…yaswann/iris_xops_perf

Apply Ruff auto-fixes

9692222

Add tracing

44ebc97

Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…

0c2842e

…yaswann/iris_xops_perf

Apply Ruff auto-fixes

11d017a

Add stages to all_gather_matmul_hbm_buffer

ace40d0

Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…

950c3a0

…yaswann/iris_xops_perf

Apply Ruff auto-fixes

f7612bd

Updates to benchmark and kernel

51bccb5

Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…

9b71523

…yaswann/iris_xops_perf

Apply Ruff auto-fixes

cbe2aff

Add predictive params, fix pointer overflows, fix race conditions

11d9001

Apply Ruff auto-fixes

3c4cb4d

Merge branch 'neoblizz/iris-xops-perf' into ryaswann/iris_xops_perf

f2f755a

Copilot AI and others added 3 commits April 22, 2026 19:52

Fix: add tl.debug_barrier() before atomic.xchg, fix tests k_per_flag,…

9c99965

… use randn in benchmark Agent-Logs-Url: https://github.com/ROCm/iris/sessions/6b5a926e-b45d-43ac-abac-7b8805d2d367 Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Add state.skip() when iris disabled by auto-config, fix benchmark ran…

2dedbce

…dn inputs Agent-Logs-Url: https://github.com/ROCm/iris/sessions/6b5a926e-b45d-43ac-abac-7b8805d2d367 Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Use per-tensor Generator for seeding in benchmark, use ctx.randn for …

e42c7a3

…iris tensors Agent-Logs-Url: https://github.com/ROCm/iris/sessions/6b5a926e-b45d-43ac-abac-7b8805d2d367 Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Copilot finished work on behalf of mawad-amd April 22, 2026 20:08

Copilot started work on behalf of mawad-amd April 22, 2026 20:12

Add bar chart: iris vs RCCL vs expected for tuned shapes at ws=8 (MI3…

7f163a0

…00X) Agent-Logs-Url: https://github.com/ROCm/iris/sessions/306af9bb-b705-4c03-9d49-a31600e6df43 Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Copilot finished work on behalf of mawad-amd April 22, 2026 20:23

Copilot started work on behalf of mawad-amd April 22, 2026 20:38

Fix rccl benchmark: use dist.all_gather+cat(dim=1) for correct K-conc…

95dce96

…at; add bar chart ws8 Agent-Logs-Url: https://github.com/ROCm/iris/sessions/84eadb5a-429c-47b4-9636-dc3a70726a1a Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Copilot finished work on behalf of mawad-amd April 22, 2026 20:54

Copilot started work on behalf of ryanswann-amd May 4, 2026 18:27

Merge main: update to new iris API (iris.mem.triton.context), remove …

801adb7

…chart images Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>

Copilot finished work on behalf of ryanswann-amd May 4, 2026 18:50

Copilot started work on behalf of ryanswann-amd May 4, 2026 18:50

Expand test coverage for all_gather_matmul_hbm_buffer kernel

63d84c4

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/fc86711f-8150-48e4-85d6-c4c72c9f26b9 Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>

Copilot finished work on behalf of ryanswann-amd May 4, 2026 19:23

Copilot started work on behalf of ryanswann-amd May 6, 2026 15:30

Copilot finished work on behalf of ryanswann-amd May 6, 2026 15:48

mawad-amd approved these changes May 12, 2026

ryanswann-amd merged commit e466814 into main May 12, 2026
43 checks passed

ryanswann-amd deleted the neoblizz/iris-xops-perf branch May 12, 2026 04:38

ryanswann-amd merged 75 commits into main from neoblizz/iris-xops-perf

6 participants
