Fused all-gather+GEMM HBM-buffer kernel for iris.ops #346

Merged: ryanswann-amd merged 75 commits into main from neoblizz/iris-xops-perf on May 12, 2026
Conversation

@neoblizz
Member

@neoblizz neoblizz commented Feb 3, 2026

Adds all_gather_matmul_hbm_buffer: a fused kernel that pipelines all-gather and GEMM by splitting workgroups (WGs) into dedicated fetchers and GEMM workers. Fetchers pull remote A tiles into a local HBM staging buffer and set per-tile ready flags; GEMM WGs spin on those flags and compute as tiles arrive, eliminating the full all-gather barrier. Delivers 2.7–3.4× lower latency than the barrier-based baseline at M=16384 (2.1× at M=1024 on 2 ranks) on 8× MI325X.
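The fetcher/worker split described above can be illustrated with a CPU-only analogy. This is a minimal sketch, not the Triton kernel from this PR: threads stand in for workgroups, a NumPy array stands in for the HBM staging buffer, and `threading.Event` objects stand in for the per-tile ready flags. It also assumes A is sharded along K across ranks, which the description does not state. The name `fused_gather_matmul_sketch` and all of its parameters are hypothetical.

```python
import threading

import numpy as np


def fused_gather_matmul_sketch(a_shards, b, block_m=2, num_workers=2):
    """CPU analogy: fetchers stage tiles and set flags; workers wait per tile, then compute."""
    num_ranks = len(a_shards)
    m, k_local = a_shards[0].shape
    n = b.shape[1]
    num_m_tiles = (m + block_m - 1) // block_m

    # Stand-in for the local staging buffer that holds the gathered A.
    staged = np.zeros((m, k_local * num_ranks), dtype=np.float32)
    # One ready flag per (M tile, source rank) pair.
    ready = [[threading.Event() for _ in range(num_ranks)] for _ in range(num_m_tiles)]
    out = np.zeros((m, n), dtype=np.float32)

    def fetcher(rank):
        # Copy this rank's shard tile by tile, publishing each tile as soon as it lands.
        for t in range(num_m_tiles):
            rows = slice(t * block_m, min((t + 1) * block_m, m))
            cols = slice(rank * k_local, (rank + 1) * k_local)
            staged[rows, cols] = a_shards[rank][rows]
            ready[t][rank].set()

    def gemm_worker(worker_id):
        # Each worker owns a strided subset of M tiles and consumes K slices as they arrive.
        for t in range(worker_id, num_m_tiles, num_workers):
            rows = slice(t * block_m, min((t + 1) * block_m, m))
            acc = np.zeros((rows.stop - rows.start, n), dtype=np.float32)
            for rank in range(num_ranks):
                ready[t][rank].wait()  # wait on one tile only, never on the whole gather
                cols = slice(rank * k_local, (rank + 1) * k_local)
                acc += staged[rows, cols] @ b[cols].astype(np.float32)
            out[rows] = acc

    threads = [threading.Thread(target=fetcher, args=(r,)) for r in range(num_ranks)]
    threads += [threading.Thread(target=gemm_worker, args=(w,)) for w in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return out
```

The point of the sketch is the dependency structure: a worker blocks only on the specific tile it needs next, so compute on early tiles overlaps with the transfer of later ones.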

New kernel

  • iris/ops/all_gather_matmul_hbm_buffer.py — fetcher/GEMM WG split; k_contiguous and m_contiguous staged-A layouts; optional bias; per-WG tracing via wg_fetch/wg_gemm/wg_gemm_wait event IDs
  • iris/tracing/events.py — trace event IDs for per-workgroup profiling
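As a rough illustration of the two staged-A layouts, the sketch below assumes that k_contiguous means the K dimension has unit stride (row-major M×K) and that m_contiguous means the M dimension has unit stride (column-major). That reading is inferred from the names only; the kernel's actual buffer layout is not shown in this conversation.

```python
import numpy as np

M, K = 4, 6
a = np.arange(M * K, dtype=np.float16).reshape(M, K)

# Assumed meaning of "k_contiguous": consecutive K elements are adjacent in memory.
k_contiguous = np.ascontiguousarray(a)
# Assumed meaning of "m_contiguous": consecutive M elements are adjacent in memory.
m_contiguous = np.asfortranarray(a)

# Strides are in bytes; float16 has a 2-byte item size.
print(k_contiguous.strides)  # (12, 2)
print(m_contiguous.strides)  # (2, 8)

# Both layouts hold the same logical matrix; only the memory order differs.
same_values = np.array_equal(k_contiguous, m_contiguous)
```

Which layout is faster depends on how the GEMM workers load A tiles, so exposing both lets a benchmark pick per shape.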

API / config changes

  • iris/x/gather.py — hint vectorization parameter forwarded to _translate()
  • iris/ops/__init__.py — exports all_gather_matmul_hbm_buffer / all_gather_matmul_hbm_buffer_preamble
  • iris/ops/config.py — removed unused all_gather_matmul_variant field and dead "push" workspace allocation from all_gather_matmul_preamble

Benchmark & tests

  • benchmark/ops/bench_all_gather_matmul.py — merged baseline and HBM-buffer variants under @bench.axis("algorithm", ["baseline", "hbm_buffer"]); bench_all_gather_matmul_hbm_buffer.py deleted
  • tests/ops/test_all_gather_matmul.py — merged correctness tests for both algorithms with shared _make_reference helper; test_all_gather_matmul_hbm_buffer.py deleted
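A shared reference for both algorithms might look like the sketch below. It is not the `_make_reference` helper from the test file, whose body is not shown here. It assumes each rank holds an (M, K_local) shard of A, that the gather concatenates shards along K, and that the optional bias broadcasts against the (M, N) output; `make_reference_sketch` is a hypothetical name.

```python
import numpy as np


def make_reference_sketch(a_shards, b, bias=None):
    """Unfused reference: gather all shards first, then run a single matmul in float32."""
    # Assumption: shards are concatenated along the K (column) dimension.
    a_full = np.concatenate(a_shards, axis=1).astype(np.float32)
    out = a_full @ b.astype(np.float32)
    if bias is not None:
        # Assumption: bias has any shape that broadcasts against (M, N).
        out = out + bias.astype(np.float32)
    return out
```

Accumulating the reference in float32 keeps the comparison tolerance meaningful when the kernels under test run in float16 or bfloat16.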

Results (8× AMD MI325X, float16, N=3584, K=8192)

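| Ranks | M×N×K | Baseline (ms) | HBM Buffer (ms) | Speedup | HBM Buffer TFLOPS |
|---|---|---|---|---|---|
| 2 | 1024×3584×8192 | 1.67 | 0.78 | 2.1× | 77 |
| 2 | 16384×3584×8192 | 27.8 | 8.2 | 3.4× | 117 |
| 4 | 16384×3584×8192 | 27.3 | 8.6 | 3.2× | 112 |
| 8 | 16384×3584×8192 | 24.4 | 8.9 | 2.7× | 108 |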
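The Speedup and TFLOPS columns can be re-derived from the latencies. The sketch below assumes the usual GEMM operation count of 2·M·N·K floating-point operations and that TFLOPS is computed from the HBM-buffer latency; the PR does not state its formula, but these assumptions reproduce the reported figures after rounding.

```python
# (ranks, M, N, K, baseline_ms, hbm_buffer_ms), copied from the results above.
RESULTS = [
    (2, 1024, 3584, 8192, 1.67, 0.78),
    (2, 16384, 3584, 8192, 27.8, 8.2),
    (4, 16384, 3584, 8192, 27.3, 8.6),
    (8, 16384, 3584, 8192, 24.4, 8.9),
]


def speedup(baseline_ms, fused_ms):
    """Ratio of baseline latency to fused-kernel latency."""
    return baseline_ms / fused_ms


def tflops(m, n, k, latency_ms):
    """Assumed formula: 2*M*N*K operations divided by elapsed seconds, in units of 1e12."""
    return 2.0 * m * n * k / (latency_ms * 1e-3) / 1e12


derived = [
    (ranks, round(speedup(base, fused), 1), round(tflops(m, n, k, fused)))
    for ranks, m, n, k, base, fused in RESULTS
]
for ranks, s, t in derived:
    print(f"ranks={ranks} speedup={s}x tflops={t}")
```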

github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Feb 3, 2026
Copilot AI and others added 3 commits April 22, 2026 19:52
… use randn in benchmark

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/6b5a926e-b45d-43ac-abac-7b8805d2d367

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
…dn inputs

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/6b5a926e-b45d-43ac-abac-7b8805d2d367

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
…iris tensors

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/6b5a926e-b45d-43ac-abac-7b8805d2d367

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Contributor

Copilot AI commented Apr 22, 2026

@copilot can you please address Ryan's comment.

Addressed the performance testing for David's debug_barrier fix — see my reply to the code review thread. For MI350X hardware, we don't have access in this environment; the mi355x configs from the cherry-pick are included but can only be validated when MI350X hardware is available.

@mawad-amd
Collaborator

@copilot remove the charts you uploaded and get this PR ready for final review.

…at; add bar chart ws8

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/84eadb5a-429c-47b4-9636-dc3a70726a1a

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
@ryanswann-amd
Collaborator

@copilot resolve the merge conflicts in this pull request

…chart images

Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>
Contributor

Copilot AI commented May 4, 2026

@copilot remove the charts you uploaded and get this PR ready for final review.

Done in commit 801adb7. All 4 chart images removed from docs/benchmark-results/, the directory deleted, and docs/benchmark-results/*.png added to .gitignore. The branch is also synced with main (merge resolved).

@copilot resolve the merge conflicts in this pull request

Merge conflicts resolved in commit 801adb7. Changes made:

  • Merged main (iris restructured: iris/x/ → iris.mem.triton.context, iris/tracing/ → iris/host/tracing/, etc.)
  • Updated all_gather_matmul_hbm_buffer.py to use new API: iris.make_tensor_view, iris.TileView, ctx.gather
  • Added hint parameter to ctx.gather() in iris/mem/triton/context.py
  • Updated iris.host.distributed.fd_passing to use new logging imports
  • All conflicts resolved, ruff checks passing

@ryanswann-amd
Collaborator

@copilot can you expand test coverage for the kernels added in this PR?

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/fc86711f-8150-48e4-85d6-c4c72c9f26b9

Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>
Contributor

Copilot AI commented May 4, 2026

@copilot can you expand test coverage for the kernels added in this PR?

Expanded in commit 63d84c4. Added 16 new tests across two categories:

Distributed kernel tests:

  • bfloat16 dtype added to all parametrized correctness tests (baseline, HBM buffer, bias)
  • M=512 added to HBM buffer shape parametrize
  • test_all_gather_matmul_hbm_buffer_auto_workspace — verifies workspace=None triggers automatic preamble
  • test_all_gather_matmul_hbm_buffer_workspace_reuse — verifies two consecutive calls with the same workspace produce identical results
  • test_all_gather_matmul_hbm_buffer_trace — verifies trace_data is not populated when trace=False

Pure-Python _auto_config unit tests (no distributed required):

  • test_auto_config_heuristic_validity (5 shapes) — k_per_flag divides K//block_k
  • test_auto_config_champion_shapes — champion config values returned exactly
  • test_auto_config_large_m_uses_block_256 / test_auto_config_small_m_uses_block_128
  • test_auto_config_block_n_always_256 / test_auto_config_block_k_always_64

All 31 tests pass.
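The invariants that the `_auto_config` unit tests check can be illustrated with a toy heuristic. This is not the PR's `_auto_config`: the large-M threshold, the cap on k_per_flag, and every name below are hypothetical. Only the invariants come from the test names above (block_n is always 256, block_k is always 64, block_m is 256 for large M and 128 for small M, and k_per_flag divides K // block_k).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleConfig:
    block_m: int
    block_n: int
    block_k: int
    k_per_flag: int


def example_auto_config(m, k, large_m_threshold=4096, max_k_per_flag=8):
    """Toy heuristic; the threshold and cap are made-up values for illustration only."""
    block_k = 64
    block_n = 256
    if k < block_k or k % block_k != 0:
        raise ValueError("this sketch requires K to be a positive multiple of block_k")
    block_m = 256 if m >= large_m_threshold else 128
    num_k_blocks = k // block_k
    # Pick the largest candidate that evenly divides the number of K blocks, so every
    # ready flag covers the same number of blocks. d = 1 always qualifies.
    k_per_flag = max(d for d in range(1, max_k_per_flag + 1) if num_k_blocks % d == 0)
    return ExampleConfig(block_m, block_n, block_k, k_per_flag)
```

A property-style test can then assert the invariants over many shapes without needing a GPU or a distributed launch, which is what makes this category of test cheap to run in CI.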

@ryanswann-amd
Collaborator

@copilot I identified that the linked diff fixes the CI changes. Can you cherry pick minimal changes to fix the CI error: ryanswann-amd/iris@39c213d...5f30678

…uction

Cherry-picked from ryanswann-amd/iris@5f30678. Two bugs caused deadlocks
when iris.iris() is constructed/destroyed repeatedly (parametrized tests):

1. NCCL collective ordering: replaced _allgather_paths_tensor (which still
   uses dist.all_gather NCCL collective) with _allgather_paths_store using
   the dist.Store key-value API (pure TCP, zero NCCL).

2. Stale socket reuse: added per-process monotonic instance_id to socket
   paths and eager socket cleanup via SymmetricHeap.close_fd_conns().
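
The store-based exchange in item 1 can be sketched as follows. This is not the `_allgather_paths_store` implementation from the cherry-picked commit. It relies only on the `set(key, value)` and blocking `get(key)` methods that `torch.distributed.Store` provides, and uses a small in-memory stand-in so the example runs without PyTorch. `allgather_strings_via_store`, `InMemoryStore`, and `run_demo` are hypothetical names.

```python
import threading


def allgather_strings_via_store(store, rank, world_size, value, prefix="example_paths"):
    """Each rank publishes its string under its own key, then reads every peer's key."""
    store.set(f"{prefix}/{rank}", value)
    gathered = []
    for peer in range(world_size):
        raw = store.get(f"{prefix}/{peer}")  # blocks until the peer has published
        gathered.append(raw.decode() if isinstance(raw, bytes) else raw)
    return gathered


class InMemoryStore:
    """Thread-safe stand-in exposing the same set/get shape as a distributed store."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value.encode() if isinstance(value, str) else value
            self._cond.notify_all()

    def get(self, key):
        with self._cond:
            self._cond.wait_for(lambda: key in self._data)
            return self._data[key]


def run_demo(world_size=4):
    """Simulate ranks with threads; every rank should end up with the same ordered list."""
    store = InMemoryStore()
    results = [None] * world_size

    def worker(rank):
        results[rank] = allgather_strings_via_store(
            store, rank, world_size, f"/tmp/example-rank{rank}.sock"
        )

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Because each rank only writes its own key and reads the others, no collective call is issued, so the exchange cannot be reordered against pending collectives on another communicator.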

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/4824970a-ba41-4384-ab0e-650090f96f71

Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>
Contributor

Copilot AI commented May 6, 2026

@copilot I identified that the linked diff fixes the CI changes. Can you cherry pick minimal changes to fix the CI error: ryanswann-amd/iris@39c213d...5...

Cherry-picked in commit 677cb6f. Applied the minimal changes from ryanswann-amd/iris@5f30678 to the restructured paths:

  • iris/host/distributed/fd_passing.py: Replaced _allgather_paths_tensor (NCCL dist.all_gather) with _allgather_paths_store (pure TCP dist.Store API); added instance_id to make_rank_sock_path and setup_fd_infrastructure
  • iris/host/memory/symmetric_heap.py: Added close_fd_conns() for eager socket cleanup
  • iris/host/iris.py: Updated __del__ to call heap.close_fd_conns() before allocator cleanup
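
The stale-socket fix can be pictured with a short sketch. It is not the code in fd_passing.py: it only shows how a per-process monotonic counter keeps socket paths unique when an object is constructed and destroyed repeatedly in one process. `next_instance_id` and `example_sock_path` are hypothetical names, and the path format is made up.

```python
import itertools
import os
import tempfile

# Per-process monotonic counter: every new instance gets a fresh ID, even if an
# earlier instance has already been destroyed.
_instance_counter = itertools.count()


def next_instance_id():
    return next(_instance_counter)


def example_sock_path(rank, instance_id, base_dir=None):
    """Build a socket path that is unique per process, per instance, and per rank."""
    base_dir = base_dir or tempfile.gettempdir()
    name = f"example-{os.getpid()}-{instance_id}-rank{rank}.sock"
    return os.path.join(base_dir, name)


first = example_sock_path(rank=0, instance_id=next_instance_id())
second = example_sock_path(rank=0, instance_id=next_instance_id())
```

With the instance ID in the path, a second instance on the same rank never binds or connects to a socket file left behind by the first one, which is the failure mode parametrized tests tend to trigger.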

ryanswann-amd merged commit e466814 into main on May 12, 2026 (43 checks passed)
ryanswann-amd deleted the neoblizz/iris-xops-perf branch on May 12, 2026 04:38