Draft
182 commits
d44394c
wip back of sdma integration
dsidler Nov 6, 2025
c50e761
Apply Ruff auto-fixes
github-actions[bot] Nov 6, 2025
2f7bc5e
message passing example working
dsidler Nov 6, 2025
5e38fd6
Merge branch 'dev/dasidler/sdma' of https://github.com/ROCm/iris into…
dsidler Nov 6, 2025
759f662
Apply Ruff auto-fixes
github-actions[bot] Nov 6, 2025
ad7769d
update put example to use ce
dsidler Nov 7, 2025
b8862cc
update api calls
dsidler Nov 7, 2025
75c5626
update submodule
dsidler Nov 7, 2025
2b228ab
Merge branch 'dev/dasidler/sdma' of https://github.com/ROCm/iris into…
dsidler Nov 7, 2025
e3aef16
fix merge
dsidler Nov 7, 2025
df04547
Apply Ruff auto-fixes
github-actions[bot] Nov 7, 2025
c5e4735
wip fixed wrap into ring when placing
dsidler Dec 5, 2025
ea17dd6
Merge branch 'dev/dasidler/sdma' of https://github.com/ROCm/iris into…
dsidler Dec 5, 2025
5362318
to_rank 7 working
dsidler Dec 5, 2025
a6b1d40
Apply Ruff auto-fixes
github-actions[bot] Dec 10, 2025
224511f
Merge branch 'main' into dev/dasidler/sdma
dsidler Jan 14, 2026
400b5b7
use triton commit with fix
dsidler Jan 14, 2026
d06cb72
Apply Ruff auto-fixes
github-actions[bot] Jan 14, 2026
b2e358b
send to all ranks but always same stride
dsidler Jan 20, 2026
b245899
update submodule
dsidler Jan 20, 2026
0e7fbd6
Merge branch 'dev/dasidler/sdma' of https://github.com/ROCm/iris into…
dsidler Jan 20, 2026
1ee4c58
use 32B copy packets workaround
dsidler Jan 30, 2026
1c384c3
submodule update
dsidler Jan 30, 2026
595423d
Add benchmark capabilities for ops.
neoblizz Feb 3, 2026
8c965a1
Merge branch 'main' into neoblizz/iris-xops-perf
neoblizz Feb 7, 2026
ef227b0
Merge conflicts.
neoblizz Feb 7, 2026
f132ceb
Up the tritonBLAS commit.
neoblizz Feb 7, 2026
1628a61
...
neoblizz Feb 10, 2026
c26e872
Apply Ruff auto-fixes
github-actions[bot] Feb 10, 2026
3d4c7d7
Fix load vectorization and transpose config
ryanswann-amd Feb 11, 2026
5b02211
Apply Ruff auto-fixes
github-actions[bot] Feb 11, 2026
4c3b3f4
Add HBM buffered version
ryanswann-amd Feb 11, 2026
a301392
Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…
ryanswann-amd Feb 11, 2026
1f3b9ef
Apply Ruff auto-fixes
github-actions[bot] Feb 11, 2026
45288ff
Use workgroup specialized variant
ryanswann-amd Feb 13, 2026
b2aadcd
Apply Ruff auto-fixes
github-actions[bot] Feb 13, 2026
7b2321e
Update hbm buffered all gather matmul
ryanswann-amd Feb 16, 2026
a4d845f
Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…
ryanswann-amd Feb 16, 2026
9692222
Apply Ruff auto-fixes
github-actions[bot] Feb 16, 2026
44ebc97
Add tracing
ryanswann-amd Feb 16, 2026
0c2842e
Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…
ryanswann-amd Feb 17, 2026
11d017a
Apply Ruff auto-fixes
github-actions[bot] Feb 17, 2026
ace40d0
Add stages to all_gather_matmul_hbm_buffer
ryanswann-amd Feb 17, 2026
950c3a0
Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…
ryanswann-amd Feb 17, 2026
f7612bd
Apply Ruff auto-fixes
github-actions[bot] Feb 17, 2026
51bccb5
Updates to benchmark and kernel
ryanswann-amd Feb 17, 2026
9b71523
Merge branch 'ryaswann/iris_xops_perf' of github.com:ROCm/iris into r…
ryanswann-amd Feb 17, 2026
cbe2aff
Apply Ruff auto-fixes
github-actions[bot] Feb 17, 2026
11d9001
Add predictive params, fix pointer overflows, fix race conditions
Mar 3, 2026
3c4cb4d
Apply Ruff auto-fixes
github-actions[bot] Mar 3, 2026
f2f755a
Merge branch 'neoblizz/iris-xops-perf' into ryaswann/iris_xops_perf
ryanswann-amd Mar 3, 2026
77eff5b
Reverse 2D block translate
Mar 3, 2026
dcafd2a
Properly use iris tracing APIs
Mar 3, 2026
6fdad6d
Apply Ruff auto-fixes
github-actions[bot] Mar 3, 2026
08755b7
Remove test.sh
Mar 3, 2026
0224866
use window command
dsidler Mar 4, 2026
40c228a
use new acquire function
dsidler Mar 5, 2026
34d4ffc
update submodule
dsidler Mar 5, 2026
c8d4b46
Apply Ruff auto-fixes
github-actions[bot] Mar 5, 2026
88f7767
All gather matmul with improved performance. (#415)
ryanswann-amd Mar 5, 2026
53f1a20
move padding code
dsidler Mar 5, 2026
099a84c
update submodule for nop packet
dsidler Mar 5, 2026
75b55b2
enable flat copy
dsidler Mar 5, 2026
17d0696
Merge branch 'dev/dasidler/sdma' of https://github.com/ROCm/iris into…
dsidler Mar 5, 2026
e5a38dd
Apply Ruff auto-fixes
github-actions[bot] Mar 5, 2026
02d08c9
Merge branch 'main' into dev/dasidler/sdma
dsidler Mar 5, 2026
0b6ff1a
clean up
dsidler Mar 5, 2026
f558293
Fix CI: restore vectorization hints, align tritonBLAS versions, remov…
ryanswann-amd Mar 6, 2026
e5dd77f
Merge main into neoblizz/iris-xops-perf
ryanswann-amd Mar 6, 2026
477b472
Fix CI: increase default N to match FusedConfig block_size_n=256
ryanswann-amd Mar 6, 2026
76cc30d
Revert "Fix CI: increase default N to match FusedConfig block_size_n=…
ryanswann-amd Mar 6, 2026
9743b13
Remove unnecessary block size assertions — Triton handles masking
ryanswann-amd Mar 6, 2026
a86dc04
Initial plan
Copilot Mar 11, 2026
445b25c
Add vectorization hints and tests for HBM buffer all-gather matmul
Copilot Mar 12, 2026
2f0099f
Add vectorization hints and tests for HBM buffer all-gather matmul (#…
ryanswann-amd Mar 12, 2026
39c213d
Merge branch 'main' into neoblizz/iris-xops-perf
ryanswann-amd Mar 16, 2026
bfe4548
add copy engine support to fused gemm-allscatter
dsidler Mar 18, 2026
27040c8
Apply Ruff auto-fixes
github-actions[bot] Mar 18, 2026
bf55b6d
switch to acquire_fadd
dsidler Mar 24, 2026
aef1411
update submodule
dsidler Mar 24, 2026
2cea9f7
initial host initiated sdma
dsidler Mar 24, 2026
53bfeaa
refactor&cleanup
dsidler Mar 24, 2026
831be93
Merge branch 'dev/dasidler/sdma' of https://github.com/ROCm/iris into…
dsidler Mar 24, 2026
d191276
Apply Ruff auto-fixes
github-actions[bot] Mar 24, 2026
71feba4
Merge branch 'dev/dasidler/sdma' into dev/dasidler/sdma-benchmark
dsidler Mar 24, 2026
77a4d1d
initial copy engine ag-gemm
dsidler Mar 25, 2026
bb22e82
Apply Ruff auto-fixes
github-actions[bot] Mar 25, 2026
a44adde
persisten version
dsidler Mar 26, 2026
9891843
use same benchmark initialization to improve gemm time
dsidler Mar 26, 2026
671a622
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Mar 26, 2026
eded63e
Apply Ruff auto-fixes
github-actions[bot] Mar 26, 2026
6c8cdcc
version that performs reasonably
dsidler Mar 30, 2026
d8d136a
avoid local copy of a
dsidler Mar 30, 2026
9d4c7f9
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Mar 30, 2026
dd501a3
Apply Ruff auto-fixes
github-actions[bot] Mar 30, 2026
28f4dc8
add constraints
dsidler Mar 30, 2026
92608d4
adding initial matmul_all_gather
dsidler Mar 30, 2026
acedf8c
device-initiated
dsidler Mar 30, 2026
d88eb87
device initiated
dsidler Mar 30, 2026
39ac9d7
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Mar 30, 2026
0b763ca
Apply Ruff auto-fixes
github-actions[bot] Mar 30, 2026
d140349
initial derive version for copy engine
dsidler Mar 31, 2026
fc4601d
fix batch_id inc
dsidler Apr 1, 2026
29dafac
fix flag allocation
dsidler Apr 1, 2026
01932a1
fix host gemm-ag
dsidler Apr 2, 2026
8764190
importing derived params
dsidler Apr 2, 2026
08cd061
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Apr 2, 2026
933e531
fix host gemm+ag, add gemm only
dsidler Apr 3, 2026
b1c2417
update sweep
dsidler Apr 3, 2026
cac3ee0
gemm-ag host switch to m-tile batch, add tracing
dsidler Apr 3, 2026
623325c
Apply Ruff auto-fixes
github-actions[bot] Apr 3, 2026
e190d30
reuse locks btw iterations
dsidler Apr 6, 2026
654d81a
add host-initiated to iris api
dsidler Apr 6, 2026
69a25c8
add success flag to ag-gemm, update sweep script
dsidler Apr 6, 2026
b6dfb78
gemm-ag benchmark updates
dsidler Apr 6, 2026
39a840f
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Apr 6, 2026
ef26c0d
reuse locks
dsidler Apr 7, 2026
e0d85d5
hbm-buf fix race condition, reuse locks
dsidler Apr 7, 2026
bad3422
Initial plan for PR cleanup
Copilot Apr 8, 2026
2a9f31a
Cleanup PR: address reviewer feedback
Copilot Apr 8, 2026
98d25bf
Clarify bias handling in matmul_reduce_scatter: raise NotImplementedE…
Copilot Apr 8, 2026
196bef7
Merge branch 'main' into neoblizz/iris-xops-perf
Copilot Apr 8, 2026
f4b4e75
Sync with main, remove unneeded scripts, minimize PR footprint
Copilot Apr 8, 2026
9d29d8c
Port HBM buffer benchmark to iris.bench, remove helper scripts
Copilot Apr 8, 2026
2c8b226
Replace shmem with ctx in hbm_buffer kernel and tests
Copilot Apr 9, 2026
1f7f6f1
Updated copilot instructions: you have GPUs, use them
mawad-amd Apr 9, 2026
9999273
Add benchmark comparison plots for HBM buffer vs baseline
Copilot Apr 9, 2026
e6b7114
Merge benchmarks and tests, remove dead code
Copilot Apr 9, 2026
5fac461
Update benchmark comparison plots with MxNxK x-axis labels
Copilot Apr 9, 2026
184331c
Extend trace events with categorized ID ranges and fix tracing abuse
mawad-amd Apr 9, 2026
1b6df88
Apply Ruff auto-fixes
github-actions[bot] Apr 9, 2026
6b70059
Bump trace schema version to 1.2 for new event categories
mawad-amd Apr 9, 2026
8607e38
Add RCCL baseline and rename algorithms to one_shot/prefetch
mawad-amd Apr 9, 2026
63c978b
Fix RCCL benchmark: use regular CUDA memory, not iris symmetric heap
mawad-amd Apr 9, 2026
6a8ad6b
Fix RCCL benchmark: use dist.get_world_size() instead of ctx
mawad-amd Apr 9, 2026
5027f67
add args to matmul
dsidler Apr 9, 2026
292ee11
Update HBM buffer kernel defaults and benchmark for parameter sweep
Copilot Apr 9, 2026
6979787
Update benchmark plots with new vs previous defaults comparison
Copilot Apr 9, 2026
826e78f
add flag_iteration
dsidler Apr 10, 2026
dd50602
more robust sweep script
dsidler Apr 10, 2026
a058856
merge sweep script
dsidler Apr 10, 2026
11d36b8
switch to tritonblas
dsidler Apr 10, 2026
02ea2b6
Fix preamble FusedConfig() defaults and add shape-adaptive auto-config
ryanswann-amd Apr 11, 2026
64a631f
Fix collective ordering deadlock in fd_passing at ws<8
ryanswann-amd Apr 11, 2026
7d3f476
Apply Ruff auto-fixes
github-actions[bot] Apr 11, 2026
b59bbb2
use selector for gemm-ag host
dsidler Apr 13, 2026
cf24173
reuse selector gemm-ag host, cannot change block-size
dsidler Apr 13, 2026
2f9e2a7
switch to tritonblas temporarily
dsidler Apr 14, 2026
7786de8
use m-tile-per-batch=group-size
dsidler Apr 14, 2026
4545b3e
merged sweep
dsidler Apr 14, 2026
3942b10
host issue batch0 ag-gemm
dsidler Apr 14, 2026
053f9c8
remove copy hack
dsidler Apr 14, 2026
8d9f2dd
update heuristic gemm-ag host
dsidler Apr 15, 2026
d1c4e3d
do sdma quiet gemm-ag
dsidler Apr 15, 2026
0d7fed3
improve plotting
dsidler Apr 15, 2026
2c572c3
Apply Ruff auto-fixes
github-actions[bot] Apr 15, 2026
ef0a173
Port auto-config system from ryanswann-amd/iris feature/auto-config-x…
Copilot Apr 15, 2026
2528e8e
Add docs/benchmark-results/ to .gitignore
Copilot Apr 15, 2026
caed8a5
Remove accidentally committed .github/agents and benchmark images
Copilot Apr 15, 2026
b0f8ff6
using wave- and xcd-aware tile transfers
dsidler Apr 16, 2026
59bd83a
gemm-ag device wave-xcd aware
dsidler Apr 16, 2026
55884db
add metadata
dsidler Apr 17, 2026
0a66433
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Apr 17, 2026
86a979d
update fused gemm-ag
dsidler Apr 17, 2026
19eb057
add more scripts
dsidler Apr 17, 2026
7f70cb8
change expected alloc
dsidler Apr 17, 2026
6e7ba74
Merge branch 'neoblizz/iris-xops-perf' into dev/dasidler/sdma-benchmark
dsidler Apr 17, 2026
d52ca11
switch to new benchmarks
dsidler Apr 18, 2026
a3446aa
gemm-ag validation
dsidler Apr 20, 2026
8a1043f
validator updates and fixes
dsidler Apr 20, 2026
5323d95
Apply Ruff auto-fixes
github-actions[bot] Apr 20, 2026
c52e5ed
ag-gemm fixes, tritonblas+rccl
dsidler Apr 21, 2026
6216a32
change allocations in bench
dsidler Apr 22, 2026
e53c532
alloc updates, fix hbm buf
dsidler Apr 22, 2026
43103b8
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Apr 22, 2026
db60651
Apply Ruff auto-fixes
github-actions[bot] Apr 22, 2026
3623b99
include ag-gemm baseline in benchmark
dsidler Apr 22, 2026
9f2066f
no host quiet needed
dsidler Apr 22, 2026
c5caec2
refactor matmul-ag
dsidler Apr 23, 2026
89227df
refactor ag-gemm
dsidler Apr 23, 2026
a04fda9
Merge branch 'dev/dasidler/sdma-benchmark' of https://github.com/ROCm…
dsidler Apr 23, 2026
d790a3f
Apply Ruff auto-fixes
github-actions[bot] Apr 23, 2026
63 changes: 25 additions & 38 deletions .github/copilot-instructions.md
@@ -12,61 +12,48 @@ Iris is a Triton-based framework for Remote Memory Access (RMA) operations on AM

 **FOLLOW THESE INSTRUCTIONS EXACTLY. Reference these instructions first before using search or bash commands.**

-## Prerequisites
+## GPU Environment

-- **GPU**: AMD GPUs with ROCm compatibility (tested on MI300X, MI350X & MI355X)
-> **Note**: See below for instructions on development without AMD GPU access
-- **ROCm/HIP Toolkit**: Required for building C++/HIP components
-- **Docker/Apptainer**: Recommended for containerized development
+**You have AMD GPU access.** Your runner has 8x MI325X GPUs (gfx942) with ROCm installed. Do NOT skip GPU operations or say you lack GPU access. You can and should:

-## Build
+- Run benchmarks and post results (charts, tables) in PR comments
+- Run tests with `torchrun --nproc_per_node=N` for multi-GPU tests
+- Use `rocm-smi` to verify GPU status
+- Run `python -c "import torch; print(torch.cuda.device_count())"` to confirm GPU count

-### Docker Development Environment (Recommended)
+When asked to run a benchmark, **run it and post the output**. Do not say you cannot.
+
+### Running multi-GPU tests and benchmarks
+
+Multi-GPU tests require `torch.distributed` initialization before pytest:
 ```bash
-# Build and start development container (takes 45-60 minutes - NEVER CANCEL)
-docker compose up --build -d
+# Single GPU
+pytest tests/unittests/ -v --tb=short

-# Attach to running container
-docker attach iris-dev
+# Multi-GPU (N = number of GPUs)
+torchrun --nproc_per_node=N -m pytest tests/ -v --tb=short

-# Install Iris in development mode
-cd iris && pip install -e ".[dev]"
+# Benchmarks use iris.bench framework
+torchrun --nproc_per_node=8 benchmark/ops/bench_<name>.py
 ```

-### Alternative Docker Setup
-```bash
-# Build Docker image manually
-./docker/build.sh <image-name> # Takes 45-60 minutes
+### iris.bench framework

-# Run container
-./docker/run.sh <image-name>
+Benchmarks use the declarative `iris.bench` framework. See existing `benchmark/ops/bench_*.py` files for examples. Output includes latency, throughput, and bandwidth tables. When posting benchmark results in PR comments, format as markdown tables.

-# Install Iris
-cd iris && pip install -e ".[dev]"
-```
+## Prerequisites

-### Apptainer Setup
-```bash
-# Build and run Apptainer image
-./apptainer/build.sh
-./apptainer/run.sh
+- **GPU**: AMD GPUs with ROCm compatibility (tested on MI300X, MI325X, MI350X & MI355X)
+- **ROCm/HIP Toolkit**: Required for building C++/HIP components
+- **Docker/Apptainer**: Recommended for containerized development

-# Install Iris
-pip install -e ".[dev]"
-```
+## Build

-### Local Development (Not Recommended)
+iris is already installed in your environment via `pip install -e .` in the setup steps. You do not need to build or install anything. If you need to reinstall after modifying `setup.py` or C extensions:
 ```bash
-# Requires ROCm/HIP toolkit installation
 pip install -e ".[dev]"
 ```

-### Development Without AMD GPU
-If you don't have access to AMD GPUs, you can still contribute to the project:
-- **Code Editing**: Start editing code directly in your local environment
-- **CI Testing**: The project has comprehensive CI pipelines that will test your changes automatically. You can check the CI logs if your changes fail to understand what went wrong.
-- **Local Validation**: Run linting and formatting locally: `ruff check . --fix && ruff format .`
-
 ## Run

 ### Testing
8 changes: 7 additions & 1 deletion .gitignore
@@ -28,6 +28,8 @@ omni*.pdf
 slurm*.out

 *.egg-info
+*.backup
+*.with_chunked

 examples/gemm/results/*
 asm/
@@ -57,4 +59,8 @@ gpucore.*
 logs/
 *.cap
 hsakmt_counters.csv
-core
+core
+.intellikit/
+.github/agents/docs/benchmark-results/
+.github/agents/
+docs/benchmark-results/*.png
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
+[submodule "ext/shader_sdma"]
+path = ext/shader_sdma
+url = https://github.com/AARInternal/shader_sdma.git