
Conversation

@neoblizz (Member) commented Nov 1, 2025

Summary

This pull request introduces a new Collective Communication Library (CCL) for Iris, providing standalone collective primitives such as all-to-all communication, along with supporting infrastructure and documentation updates. The changes focus on enabling high-performance distributed collective operations that match PyTorch's RCCL/NCCL interface, including benchmarking and validation tools.

Collective Communication Library Implementation

  • Added the new iris/ccl package, including the all_to_all collective operation (iris/ccl/all_to_all.py) and a configuration structure for kernel tuning (iris/ccl/config.py). These modules provide a flexible and efficient interface for distributed tensor communication; a usage sketch follows this list. [1] [2]
  • Introduced the top-level iris/ccl/__init__.py to expose the collective primitives and configuration, matching PyTorch's interface for easy adoption.
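
As a rough illustration, a minimal usage sketch could look like the following. The exact all_to_all argument order, the Config fields, and the Iris context methods shown here (iris.iris, get_rank, get_num_ranks, zeros, barrier) are assumptions for illustration, not a definitive description of the iris.ccl API.

# Hypothetical usage sketch; signatures and tensor layout are assumptions.
import torch
import iris
from iris.ccl import all_to_all, Config

shmem = iris.iris(2**33)                      # Iris symmetric-heap context (8 GiB heap assumed)
rank = shmem.get_rank()
world_size = shmem.get_num_ranks()

M, N = 8192, 4096
# Each rank holds one N-column chunk per peer, laid out side by side.
input_tensor = shmem.zeros((M, N * world_size), dtype=torch.float16)
output_tensor = shmem.zeros((M, N * world_size), dtype=torch.float16)
input_tensor += rank                          # make each rank's contribution identifiable

config = Config()                             # kernel-tuning knobs (block sizes, etc.)
all_to_all(input_tensor, output_tensor, shmem, config=config)
shmem.barrier()                               # ensure all remote writes have landed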

Benchmarking and Validation Tools

  • Added a comprehensive benchmark script examples/ccl/benchmark.py to measure bandwidth and validate correctness of the all-to-all operation, supporting multiple datatypes and configurable parameters.

Containerization and Environment Setup

  • Created a new Dockerfile docker/Dockerfile.ccl to provide a ready-to-use environment for CCL development and validation, including installation of dependencies, Triton, ROCm tools, and entrypoint scripts for testing.

Documentation Updates

  • Updated examples/README.md to document the new ccl directory and provide usage instructions for benchmarking the all-to-all collective operation. [1] [2]

Test Infrastructure

  • Added a test package initializer for future CCL tests (tests/ccl/__init__.py).

Submission Checklist

@github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Nov 1, 2025
@neoblizz marked this pull request as ready for review on November 1, 2025 05:17
Copilot AI review was requested due to automatic review settings on November 1, 2025 05:17

Copilot AI left a comment

Pull Request Overview

This PR introduces the iris-ccl (Collective Communication Library) module with an all-to-all collective operation. The implementation provides a standalone collective primitive matching PyTorch's RCCL/NCCL interface, enabling efficient multi-rank tensor exchange operations.

  • Adds iris.ccl module with all_to_all collective operation and Config dataclass for kernel parameters
  • Implements persistent kernel using Triton with remote PUT operations for cross-rank communication
  • Includes comprehensive test suite, benchmark utilities, and Docker support for validation

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.

Summary per file:

  • iris/ccl/__init__.py: Module initialization exposing all_to_all and Config
  • iris/ccl/config.py: Configuration dataclass with auto-detection of XCD count and validation
  • iris/ccl/all_to_all.py: Core all-to-all implementation using persistent Triton kernel with remote PUTs
  • tests/ccl/__init__.py: Test module initialization
  • tests/ccl/test_all_to_all.py: Parametrized test suite for all-to-all with various dtypes and sizes
  • examples/ccl/benchmark.py: Benchmark script with validation and performance measurement
  • examples/README.md: Documentation update adding ccl examples
  • docker/Dockerfile.ccl: Docker configuration for ccl validation environment
  • examples/17_gemm_one_shot_all_reduce_pc/gemm_one_shot_all_reduce_pc.py: Synchronization bug fix replacing xchg with add and improving memory fences

if target_rank == cur_rank:
    # Local path: copy input[cur_rank] chunk to output[cur_rank] chunk
    data = tl.load(input_ptr + input_offset_send, mask=mask)
    output_offset_local = rm[:, None] * stride_out_m + (rn[None, :] + cur_rank * N) * stride_out_n

Copilot AI Nov 1, 2025

The .wt (write-through) cache modifier is used here but not for the remote PUTs on lines 97-104. Consider adding a comment explaining why write-through caching is only appropriate for local stores and not remote operations.

Suggested change
output_offset_local = rm[:, None] * stride_out_m + (rn[None, :] + cur_rank * N) * stride_out_n
output_offset_local = rm[:, None] * stride_out_m + (rn[None, :] + cur_rank * N) * stride_out_n
# Use write-through cache modifier only for local stores.
# Write-through caching is not used for remote PUTs because remote memory operations
# may not support cache modifiers and could result in undefined behavior or inefficiency.


# Clone and install Triton
WORKDIR $TRITON_PATH
RUN git clone https://github.com/triton-lang/triton.git $TRITON_PATH

Copilot AI Nov 1, 2025

The git commit hash should include a comment explaining why this specific Triton commit is being used (e.g., compatibility requirements, specific feature needs, or bug fixes).

Suggested change
RUN git clone https://github.com/triton-lang/triton.git $TRITON_PATH
RUN git clone https://github.com/triton-lang/triton.git $TRITON_PATH
# Pin Triton to commit dd58234 for ROCm 6.3 compatibility and MI300X/MI350X support.

@neoblizz requested a review from Copilot on November 3, 2025 16:35

@mawad-amd (Collaborator) left a comment

Looking good so far. Thanks for adding this.


def _worker(local_rank: int, world_size: int, init_url: str, args: dict):
    """Worker function for PyTorch distributed execution."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"

@mawad-amd (Collaborator) commented:

I would really like to separate timing/benchmarking code into the benchmarking directory and have this as an example with no timing or anything. We can discuss if this would not be ideal but I could see a wrapper around all_to_all.
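
For illustration only, such a wrapper could look roughly like the sketch below; the all_to_all signature and argument order are assumptions, not an agreed design.

# Hypothetical timing wrapper kept in the benchmarking directory, so the
# example itself stays free of timing code.
import torch
from iris.ccl import all_to_all

def timed_all_to_all(input_tensor, output_tensor, shmem, config=None, iters=10):
    """Run all_to_all `iters` times and return the average milliseconds per call."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        all_to_all(input_tensor, output_tensor, shmem, config=config)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters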

@neoblizz (Member, Author) replied:

I don't understand how that would work.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@neoblizz requested reviews from Copilot and mawad-amd on November 9, 2025 17:44

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 15 comments.

if not dist.is_initialized():
    pytest.skip("torch.distributed not initialized")

heap_size = 2**33  # 1GB

Copilot AI Nov 9, 2025

The comment states "1GB" but the heap size is 2^33 bytes which equals 8GB. Either the comment should be updated to "8GB" or the heap size should be changed to 2^30 for 1GB.

Suggested change
heap_size = 2**33 # 1GB
heap_size = 2**30 # 1GB

all_reduce_variant: str = "atomic"
all_reduce_distribution: int = 0
all_reduce_num_rings: int = 1
all_reduce_ring_slice_n: int | None = None

Copilot AI Nov 9, 2025

The type hint int | None uses Python 3.10+ union syntax. For compatibility with Python 3.9 and earlier, either use Optional[int] from typing (which is already imported at line 10 for other purposes in this file), or add from __future__ import annotations at the top of the file. Consider using Optional[int] to maintain consistency with common Python practices.

Suggested change
all_reduce_ring_slice_n: int | None = None
all_reduce_ring_slice_n: Optional[int] = None

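For reference, either spelling keeps the dataclass importable on Python 3.9; a minimal sketch using the fields from the snippet above (the rest of the Config dataclass is elided):

# Sketch only: Optional[int] works on Python 3.9; `int | None` needs 3.10+
# or `from __future__ import annotations` at the top of the file.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    all_reduce_variant: str = "atomic"
    all_reduce_distribution: int = 0
    all_reduce_num_rings: int = 1
    all_reduce_ring_slice_n: Optional[int] = None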

# Clone and install Triton
WORKDIR $TRITON_PATH
RUN git clone https://github.com/triton-lang/triton.git $TRITON_PATH

Copilot AI Nov 9, 2025

The git clone on line 33 targets $TRITON_PATH while WORKDIR is already set to $TRITON_PATH on line 32, so although the git checkout on line 34 does run inside the cloned repository, the clone effectively targets the current directory through its absolute path, cloning into itself. This should be either git clone https://github.com/triton-lang/triton.git . or the WORKDIR should be set differently.

Suggested change
RUN git clone https://github.com/triton-lang/triton.git $TRITON_PATH
RUN git clone https://github.com/triton-lang/triton.git .
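
A minimal sketch of the suggested pattern, combining the clone and the pinned checkout in one layer (<commit-sha> is a placeholder, not the commit pinned in this PR):

# Dockerfile sketch; clone into the current WORKDIR explicitly.
WORKDIR $TRITON_PATH
RUN git clone https://github.com/triton-lang/triton.git . \
    && git checkout <commit-sha>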

rn = gl.max_contiguous(gl.multiple_of(rn, BLOCK_SIZE_N), BLOCK_SIZE_N)

# Pre-compute base offsets - maximize VGPR usage by keeping all offsets in registers
row_offsets_m = rm * stride_in_m

Copilot AI Nov 9, 2025

Variable row_offsets_m is not used.

Suggested change
row_offsets_m = rm * stride_in_m


# Pre-compute base offsets - maximize VGPR usage by keeping all offsets in registers
row_offsets_m = rm * stride_in_m
row_offsets_out_m = rm * stride_out_m

Copilot AI Nov 9, 2025

Variable row_offsets_out_m is not used.

Suggested change
row_offsets_out_m = rm * stride_out_m

@neoblizz requested a review from Copilot on November 9, 2025 18:04

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

expected_slice = config.block_size_n // world_size
if slice_n is None or slice_n * world_size != config.block_size_n:
    slice_n = expected_slice
config.all_reduce_ring_slice_n = slice_n

Copilot AI Nov 9, 2025

Mutating the config object passed by the user can lead to unexpected behavior. Consider creating a copy of the config or documenting that the config object may be modified. This could cause issues when the same config is reused across multiple calls with different world_size values.
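
One way to avoid this, sketched with dataclasses.replace (assuming Config is a dataclass, as the PR description indicates, and reusing the field names from the snippet above):

from dataclasses import replace

def _resolve_config(config, world_size):
    """Return a per-call copy of config with all_reduce_ring_slice_n normalized."""
    slice_n = config.all_reduce_ring_slice_n
    if slice_n is None or slice_n * world_size != config.block_size_n:
        slice_n = config.block_size_n // world_size
    # replace() returns a new Config instance; the caller's object is left untouched.
    return replace(config, all_reduce_ring_slice_n=slice_n)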


# Optimization to vectorize the load/store - similar to iris.py
# This enables the compiler to generate dwordx4 or wider loads
# Note: Gluon uses scalar multiples, not 2D tuples like Triton

Copilot AI Nov 9, 2025

[nitpick] Commented-out optimization code should either be removed or have a clear explanation of why it's commented out and under what conditions it should be enabled. If this is a work-in-progress optimization, consider using a TODO comment or feature flag instead.

Suggested change
# Note: Gluon uses scalar multiples, not 2D tuples like Triton
# Note: Gluon uses scalar multiples, not 2D tuples like Triton
# TODO: Enable the following optimization once Gluon supports pointer alignment
# and vectorized memory accesses in the same way as Triton. Currently disabled
# due to potential incompatibility or lack of support in Gluon for these features.

if config.use_gluon and GLUON_AVAILABLE:
    # Check if shmem is Iris Gluon (has get_device_context method)
    if not hasattr(shmem, 'get_device_context'):
        raise ValueError("use_gluon=True requires Iris Gluon context. Use iris.experimental.iris_gluon.iris()")

Copilot AI Nov 9, 2025

The error message suggests using iris.experimental.iris_gluon.iris() but the correct import path is import iris.experimental.iris_gluon as iris_gluon followed by iris_gluon.iris(). Consider updating the message to: "use_gluon=True requires Iris Gluon context. Use iris_gluon.iris() where iris_gluon is imported from iris.experimental.iris_gluon"

Suggested change
raise ValueError("use_gluon=True requires Iris Gluon context. Use iris.experimental.iris_gluon.iris()")
raise ValueError("use_gluon=True requires Iris Gluon context. Use iris_gluon.iris() where iris_gluon is imported from iris.experimental.iris_gluon")
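
For context, the import pattern the message points at would look roughly like this (hedged sketch based on the module path in the error message; the iris() arguments are assumptions):

import iris.experimental.iris_gluon as iris_gluon

# Build the Gluon-capable Iris context instead of the default iris.iris() one.
shmem = iris_gluon.iris(2**33)
assert hasattr(shmem, "get_device_context")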

shmem.barrier()
config = Config(all_reduce_variant=variant)
if variant == "two_shot":
    # Test both distribution modes for two_shot

Copilot AI Nov 9, 2025

The comment says "Test both distribution modes for two_shot" but the code only tests striding mode (distribution=0). Either remove the misleading comment or test both modes (0 and 1) for the two_shot variant. There's a separate test function test_all_reduce_two_shot_distribution that tests both modes, so this comment is misleading.

Suggested change
# Test both distribution modes for two_shot

# Remote store offset: write into target's output at columns [cur_rank*N : (cur_rank+1)*N]
# This is constant for all target_rank iterations since it only depends on cur_rank
output_offset_remote = output_base_m + (output_base_n + cur_rank * N * stride_out_n)
output_ptr_remote = tl.multiple_of(output_ptr + output_offset_remote, (BLOCK_SIZE_M, BLOCK_SIZE_N))

Copilot AI Nov 9, 2025

Variable output_ptr_remote is not used.

Suggested change
output_ptr_remote = tl.multiple_of(output_ptr + output_offset_remote, (BLOCK_SIZE_M, BLOCK_SIZE_N))
