Add reduce_scatter and All_gather benchmark #279
base: main

Conversation
Pull Request Overview
This PR introduces a new example (example 22) demonstrating a complete multi-GPU tensor processing pipeline using Iris. The pipeline combines reduce-scatter, RMSNorm, FP8 quantization, and all-gather operations for distributed tensor processing on AMD GPUs.
Key changes:
- Implements distributed tensor processing with IRIS remote memory access operations
- Provides both a standalone script and comprehensive benchmark suite
- Includes validation against PyTorch reference implementations
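For orientation, here is a minimal single-GPU PyTorch sketch of the math applied to each shard between the two collectives; the function name and the `eps` default are illustrative, and the PR's actual kernels are written in Triton:

```python
import torch

def rmsnorm_fp8_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # RMSNorm: normalize each row by its root-mean-square, then apply the gain.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    y = (x / rms) * weight
    # Per-tensor FP8 scale chosen so values fit e4m3fn's ~448 max magnitude.
    scale = max(y.abs().max().item() / 448.0, 1e-8)
    return (y / scale).to(torch.float8_e4m3fn), scale
```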
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| examples/22_rs_rmsnorm_fp8quant_ag/reduce_scatter_rmsnorm_quant.py | Main implementation with Triton kernels for reduce-scatter, RMSNorm, FP8 quantization, and all-gather operations |
| examples/22_rs_rmsnorm_fp8quant_ag/benchmark.py | Comprehensive benchmarking suite with multi-process spawning, performance timing, and validation |
| examples/22_rs_rmsnorm_fp8quant_ag/README.md | Documentation with usage examples and pipeline description |
From examples/22_rs_rmsnorm_fp8quant_ag/README.md:

```markdown
1. **Reduce-Scatter**: Sum tensors across all GPUs and distribute shards
2. **RMSNorm**: Apply Root Mean Square normalization to each shard
3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
```
Copilot AI · Nov 6, 2025
Line 11 contains a formatting issue where item 4 runs onto the same line as item 3 without a line break. This should be split into separate lines for proper markdown list formatting.
Suggested change:

```diff
-3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
+3. **FP8 Quantization**: Quantize to 8-bit floating point (optional)
+4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
```
Code under review:

```python
max_val = input_tensor.abs().max().item()
scale = max(max_val / 448.0, 1e-8)
scale_tensor = torch.tensor([scale], device=device, dtype=torch.float32)
```
Copilot AI · Nov 6, 2025
The run_quantize_fp8 function has a hardcoded num_warps=16 value on line 226, but it should use the user-configurable parameters passed to the function. According to the command-line arguments (lines 91, 451), the default for FP8 quantization should be 4, not 16. The main script also uses num_warps=4 for FP8 quantization (line 522). This function should accept and use the FP8-specific parameters like the benchmark loop does (lines 889-893, 921).
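A sketch of the suggested direction, assuming the launch looks roughly like the benchmark loop's; the kernel body, `BLOCK` size, and helper names are illustrative rather than the PR's actual code, and `waves_per_eu` is an AMD-specific Triton launch option:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _quantize_fp8_kernel(x_ptr, out_ptr, scale_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    scale = tl.load(scale_ptr)
    tl.store(out_ptr + offs, (x / scale).to(tl.float8e4nv), mask=mask)

def run_quantize_fp8(input_tensor, num_warps=4, num_stages=2, waves_per_eu=0):
    # Tuning knobs are accepted as parameters (defaults matching the CLI)
    # and forwarded to the launch instead of hardcoding num_warps=16.
    max_val = input_tensor.abs().max().item()
    scale = max(max_val / 448.0, 1e-8)
    scale_tensor = torch.tensor([scale], device=input_tensor.device, dtype=torch.float32)
    output = torch.empty_like(input_tensor, dtype=torch.float8_e4m3fn)
    n = input_tensor.numel()
    grid = (triton.cdiv(n, 1024),)
    _quantize_fp8_kernel[grid](
        input_tensor, output, scale_tensor, n, BLOCK=1024,
        num_warps=num_warps, num_stages=num_stages, waves_per_eu=waves_per_eu,
    )
    return output, scale_tensor
```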
Code under review:

```python
    output = torch.empty(M_shard, N, device=device, dtype=torch.float8_e4m3fn)
else:
    output = torch.empty_like(input_tensor)
```
Copilot AI · Nov 6, 2025
[nitpick] This function signature is extremely long with 14 parameters on a single line (extending beyond typical line length limits). Consider reformatting with one parameter per line or grouping related parameters for better readability and maintainability.
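For illustration, the kind of reformatting the reviewer means; the parameter names below are placeholders, not the function's actual signature:

```python
def run_pipeline(
    input_tensor,
    weight,
    heap_bases,        # grouped: distributed-memory handles
    rank,
    world_size,
    eps=1e-6,          # grouped: numeric options
    use_fp8=True,
    num_warps=8,       # grouped: kernel tuning knobs
    num_stages=3,
    waves_per_eu=2,
):
    ...
```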
Code under review:

```python
final_num_warps = num_warps if num_warps is not None else 8

# Set waves_per_eu (default to 2)
final_waves_per_eu = waves_per_eu if waves_per_eu is not None else 2
```
Copilot AI · Nov 6, 2025
The run_quantize_fp8 function signature is missing the FP8-specific tuning parameters (num_warps, num_stages, waves_per_eu) that are available in command-line arguments and used in the benchmarking section (lines 889-893). This inconsistency means users cannot configure these parameters when calling this function, limiting its flexibility. Consider adding these parameters with defaults matching the documented values (num_warps=4, num_stages=2, waves_per_eu=0).
Code under review:

```python
    num_warps=8,
    num_stages=3,
    waves_per_eu=2,
)
```
Copilot AI · Nov 6, 2025
Variable `result` is not used.
Code under review:

```python
import argparse
import json
import os
```
Copilot AI · Nov 6, 2025
Import of `os` is not used.

Suggested change:

```diff
-import os
```
Code under review:

```python
import json
import os
import random
import sys
```
Copilot AI · Nov 6, 2025
Import of `sys` is not used.

Suggested change:

```diff
-import sys
```
Code under review:

```python
import os
import random
import sys
import time
```
Copilot AI · Nov 6, 2025
Import of `time` is not used.

Suggested change:

```diff
-import time
```
This example implements an alternative all-reduce across multiple GPUs:
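The pipeline rests on the identity that an all-reduce is equivalent to a reduce-scatter followed by an all-gather, which lets the RMSNorm and FP8 steps run on the small per-rank shard in between. A minimal torch.distributed sketch of that decomposition, independent of Iris and assuming a process group is already initialized:

```python
import torch
import torch.distributed as dist

def all_reduce_via_rs_ag(x: torch.Tensor) -> torch.Tensor:
    # Assumes dist.init_process_group() has run and x.shape[0]
    # is divisible by the world size.
    world = dist.get_world_size()
    shard = x.new_empty(x.shape[0] // world, *x.shape[1:])
    # Reduce-scatter: each rank receives the elementwise sum of one shard.
    dist.reduce_scatter_tensor(shard, x)
    # (The example applies RMSNorm and optional FP8 quantization here.)
    # All-gather: reassemble the full reduced tensor on every rank.
    out = torch.empty_like(x)
    dist.all_gather_into_tensor(out, shard)
    return out
```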