Conversation

@xiaohuguo2023 (Member)

This example implements an alternative all-reduce across multiple GPUs (a plain-PyTorch reference sketch follows the list):

  1. Reduce-Scatter: Sum tensors across all GPUs and distribute shards
  2. RMSNorm: Apply Root Mean Square normalization to each shard
  3. FP8 Quantization: Quantize to 8-bit floating point (optional)
  4. All-Gather: Reconstruct the full tensor across all GPUs (optional)
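
For orientation, here is a minimal sketch of the same four stages written against plain torch.distributed collectives rather than the Iris remote-memory kernels the example actually uses. This is a reference shape only, not the PR's implementation; it assumes an initialized process group, a row count divisible by the world size, and backend support for float8 collectives:

```python
import torch
import torch.distributed as dist

def reference_pipeline(x: torch.Tensor, eps: float = 1e-6, use_fp8: bool = True) -> torch.Tensor:
    """Reference-only sketch: reduce-scatter -> RMSNorm -> FP8 quant -> all-gather."""
    world_size = dist.get_world_size()

    # 1. Reduce-Scatter: sum x across all ranks; each rank keeps one shard of rows.
    shard = torch.empty(x.shape[0] // world_size, x.shape[1],
                        device=x.device, dtype=x.dtype)
    dist.reduce_scatter_tensor(shard, x, op=dist.ReduceOp.SUM)

    # 2. RMSNorm on the local shard: y = x / sqrt(mean(x^2) + eps) (weight omitted).
    rms = torch.sqrt(shard.pow(2).mean(dim=-1, keepdim=True) + eps)
    shard = shard / rms

    # 3. Optional FP8 quantization; 448 is the max representable value of e4m3fn.
    if use_fp8:
        scale = max(shard.abs().max().item() / 448.0, 1e-8)
        shard = (shard / scale).to(torch.float8_e4m3fn)

    # 4. Optional All-Gather: every rank reconstructs the full tensor.
    out = torch.empty(x.shape[0], x.shape[1], device=x.device, dtype=shard.dtype)
    dist.all_gather_into_tensor(out, shard)
    return out
```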

Copilot AI review requested due to automatic review settings (November 6, 2025, 11:58)

Copilot AI left a comment


Pull Request Overview

This PR introduces a new example (example 22) demonstrating a complete multi-GPU tensor processing pipeline using Iris. The pipeline combines reduce-scatter, RMSNorm, FP8 quantization, and all-gather operations for distributed tensor processing on AMD GPUs.

Key changes:

  • Implements distributed tensor processing with Iris remote memory access operations
  • Provides both a standalone script and a comprehensive benchmark suite
  • Includes validation against PyTorch reference implementations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| examples/22_rs_rmsnorm_fp8quant_ag/reduce_scatter_rmsnorm_quant.py | Main implementation with Triton kernels for reduce-scatter, RMSNorm, FP8 quantization, and all-gather operations |
| examples/22_rs_rmsnorm_fp8quant_ag/benchmark.py | Comprehensive benchmarking suite with multi-process spawning, performance timing, and validation |
| examples/22_rs_rmsnorm_fp8quant_ag/README.md | Documentation with usage examples and pipeline description |


1. **Reduce-Scatter**: Sum tensors across all GPUs and distribute shards
2. **RMSNorm**: Apply Root Mean Square normalization to each shard
3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)

Copilot AI Nov 6, 2025


Line 11 contains a formatting issue where item 4 runs onto the same line as item 3 without a line break. This should be split into separate lines for proper markdown list formatting.

Suggested change
- 3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
+ 3. **FP8 Quantization**: Quantize to 8-bit floating point (optional)
+ 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)


max_val = input_tensor.abs().max().item()
scale = max(max_val / 448.0, 1e-8)
scale_tensor = torch.tensor([scale], device=device, dtype=torch.float32)

Copilot AI Nov 6, 2025


The run_quantize_fp8 function has a hardcoded num_warps=16 value on line 226, but it should use the user-configurable parameters passed to the function. According to the command-line arguments (lines 91, 451), the default for FP8 quantization should be 4, not 16. The main script also uses num_warps=4 for FP8 quantization (line 522). This function should accept and use the FP8-specific parameters like the benchmark loop does (lines 889-893, 921).
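
As an aside on the quoted snippet itself: 448.0 is the maximum representable magnitude of torch.float8_e4m3fn, so dividing by scale maps the tensor's absolute maximum onto the format's full range, and the 1e-8 floor keeps the scale nonzero for an all-zero input. A minimal sketch of the quantize/dequantize round trip this scale supports (variable names carried over from the snippet; not the PR's kernel):

```python
# Quantize: divide by the scale so |values| land within e4m3fn's range, then cast.
quantized = (input_tensor / scale_tensor).to(torch.float8_e4m3fn)
# Dequantize (e.g. for validation): cast back up and undo the scale.
dequantized = quantized.to(torch.float32) * scale_tensor
```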

    output = torch.empty(M_shard, N, device=device, dtype=torch.float8_e4m3fn)
else:
    output = torch.empty_like(input_tensor)


Copilot AI Nov 6, 2025


[nitpick] This function signature is extremely long with 14 parameters on a single line (extending beyond typical line length limits). Consider reformatting with one parameter per line or grouping related parameters for better readability and maintainability.

final_num_warps = num_warps if num_warps is not None else 8

# Set waves_per_eu (default to 2)
final_waves_per_eu = waves_per_eu if waves_per_eu is not None else 2

Copilot AI Nov 6, 2025


The run_quantize_fp8 function signature is missing the FP8-specific tuning parameters (num_warps, num_stages, waves_per_eu) that are available in command-line arguments and used in the benchmarking section (lines 889-893). This inconsistency means users cannot configure these parameters when calling this function, limiting its flexibility. Consider adding these parameters with defaults matching the documented values (num_warps=4, num_stages=2, waves_per_eu=0).
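
A minimal sketch of the suggested change (signature abridged and hypothetical; the defaults are the documented values cited above):

```python
# Hypothetical sketch: accept the FP8 tuning knobs instead of hardcoding
# num_warps=16 inside the function body.
def run_quantize_fp8(
    input_tensor,
    output,
    num_warps=4,       # documented FP8 default
    num_stages=2,
    waves_per_eu=0,
):
    ...  # launch the FP8 quantization kernel with these parameters
```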

    num_warps=8,
    num_stages=3,
    waves_per_eu=2,
)

Copilot AI Nov 6, 2025


Variable result is not used.


import argparse
import json
import os

Copilot AI Nov 6, 2025


Import of 'os' is not used.

Suggested change
- import os

import json
import os
import random
import sys

Copilot AI Nov 6, 2025


Import of 'sys' is not used.

Suggested change
- import sys

import os
import random
import sys
import time

Copilot AI Nov 6, 2025


Import of 'time' is not used.

Suggested change
- import time
