Add reduce_scatter and All_gather benchmark #279
base: main

Conversation
Pull Request Overview
This PR introduces a new example (example 22) demonstrating a complete multi-GPU tensor processing pipeline using Iris. The pipeline combines reduce-scatter, RMSNorm, FP8 quantization, and all-gather operations for distributed tensor processing on AMD GPUs.
Key changes:
- Implements distributed tensor processing with IRIS remote memory access operations
- Provides both a standalone script and comprehensive benchmark suite
- Includes validation against PyTorch reference implementations
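For orientation, here is a minimal single-GPU PyTorch sketch of the math applied to each shard between the two collectives; the function name and the `eps` default are illustrative, and the PR's actual kernels are written in Triton:

```python
import torch

def rmsnorm_fp8_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # RMSNorm: normalize each row by its root-mean-square, then apply the gain.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    y = (x / rms) * weight
    # Per-tensor FP8 scale chosen so values fit e4m3fn's ~448 max magnitude.
    scale = max(y.abs().max().item() / 448.0, 1e-8)
    return (y / scale).to(torch.float8_e4m3fn), scale
```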
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| examples/22_rs_rmsnorm_fp8quant_ag/reduce_scatter_rmsnorm_quant.py | Main implementation with Triton kernels for reduce-scatter, RMSNorm, FP8 quantization, and all-gather operations |
| examples/22_rs_rmsnorm_fp8quant_ag/benchmark.py | Comprehensive benchmarking suite with multi-process spawning, performance timing, and validation |
| examples/22_rs_rmsnorm_fp8quant_ag/README.md | Documentation with usage examples and pipeline description |
From examples/22_rs_rmsnorm_fp8quant_ag/README.md:

```markdown
1. **Reduce-Scatter**: Sum tensors across all GPUs and distribute shards
2. **RMSNorm**: Apply Root Mean Square normalization to each shard
3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
```
Copilot AI · Nov 6, 2025
Line 11 contains a formatting issue where item 4 runs onto the same line as item 3 without a line break. This should be split into separate lines for proper markdown list formatting.
Suggested change:

```diff
-3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
+3. **FP8 Quantization**: Quantize to 8-bit floating point (optional)
+4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
```
Code under review:

```python
max_val = input_tensor.abs().max().item()
scale = max(max_val / 448.0, 1e-8)
scale_tensor = torch.tensor([scale], device=device, dtype=torch.float32)
```
Copilot AI · Nov 6, 2025
The run_quantize_fp8 function has a hardcoded num_warps=16 value on line 226, but it should use the user-configurable parameters passed to the function. According to the command-line arguments (lines 91, 451), the default for FP8 quantization should be 4, not 16. The main script also uses num_warps=4 for FP8 quantization (line 522). This function should accept and use the FP8-specific parameters like the benchmark loop does (lines 889-893, 921).
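A sketch of the suggested direction, assuming the launch looks roughly like the benchmark loop's; the kernel body, `BLOCK` size, and helper names are illustrative rather than the PR's actual code, and `waves_per_eu` is an AMD-specific Triton launch option:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _quantize_fp8_kernel(x_ptr, out_ptr, scale_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    scale = tl.load(scale_ptr)
    tl.store(out_ptr + offs, (x / scale).to(tl.float8e4nv), mask=mask)

def run_quantize_fp8(input_tensor, num_warps=4, num_stages=2, waves_per_eu=0):
    # Tuning knobs are accepted as parameters (defaults matching the CLI)
    # and forwarded to the launch instead of hardcoding num_warps=16.
    max_val = input_tensor.abs().max().item()
    scale = max(max_val / 448.0, 1e-8)
    scale_tensor = torch.tensor([scale], device=input_tensor.device, dtype=torch.float32)
    output = torch.empty_like(input_tensor, dtype=torch.float8_e4m3fn)
    n = input_tensor.numel()
    grid = (triton.cdiv(n, 1024),)
    _quantize_fp8_kernel[grid](
        input_tensor, output, scale_tensor, n, BLOCK=1024,
        num_warps=num_warps, num_stages=num_stages, waves_per_eu=waves_per_eu,
    )
    return output, scale_tensor
```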
Code under review:

```python
    output = torch.empty(M_shard, N, device=device, dtype=torch.float8_e4m3fn)
else:
    output = torch.empty_like(input_tensor)
```
Copilot AI · Nov 6, 2025
[nitpick] This function signature is extremely long with 14 parameters on a single line (extending beyond typical line length limits). Consider reformatting with one parameter per line or grouping related parameters for better readability and maintainability.
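For illustration, the kind of reformatting the reviewer means; the parameter names below are placeholders, not the function's actual signature:

```python
def run_pipeline(
    input_tensor,
    weight,
    heap_bases,        # grouped: distributed-memory handles
    rank,
    world_size,
    eps=1e-6,          # grouped: numeric options
    use_fp8=True,
    num_warps=8,       # grouped: kernel tuning knobs
    num_stages=3,
    waves_per_eu=2,
):
    ...
```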
Code under review:

```python
final_num_warps = num_warps if num_warps is not None else 8

# Set waves_per_eu (default to 2)
final_waves_per_eu = waves_per_eu if waves_per_eu is not None else 2
```
Copilot AI · Nov 6, 2025
The run_quantize_fp8 function signature is missing the FP8-specific tuning parameters (num_warps, num_stages, waves_per_eu) that are available in command-line arguments and used in the benchmarking section (lines 889-893). This inconsistency means users cannot configure these parameters when calling this function, limiting its flexibility. Consider adding these parameters with defaults matching the documented values (num_warps=4, num_stages=2, waves_per_eu=0).
Code under review:

```python
    num_warps=8,
    num_stages=3,
    waves_per_eu=2,
)
```
Copilot AI · Nov 6, 2025
Variable `result` is not used.
Code under review:

```python
import argparse
import json
import os
```
Copilot AI · Nov 6, 2025
Import of `os` is not used.

Suggested change:

```diff
-import os
```
Code under review:

```python
import json
import os
import random
import sys
```
Copilot AI · Nov 6, 2025
Import of `sys` is not used.

Suggested change:

```diff
-import sys
```
Code under review:

```python
import os
import random
import sys
import time
```
Copilot AI · Nov 6, 2025
Import of `time` is not used.

Suggested change:

```diff
-import time
```
This example implements an alternative all-reduce across multiple GPUs:
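The pipeline rests on the identity that an all-reduce is equivalent to a reduce-scatter followed by an all-gather, which lets the RMSNorm and FP8 steps run on the small per-rank shard in between. A minimal torch.distributed sketch of that decomposition, independent of Iris and assuming a process group is already initialized:

```python
import torch
import torch.distributed as dist

def all_reduce_via_rs_ag(x: torch.Tensor) -> torch.Tensor:
    # Assumes dist.init_process_group() has run and x.shape[0]
    # is divisible by the world size.
    world = dist.get_world_size()
    shard = x.new_empty(x.shape[0] // world, *x.shape[1:])
    # Reduce-scatter: each rank receives the elementwise sum of one shard.
    dist.reduce_scatter_tensor(shard, x)
    # (The example applies RMSNorm and optional FP8 quantization here.)
    # All-gather: reassemble the full reduced tensor on every rank.
    out = torch.empty_like(x)
    dist.all_gather_into_tensor(out, shard)
    return out
```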