19 changes: 19 additions & 0 deletions examples/22_rs_rmsnorm_fp8quant_ag/README.md
@@ -0,0 +1,19 @@
<!--
SPDX-License-Identifier: MIT
Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
-->

# Reduce-Scatter → RMSNorm → FP8 Quantization → All-Gather benchmark using Iris
This example implements a complete tensor processing pipeline across multiple GPUs:

1. **Reduce-Scatter**: Sum tensors across all GPUs and distribute shards
2. **RMSNorm**: Apply Root Mean Square normalization to each shard
3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)

Copilot AI Nov 6, 2025

Line 11 contains a formatting issue where item 4 runs onto the same line as item 3 without a line break. This should be split into separate lines for proper markdown list formatting.

Suggested change
- 3. **FP8 Quantization**: Quantize to 8-bit floating point (optional) 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)
+ 3. **FP8 Quantization**: Quantize to 8-bit floating point (optional)
+ 4. **All-Gather**: Reconstruct the full tensor across all GPUs (optional)


## Usage

```terminal
python benchmark.py \
    --num_rows 8192 --num_cols 7168 --num_ranks 8 \
    --benchmark --fp8_out --all_gather \
    --BLOCK_M 16 --BLOCK_N 64 --num_warps 16 --num_stages 4 --waves_per_eu 4 \
    --rmsnorm_block_size 1024 --rmsnorm_num_warps 8 --rmsnorm_num_prgms 1024 --rmsnorm_waves_per_eu 2 \
    --fp8_block_m 64 --fp8_block_n 64 --fp8_num_warps 4 --fp8_num_stages 2 --fp8_waves_per_eu 2 \
    --ag_block_m 64 --ag_block_n 64 --ag_num_warps 8 --ag_num_stages 3 --ag_waves_per_eu 2 \
    --validate
```
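
For orientation, the four stages that the flags above configure can be sketched as a single-process, pure-Python reference. This is not the Iris implementation and uses no Iris API. It assumes row-wise sharding, one scale factor per shard, and an e4m3-style FP8 format with a maximum magnitude of 448, none of which this README confirms. All function names are hypothetical.

```python
import math

FP8_MAX = 448.0  # assumption: e4m3-style maximum magnitude


def reduce_scatter(per_rank):
    """Sum the per-rank matrices elementwise; rank r keeps the r-th block of rows."""
    world_size = len(per_rank)
    rows = len(per_rank[0])
    shard_rows = rows // world_size  # assumes rows divide evenly across ranks
    summed = [
        [sum(col) for col in zip(*(t[i] for t in per_rank))]
        for i in range(rows)
    ]
    return [summed[r * shard_rows:(r + 1) * shard_rows] for r in range(world_size)]


def rms_norm(shard, weight, eps=1e-6):
    """Scale each row by the reciprocal of its root mean square, then by weight."""
    out = []
    for row in shard:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms * w for v, w in zip(row, weight)])
    return out


def round_to_fp8(x):
    """Round to 4 significant bits (1 implicit + 3 mantissa) and clamp.

    Simplification: subnormals and the limited exponent range are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16
    return max(-FP8_MAX, min(FP8_MAX, math.ldexp(m, e)))


def quantize_fp8(shard):
    """Return (quantized shard, scale) such that quantized * scale approximates shard."""
    amax = max(abs(v) for row in shard for v in row)
    scale = amax / FP8_MAX if amax > 0 else 1.0
    return [[round_to_fp8(v / scale) for v in row] for row in shard], scale


def all_gather(shards):
    """Give every rank a copy of the list of all shards."""
    return [list(shards) for _ in shards]


def run_pipeline(per_rank, weight):
    shards = reduce_scatter(per_rank)
    normed = [rms_norm(s, weight) for s in shards]
    quantized = [quantize_fp8(s) for s in normed]
    return all_gather(quantized)


if __name__ == "__main__":
    # Two simulated ranks, each holding a 4 x 4 input matrix.
    inputs = [[[float(r + i + j + 1) for j in range(4)] for i in range(4)]
              for r in range(2)]
    for q_shard, scale in run_pipeline(inputs, [1.0] * 4)[0]:
        print(scale, q_shard)
```

A reference like this is mainly useful as a mental model of what `--validate` would need to compare against. The real kernels run the stages on GPU shards in parallel.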

The benchmark measures the bandwidth of each GPU receiving data from all other GPUs. Each GPU performs a load operation from every other GPU in the system, and the total bandwidth is calculated based on the total amount of data received and the time taken.
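
Read literally, the paragraph above amounts to dividing the bytes a GPU receives by the elapsed time. A minimal sketch, assuming each GPU loads the same number of bytes from every peer. The function and parameter names are hypothetical and do not come from `benchmark.py`.

```python
def received_bandwidth_gb_s(bytes_per_peer, num_ranks, elapsed_s):
    """Bandwidth seen by one GPU that loads bytes_per_peer from each other rank."""
    total_bytes = bytes_per_peer * (num_ranks - 1)  # a GPU does not load from itself
    return total_bytes / elapsed_s / 1e9  # decimal gigabytes per second
```

For example, `received_bandwidth_gb_s(10**9, 8, 1.0)` counts seven peers' worth of data for an 8-rank run.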