Open
Labels
bug (Something isn't working), examples (Examples showcasing Iris APIs and usage), iris (Iris project issue)
Description
Locks should potentially be implemented with atomic_add instead of the way they are done right now. Each rank signals tile_ready by incrementing the lock for that tile by 1; once the tile_ready value for that tile ID equals the number of GPUs in the world, we exit the while loop and process that tile from all ranks.
for remote_rank in range(world_size):
    iris.atomic_add(tile_ready + tile_id, 1, cur_rank, remote_rank, heap_bases, sem="release", scope="sys")

result = 0
while result < (world_size - 1):
    compare = world_size - 1
    value = 0
    result = iris.atomic_cas(
        tile_ready + tile_id,
        compare,
        value,
        cur_rank,
        cur_rank,
        heap_bases,
        sem="acquire",
        scope="sys",
    )

However, this too results in an issue, so there is potentially another bug:
python examples/17_gemm_one_shot_all_reduce_pc/benchmark.py --benchmark --num_ranks 8 -m 3840 -n 3840 -k 4352 --datatype "bf16" --gemm_sms 256 --comm_sms 48 --BLK_M 256 --BLK_N 64 --BLK_K 64 --validate
[Iris] [7/8] Validating...
[Iris] [2/8] Validating...
[Iris] [3/8] Validating...
[Iris] [4/8] Validating...
[Iris] [6/8] Validating...
[Iris] [0/8] Validating...
[Iris] [1/8] Validating...
[Iris] [5/8] Validating...
[Iris] [3/8] Max absolute difference: 346.0
[Iris] [7/8] Max absolute difference: 346.0
[Iris] [6/8] Max absolute difference: 346.0
[Iris] [4/8] Max absolute difference: 346.0
[Iris] [0/8] Max absolute difference: 346.0
[Iris] [1/8] Max absolute difference: 346.0
[Iris] [2/8] Max absolute difference: 346.0
[Iris] [5/8] Max absolute difference: 346.0
[Iris] [7/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [0/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [3/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [4/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [2/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [7/8] Final C validation failed.
[Iris] [5/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [0/8] Final C validation failed.
[Iris] [6/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [1/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [3/8] Final C validation failed.
[Iris] [4/8] Final C validation failed.
[Iris] [2/8] Final C validation failed.
[Iris] [5/8] Final C validation failed.
[Iris] [1/8] Final C validation failed.
[Iris] [6/8] Final C validation failed.
Fix the lock/wait mechanism, and fix the bug where some parts of C are left as 0.0.
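
One thing that may be worth checking in the atomic_add snippet: the signaling loop runs over range(world_size), so each tile_ready slot receives world_size increments (the local rank included), while the wait loop compares against world_size - 1. The sketch below is a host-side model only, assuming plain Python integers in place of device memory. The names atomic_add, atomic_cas, and signals_seen_at_exit are hypothetical stand-ins written for this illustration, not Iris APIs. It replays a single interleaving in which the waiter polls after every signal and reports how many ranks had signaled when the wait loop exits.

```python
def atomic_add(counters, idx, value):
    """Host-side stand-in for an atomic add; returns the old value."""
    old = counters[idx]
    counters[idx] = old + value
    return old


def atomic_cas(counters, idx, compare, value):
    """Host-side stand-in for compare-and-swap; returns the old value."""
    old = counters[idx]
    if old == compare:
        counters[idx] = value
    return old


def signals_seen_at_exit(world_size, threshold):
    """Replay one interleaving: the waiter polls once after every signal.

    Returns how many ranks had signaled when the wait loop exited,
    or None if it never exited.
    """
    tile_ready = [0]  # one counter slot, modeling tile_ready + tile_id
    for signaled in range(1, world_size + 1):
        atomic_add(tile_ready, 0, 1)                      # one rank signals
        result = atomic_cas(tile_ready, 0, threshold, 0)  # waiter polls
        if not (result < threshold):                      # same exit test as the issue's loop
            return signaled
    return None


world_size = 8
early = signals_seen_at_exit(world_size, world_size - 1)  # threshold used in the snippet
full = signals_seen_at_exit(world_size, world_size)       # threshold matching the description
print(early, full)  # prints: 7 8
```

Under this model, a threshold of world_size - 1 lets the waiter leave one signal early, and the late increment then leaves the counter at 1 rather than 0 for the next use of that slot. Whether this accounts for the zeros in C has not been verified here.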