[Bug]: Correctness issue with Example 17 One-shot Producer-consumer kernel #266

@neoblizz

Description

Locks should probably be implemented with atomic_add rather than the way it's done right now. Each rank signals tile_ready by incrementing the lock for that tile by 1 on every rank. Once the tile_ready value for a given tile ID equals the number of GPUs in the world, we exit the while loop and process that tile from all ranks.

for remote_rank in range(world_size):
    iris.atomic_add(tile_ready + tile_id, 1, cur_rank, remote_rank, heap_bases, sem="release", scope="sys")
result = 0
while result < (world_size - 1):
    compare = world_size - 1
    value = 0
    result = iris.atomic_cas(
        tile_ready + tile_id,
        compare,
        value,
        cur_rank,
        cur_rank,
        heap_bases,
        sem="acquire",
        scope="sys",
    )
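
To make the intended counting protocol easier to reason about outside the kernel, here is a minimal host-side sketch that models it with Python threads. Everything in it is an assumption for illustration: `AtomicCounter` and `run_protocol` are hypothetical names, a `threading.Lock` stands in for the hardware atomics, and nothing here calls Iris. One thing to note: the snippet above compares against `world_size - 1` while its loop signals all `world_size` ranks, whereas the description says the counter should equal the number of GPUs. The sketch follows the description and waits for `world_size`. Whether that difference is related to the validation failure is not established here.

```python
import threading
import time


class AtomicCounter:
    """Hypothetical stand-in for one tile_ready slot (not an Iris type)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def atomic_add(self, delta):
        # Fetch-and-add: returns the previous value.
        with self._lock:
            old = self._value
            self._value += delta
            return old

    def atomic_cas(self, compare, value):
        # Stores `value` only if the slot equals `compare`; returns the previous value.
        with self._lock:
            old = self._value
            if old == compare:
                self._value = value
            return old

    def load(self):
        with self._lock:
            return self._value


def run_protocol(world_size, num_tiles):
    # tile_ready[rank][tile_id] models the per-rank counter array.
    tile_ready = [[AtomicCounter() for _ in range(num_tiles)] for _ in range(world_size)]
    consumed = [[False] * num_tiles for _ in range(world_size)]

    def producer(cur_rank):
        for tile_id in range(num_tiles):
            # Signal every rank, including this one, that this rank's tile is ready.
            for remote_rank in range(world_size):
                tile_ready[remote_rank][tile_id].atomic_add(1)

    def consumer(cur_rank):
        for tile_id in range(num_tiles):
            slot = tile_ready[cur_rank][tile_id]
            # Spin until all world_size signals have arrived.
            # The successful CAS also resets the slot to 0.
            while slot.atomic_cas(world_size, 0) != world_size:
                time.sleep(0)  # yield to other threads
            consumed[cur_rank][tile_id] = True

    threads = []
    for rank in range(world_size):
        threads.append(threading.Thread(target=producer, args=(rank,)))
        threads.append(threading.Thread(target=consumer, args=(rank,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    leftover = [[slot.load() for slot in row] for row in tile_ready]
    return consumed, leftover
```

Calling `run_protocol(world_size=4, num_tiles=3)` should return with every entry of `consumed` set to `True` and every counter back at 0, since each slot receives exactly `world_size` increments before its single successful CAS.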

However, the atomic_add version also fails validation, so there is potentially another bug:

python examples/17_gemm_one_shot_all_reduce_pc/benchmark.py --benchmark --num_ranks 8 -m 3840 -n 3840 -k 4352 --datatype "bf16" --gemm_sms 256 --comm_sms 48 --BLK_M 256 --BLK_N 64 --BLK_K 64 --validate
[Iris] [7/8] Validating...
[Iris] [2/8] Validating...
[Iris] [3/8] Validating...
[Iris] [4/8] Validating...
[Iris] [6/8] Validating...
[Iris] [0/8] Validating...
[Iris] [1/8] Validating...
[Iris] [5/8] Validating...
[Iris] [3/8] Max absolute difference: 346.0
[Iris] [7/8] Max absolute difference: 346.0
[Iris] [6/8] Max absolute difference: 346.0
[Iris] [4/8] Max absolute difference: 346.0
[Iris] [0/8] Max absolute difference: 346.0
[Iris] [1/8] Max absolute difference: 346.0
[Iris] [2/8] Max absolute difference: 346.0
[Iris] [5/8] Max absolute difference: 346.0
[Iris] [7/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [0/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [3/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [4/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [2/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [7/8] Final C validation failed.
[Iris] [5/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [0/8] Final C validation failed.
[Iris] [6/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [1/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [3/8] Final C validation failed.
[Iris] [4/8] Final C validation failed.
[Iris] [2/8] Final C validation failed.
[Iris] [5/8] Final C validation failed.
[Iris] [1/8] Final C validation failed.
[Iris] [6/8] Final C validation failed.

To do: fix the lock-and-wait mechanism, and fix the bug that leaves some parts of C at 0.0.
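
As a possible debugging aid for the zeroed regions, the sketch below lists which output tiles are entirely 0.0, so a mismatch such as the one at index (0, 0) can be mapped back to a tile. It is only an illustration: `find_zero_tiles` is a hypothetical helper, it assumes C has been copied to the host as a nested list of floats, and it assumes tiles are laid out in a plain row-major BLK_M x BLK_N grid, which may not match the kernel's actual tile mapping.

```python
def find_zero_tiles(C, blk_m, blk_n):
    """Return (tile_row, tile_col) for every tile of C whose entries are all 0.0."""
    rows = len(C)
    cols = len(C[0]) if rows else 0
    zero_tiles = []
    for r0 in range(0, rows, blk_m):
        for c0 in range(0, cols, blk_n):
            # Edge tiles are clipped to the matrix bounds.
            tile_is_zero = all(
                C[r][c] == 0.0
                for r in range(r0, min(r0 + blk_m, rows))
                for c in range(c0, min(c0 + blk_n, cols))
            )
            if tile_is_zero:
                zero_tiles.append((r0 // blk_m, c0 // blk_n))
    return zero_tiles


# Toy 4x4 matrix with 2x2 tiles; only the top-left tile was never written.
C = [[192.0] * 4 for _ in range(4)]
for r in range(2):
    for c in range(2):
        C[r][c] = 0.0

print(find_zero_tiles(C, 2, 2))  # [(0, 0)]
```

If the zero tiles turn out to cluster by tile ID or by rank, that would hint at whether the consumer is reading tiles before they are written or whether some tiles are never produced at all.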


Labels: bug, examples, iris
