Bug
In iris/ops/matmul_all_reduce.py, matmul_all_reduce_preamble performs both:
- Workspace buffer allocation: shmem.zeros() for locks and aux_buffer, which is a collective operation (all ranks must call it together)
- Per-call preparation: C.zero_(), shmem.barrier()
If the workspace matches and can be reused on some ranks but not on others, only some ranks call the preamble while the rest skip it. shmem.zeros is then called from only a subset of ranks, and the collective deadlocks.
Additionally, the preamble zeros the lock array and calls shmem.barrier() on every call. For lock-based variants (one_shot, two_shot), this overhead is unnecessary if versioned locks are used instead.
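To illustrate why versioned locks remove the need to re-zero the lock array, here is a minimal host-side sketch. It is only a model of the idea: the class VersionedLocks and its methods are hypothetical names, and the real kernels would use device-side atomics rather than a Python list.

```python
class VersionedLocks:
    """Hypothetical host-side model of versioned locks (not the iris API)."""

    def __init__(self, num_tiles):
        # Allocated once; never reset between calls.
        self.flags = [0] * num_tiles
        self.version = 0

    def begin_call(self):
        # Each kernel launch gets a fresh, strictly increasing version.
        self.version += 1
        return self.version

    def signal(self, tile, version):
        # Producer marks the tile as ready for this specific call.
        self.flags[tile] = version

    def is_ready(self, tile, version):
        # A flag left over from an earlier call holds an older version,
        # so it is never mistaken for "ready" in the current call.
        return self.flags[tile] == version


locks = VersionedLocks(num_tiles=4)

v1 = locks.begin_call()
locks.signal(0, v1)
first_call_ready = locks.is_ready(0, v1)

v2 = locks.begin_call()
stale_seen_as_ready = locks.is_ready(0, v2)  # flag still holds v1
```

Because a stale flag never compares equal to the current version, the per-call zeroing and the barrier that guarded it are not needed in this model.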
Impact
Deadlock when workspace is reused across different problem sizes or when ranks take different code paths.
Fix
Separate allocation from preparation:
- _allocate_workspace(): only called when the workspace doesn't match (shape/variant changed). Handles the collective shmem.zeros for locks and aux_buffer.
- _pre_kernel_sync(): called every time, but variant-specific:
  - atomic/spinlock: C.zero_() + stream-level barrier
  - one_shot/two_shot: no-op (versioned locks + overwrite semantics)
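The split described above could look roughly like the following sketch. Only the names _allocate_workspace and _pre_kernel_sync come from the proposal. FakeShmem, Workspace, prepare, and stream_barrier are hypothetical stand-ins so the example runs without iris or a GPU, and plain lists stand in for tensors.

```python
from dataclasses import dataclass


class FakeShmem:
    """Hypothetical stand-in for the shmem object; it only counts calls."""

    def __init__(self):
        self.zeros_calls = 0
        self.barrier_calls = 0

    def zeros(self, n):
        # Collective in the real library: every rank must reach this call.
        self.zeros_calls += 1
        return [0] * n

    def stream_barrier(self):
        # Stand-in for the stream-level barrier named in the proposal.
        self.barrier_calls += 1


@dataclass
class Workspace:
    shape: tuple = None
    variant: str = None
    locks: list = None
    aux_buffer: list = None

    def matches(self, shape, variant):
        return self.shape == shape and self.variant == variant


def _allocate_workspace(shmem, ws, shape, variant, num_tiles):
    # Collective allocation: runs only when shape or variant changed.
    ws.locks = shmem.zeros(num_tiles)
    ws.aux_buffer = shmem.zeros(shape[0] * shape[1])
    ws.shape, ws.variant = shape, variant


def _pre_kernel_sync(shmem, C, variant):
    # Runs on every call; contains no collective allocation.
    if variant in ("atomic", "spinlock"):
        for i in range(len(C)):
            C[i] = 0  # models C.zero_()
        shmem.stream_barrier()
    # one_shot / two_shot: no-op (versioned locks + overwrite semantics)


def prepare(shmem, ws, C, shape, variant, num_tiles):
    if not ws.matches(shape, variant):
        _allocate_workspace(shmem, ws, shape, variant, num_tiles)
    _pre_kernel_sync(shmem, C, variant)
    return ws


shmem = FakeShmem()
ws = Workspace()
C = [1.0] * 4

prepare(shmem, ws, C, (2, 2), "one_shot", num_tiles=4)  # allocates
prepare(shmem, ws, C, (2, 2), "one_shot", num_tiles=4)  # reuses workspace
```

In this sketch the second call touches no collective at all. The split is only safe if every rank evaluates the match check identically, since _allocate_workspace must still be entered by all ranks together.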
Component
iris/ops/matmul_all_reduce.py, iris/ops/workspace.py