cvs/rccl: fix dmesg false positives and per-user RCCL hostfile #169

Open: speriaswamy-amd wants to merge 1 commit into main from dmesg-pattern-fix
@speriaswamy-amd
Contributor

Summary

Three small fixes that came out of debugging a 28-node rccl_perf run on a shared cluster:

  • cvs/lib/verify_lib.py: drop "Runlist is getting oversubscribed" and "Expect reduced ROCm performance" from the driver failure pattern. These are info-level amdgpu kernel messages that fire routinely on large multi-rank RCCL launches even when the run itself is healthy, and they were causing 7+ false-positive failures per run on a 28-node test_rccl_perf sweep. Real driver-side errors ("Queue preemption failed", "Failed to evict process queues", "No more SDMA queue to allocate", "amdgpu: process pid") are still matched.

  • cvs/tests/rccl/rccl_perf.py: call verify_dmesg_for_errors(..., till_end_flag=False) instead of till_end_flag=True, so the dmesg scan is bounded by each parametrized test's own start_time/end_time window. With till_end_flag=True the scan ran from start_time to the end of the dmesg buffer, so a kernel event from one test (e.g. a scatter_perf segfault) failed every subsequent parametrized test for the rest of the run.

  • cvs/lib/rccl_lib.py: write the mpirun hostfile to /tmp/rccl_hosts_file_<USER>.txt instead of the shared /tmp/rccl_hosts_file.txt. On shared cluster nodes, a leftover hostfile owned by another user blocks the run with "Operation not permitted" / "Permission denied".
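
For reviewers who want to see the intent of the first two changes without opening the diff, here is a minimal, self-contained sketch. It is not the code in verify_lib.py: the names DRIVER_ERR_RE and scan_dmesg, the (timestamp, message) input shape, and the timestamps are all hypothetical. Only the message strings are taken from the summary above.

```python
import re
from datetime import datetime

# Hypothetical pattern: only the four messages the summary says are still
# matched. The two info-level messages are deliberately absent.
DRIVER_ERR_RE = re.compile(
    r"Queue preemption failed"
    r"|Failed to evict process queues"
    r"|No more SDMA queue to allocate"
    r"|amdgpu: process pid"
)


def scan_dmesg(entries, start_time, end_time=None):
    """Return messages matching DRIVER_ERR_RE within the scan window.

    entries:  iterable of (datetime, message) pairs (assumed input shape).
    end_time: upper bound of the window; None scans to the end of the
              buffer, which stands in for the old till_end_flag=True mode.
    """
    hits = []
    for ts, msg in entries:
        if ts < start_time:
            continue
        if end_time is not None and ts > end_time:
            continue
        if DRIVER_ERR_RE.search(msg):
            hits.append(msg)
    return hits


# Made-up timeline: test A runs 10:00-10:01, test B runs 10:02-10:03.
log = [
    (datetime(2025, 1, 1, 10, 0, 5), "Runlist is getting oversubscribed"),
    (datetime(2025, 1, 1, 10, 0, 6), "Expect reduced ROCm performance"),
    (datetime(2025, 1, 1, 10, 2, 30), "Queue preemption failed"),
]
a_start, a_end = datetime(2025, 1, 1, 10, 0, 0), datetime(2025, 1, 1, 10, 1, 0)
b_start, b_end = datetime(2025, 1, 1, 10, 2, 0), datetime(2025, 1, 1, 10, 3, 0)

bounded_a = scan_dmesg(log, a_start, a_end)   # window of test A only
bounded_b = scan_dmesg(log, b_start, b_end)   # window of test B only
unbounded_a = scan_dmesg(log, a_start)        # scans to end of buffer
```

In this sketch the bounded scan for test A is clean, the bounded scan for test B reports the one real error, and the unbounded scan from test A's start also picks up the event that belongs to test B. That cross-test leakage is what the per-test window is meant to remove.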

Test plan

  • Verified on a 28-node group (RCCL 2.27.7 / ROCm 7.2 and RCCL 2.28.3 / ROCm 7.13) — runs that previously reported 7-9 failures now report 14/14 passed.
  • Confirmed that the "Test failure /…/common.cu.cpp" and segfault patterns still fail the correct (and only the correct) test after the per-test window change.
  • Confirmed two concurrent users on the same nodes no longer collide on the hostfile.
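
The hostfile collision check can be illustrated with a small sketch. This is an assumption-laden illustration, not the code in rccl_lib.py: the helper names rccl_hostfile_path and write_hostfile are hypothetical, and using getpass.getuser() to resolve <USER> is a guess at one reasonable way to do it. It also writes locally, whereas the real library may create the file on remote nodes.

```python
import getpass
import os
import tempfile


def rccl_hostfile_path(user=None, tmp_dir="/tmp"):
    """Build a per-user hostfile path of the form rccl_hosts_file_<USER>.txt.

    user defaults to the current login name (assumed lookup method).
    """
    if user is None:
        user = getpass.getuser()
    return f"{tmp_dir}/rccl_hosts_file_{user}.txt"


def write_hostfile(hosts, path):
    """Write one host per line to path, replacing any previous content."""
    with open(path, "w") as f:
        for host in hosts:
            f.write(host + "\n")


# Two made-up users sharing one directory get distinct files.
with tempfile.TemporaryDirectory() as d:
    path_a = rccl_hostfile_path("alice", tmp_dir=d)
    path_b = rccl_hostfile_path("bob", tmp_dir=d)
    write_hostfile(["node01", "node02"], path_a)
    write_hostfile(["node03"], path_b)
    with open(path_a) as f:
        content_a = f.read()
    files_created = sorted(os.listdir(d))
```

Because the user name is part of the file name, neither user ever needs to overwrite or delete a file owned by the other, which is the failure mode described in the summary.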

Made with Cursor
