[release/cvs-0.2.0] cvs/rccl: fix dmesg false positives and per-user RCCL hostfile#170

Open
speriaswamy-amd wants to merge 1 commit into release/cvs-0.2.0 from cherry-pick/cvs-0.2.0/dmesg-pattern-fix

@speriaswamy-amd
Contributor

Summary

Cherry-pick of #169 onto release/cvs-0.2.0. Three small fixes that came out of debugging a 28-node rccl_perf run on a shared cluster:

  • cvs/lib/verify_lib.py: drop "Runlist is getting oversubscribed" and "Expect reduced ROCm performance" from the driver failure pattern. These are amdgpu kernel info-level messages that fire routinely on large multi-rank RCCL launches even when the run is healthy. Real driver-side errors ("Queue preemption failed", "Failed to evict process queues", "No more SDMA queue to allocate", "amdgpu: process pid") are still matched.

  • cvs/tests/rccl/rccl_perf.py: call verify_dmesg_for_errors(..., till_end_flag=False) so that the dmesg scan is bounded by each parametrized test's own start_time/end_time window. With till_end_flag=True, a kernel event from one test (e.g. a scatter_perf segfault) failed every subsequent parametrized test for the rest of the run.

  • cvs/lib/rccl_lib.py — write the mpirun hostfile to /tmp/rccl_hosts_file_<USER>.txt instead of the shared /tmp/rccl_hosts_file.txt, so two users on the same shared cluster nodes don't collide. Replaces the sudo rm -f workaround on this branch.
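
The first two fixes can be sketched together as below. This is a minimal illustration, not the code in cvs/lib/verify_lib.py: the names DRIVER_ERROR_RE and find_driver_errors, the (timestamp, message) input shape, and the use of end_time=None to stand in for till_end_flag=True are all assumptions made for the example. Only the message strings come from the description above.

```python
import re

# Hypothetical pattern holding only the messages the PR says are still matched.
# "Runlist is getting oversubscribed" and "Expect reduced ROCm performance"
# are intentionally absent, so they no longer count as failures.
DRIVER_ERROR_RE = re.compile(
    r"Queue preemption failed"
    r"|Failed to evict process queues"
    r"|No more SDMA queue to allocate"
    r"|amdgpu: process pid"
)


def find_driver_errors(entries, start_time, end_time=None):
    """Return the (timestamp, message) pairs that match the error pattern.

    entries    -- iterable of (timestamp, message) pairs, assumed already parsed
    start_time -- lower bound of the scan window (inclusive)
    end_time   -- upper bound (inclusive); None scans to the end of the log,
                  which is the behaviour this sketch uses for till_end_flag=True
    """
    hits = []
    for ts, msg in entries:
        if ts < start_time:
            continue
        if end_time is not None and ts > end_time:
            continue
        if DRIVER_ERROR_RE.search(msg):
            hits.append((ts, msg))
    return hits
```

With a bounded window, an error logged during one parametrized test falls outside the window of a later test, so it is reported once rather than for every test that follows.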

Conflict resolution notes

  • cvs/lib/rccl_lib.py: this branch had a sudo rm -f /tmp/rccl_hosts_file.txt workaround for the same multi-user collision. The per-user path supersedes it; workaround removed.
  • cvs/tests/rccl/rccl_perf.py: on main, the verify_dmesg_for_errors call is gated by if can_use_sudo:, a passwordless-sudo guard that is not yet on this branch. For this cherry-pick the call stays unconditional and only the till_end_flag value changes, which avoids introducing an undefined can_use_sudo reference.
  • cvs/lib/verify_lib.py: auto-merged cleanly.
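
For reference, the per-user hostfile naming could look roughly like the sketch below. The helper names per_user_hostfile_path and write_hostfile are hypothetical, and resolving the user with getpass.getuser() is an assumption; the PR only states the resulting path, /tmp/rccl_hosts_file_<USER>.txt.

```python
import getpass


def per_user_hostfile_path(user=None, tmp_dir="/tmp"):
    """Build the per-user hostfile path described in the PR.

    Falling back to getpass.getuser() is an assumption of this sketch;
    the actual change may obtain the user name differently.
    """
    user = user or getpass.getuser()
    return f"{tmp_dir}/rccl_hosts_file_{user}.txt"


def write_hostfile(hosts, user=None, tmp_dir="/tmp"):
    """Write one host per line to the per-user path and return that path."""
    path = per_user_hostfile_path(user, tmp_dir)
    with open(path, "w") as fh:
        fh.write("\n".join(hosts) + "\n")
    return path
```

Because each user writes to a distinct file, nobody needs to delete a file owned by another user, which is what the sudo rm -f workaround was compensating for.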

Test plan

Made with Cursor
