[cherry-pick] Fix RCCL post-test JSON save for payloads >30 KB (#152)#167
Open
speriaswamy-amd wants to merge 4 commits into
Open
[cherry-pick] Fix RCCL post-test JSON save for payloads >30 KB (#152)#167speriaswamy-amd wants to merge 4 commits into
speriaswamy-amd wants to merge 4 commits into
Conversation
… heredoc with scp The post-test combined raw and aggregated RCCL JSON files are written to the MPI head node by `rccl_lib.rccl_perf` via a `cat > path << 'EOF' ... EOF` heredoc over an `ssh2-python` (libssh2) exec channel. Above roughly 30 KB, libssh2 rejects the exec request with `InvalidRequestError`, which propagates out of `Pssh.exec` and causes pytest to mark the entire collective FAILED even though the benchmark itself ran clean and its results were already schema-validated. At cluster scales of ~30+ nodes (combined files routinely 30-550 KB, aggregated 90-250 KB), every collective hits this. A 62-node overnight run on Apr 30 reproduced the issue deterministically across all 6 collectives. This change replaces the heredoc save with an OpenSSH `scp` subprocess from the conductor (login) node to the head node, reusing the SSH credentials already loaded in the `cluster_dict` fixture (`username`, `priv_key_file`): - New private helper `_push_json_to_remote_head(...)` writes the JSON to a temp file on the conductor and `scp -i <key> -o BatchMode=yes -o StrictHostKeyChecking=no <tmp> <user>@<head>:<path>`. - `rccl_perf(...)` gains a required `cluster_dict` positional argument and validates it carries `username` and `priv_key_file` (`ValueError` otherwise). No env-var fallback, no hardcoded defaults. - Save failure logs `log.error(...)` and continues. Benchmark and schema validation have already completed by this point and raw data is in `test.log`; this matches the file's convention of reserving `fail_test()` for real benchmark/validation problems. - `tests/rccl/rccl_perf.py::test_rccl_perf` passes the existing `cluster_dict` fixture to `rccl_lib.rccl_perf(...)`. Validated: - 4-node smoke: pytest exit 0, 2 `scp push succeeded` INFO lines, 0 `InvalidRequestError`, JSONs land on head and parse cleanly. - 62-node overnight: 12/12 saves succeeded across 6 collectives x 2 files, 0 failures, payloads up to ~550 KB. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit c2e9623)
…s rccl (cherry picked from commit 6cf8443)
1. scp_file is now a thin backward-compatible alias for upload_file,
Removes the misleading except IOError: raise Exception("Expected IOError
exception") leftover
2. Replace "conductor" terminology with "runner" across docstrings in
parallel_ssh_lib.py and rccl_lib.py
3. Add TestPsshFileTransfer with 12 tests covering happy path, recurse,
partial/total failure error message format, the {host: local_path}
contract, custom suffix_separator, and scp_file delegationa
(cherry picked from commit ac785ed)
(cherry picked from commit ea0f028)
cijohnson
approved these changes
May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cherry-pick of #152 into
release/cvs-0.2.0.Replaces the libssh2 heredoc save (which fails above ~30 KB) with an
scppush from the runner node to the MPI head, using the existingcluster_dictSSH credentials. Addsupload_file/download_filehelpers inparallel_ssh_liband unit tests.Commits:
c2e9623Fix RCCL post-test JSON save for payloads >30 KB by replacing libssh2 heredoc with scp6cf8443Use parallel-ssh send and receive functions, replaced all usage across rcclac785edAddressing PR commentsea0f028Format, lint and testsConflict resolution: trivial —
scp_filecollapsed to its newupload_filealias and the newTestPsshFileTransfertest class accepted as-is.Made with Cursor