Skip to content

[cherry-pick] Fix RCCL post-test JSON save for payloads >30 KB (#152)#167

Open
speriaswamy-amd wants to merge 4 commits into
release/cvs-0.2.0from
cherry-pick/cvs-0.2.0/pr-152
Open

[cherry-pick] Fix RCCL post-test JSON save for payloads >30 KB (#152)#167
speriaswamy-amd wants to merge 4 commits into
release/cvs-0.2.0from
cherry-pick/cvs-0.2.0/pr-152

Conversation

@speriaswamy-amd
Copy link
Copy Markdown
Contributor

Cherry-pick of #152 into release/cvs-0.2.0.

Replaces the libssh2 heredoc save (which fails above ~30 KB) with an scp push from the runner node to the MPI head, using the existing cluster_dict SSH credentials. Adds upload_file/download_file helpers in parallel_ssh_lib and unit tests.

Commits:

  • c2e9623 Fix RCCL post-test JSON save for payloads >30 KB by replacing libssh2 heredoc with scp
  • 6cf8443 Use parallel-ssh send and receive functions, replaced all usage across rccl
  • ac785ed Addressing PR comments
  • ea0f028 Format, lint and tests

Conflict resolution: trivial — scp_file collapsed to its new upload_file alias and the new TestPsshFileTransfer test class accepted as-is.

Made with Cursor

speriaswamy-amd and others added 4 commits May 14, 2026 09:07
… heredoc with scp

The post-test combined raw and aggregated RCCL JSON files are written to the
MPI head node by `rccl_lib.rccl_perf` via a `cat > path << 'EOF' ... EOF`
heredoc over an `ssh2-python` (libssh2) exec channel.

Above roughly 30 KB, libssh2 rejects the exec request with
`InvalidRequestError`, which propagates out of `Pssh.exec` and causes pytest
to mark the entire collective FAILED even though the benchmark itself ran
clean and its results were already schema-validated. At cluster scales of
~30+ nodes (combined files routinely 30-550 KB, aggregated 90-250 KB), every
collective hits this. A 62-node overnight run on Apr 30 reproduced the issue
deterministically across all 6 collectives.

This change replaces the heredoc save with an OpenSSH `scp` subprocess from
the conductor (login) node to the head node, reusing the SSH credentials
already loaded in the `cluster_dict` fixture (`username`, `priv_key_file`):

- New private helper `_push_json_to_remote_head(...)` writes the JSON to a
  temp file on the conductor and `scp -i <key> -o BatchMode=yes -o
  StrictHostKeyChecking=no <tmp> <user>@<head>:<path>`.
- `rccl_perf(...)` gains a required `cluster_dict` positional argument and
  validates it carries `username` and `priv_key_file` (`ValueError` otherwise).
  No env-var fallback, no hardcoded defaults.
- Save failure logs `log.error(...)` and continues. Benchmark and schema
  validation have already completed by this point and raw data is in
  `test.log`; this matches the file's convention of reserving `fail_test()`
  for real benchmark/validation problems.
- `tests/rccl/rccl_perf.py::test_rccl_perf` passes the existing `cluster_dict`
  fixture to `rccl_lib.rccl_perf(...)`.

Validated:
- 4-node smoke: pytest exit 0, 2 `scp push succeeded` INFO lines, 0
  `InvalidRequestError`, JSONs land on head and parse cleanly.
- 62-node overnight: 12/12 saves succeeded across 6 collectives x 2 files,
  0 failures, payloads up to ~550 KB.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit c2e9623)
1.  scp_file is now a thin backward-compatible alias for upload_file,
Removes the misleading except IOError: raise Exception("Expected IOError
exception") leftover
2.  Replace "conductor" terminology with "runner" across docstrings in
parallel_ssh_lib.py and rccl_lib.py
3. Add TestPsshFileTransfer with 12 tests covering happy path, recurse,
partial/total failure error message format, the {host: local_path}
contract, custom suffix_separator, and scp_file delegationa

(cherry picked from commit ac785ed)
(cherry picked from commit ea0f028)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants