
Port install_rvs to the orch fixture + Tier-1 dispatch tests#173

Closed
atnair-amd wants to merge 2 commits into main from port-install-rvs-to-orch

Conversation

@atnair-amd
Contributor

Summary

  • Migrate cvs/tests/health/install/install_rvs.py from the legacy phdl/shdl Pssh handles to the shared orch fixture (cvs/tests/conftest.py), mirroring rvs_cvs.py. 21 call sites become orch.exec.
  • Drop the dead `config_dict['nfs_install'] is True` head-only dispatch: in the canonical config the JSON value is the string "True", so the `is True` identity check was always false and head-only mode was never engaged in shipped configs.
  • Add Tier-1 dispatch unit tests covering both BaremetalOrchestrator and ContainerOrchestrator (cvs/tests/health/install/unittests/test_install_rvs_dispatch.py).
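The dead branch dropped above hinges on Python identity semantics: json.loads turns a JSON boolean true into the singleton True, but the string "True" never identity-matches it. A minimal illustration (only the nfs_install key mirrors the canonical config; nothing else here is from the repo):

```python
import json

# Canonical shipped config: the value is a *string*, not a JSON boolean.
shipped = json.loads('{"nfs_install": "True"}')
print(shipped['nfs_install'] is True)   # False — a str is never the True singleton

# Only a real JSON boolean would have engaged the head-only branch.
boolean = json.loads('{"nfs_install": true}')
print(boolean['nfs_install'] is True)   # True
```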

Commits

  1. eb43701 — Port install_rvs to the orch fixture; update orchestrator scope docs (5 files, +34 / -91).
  2. cf34cbf — Add Tier-1 dispatch tests for install_rvs orch port (2 files, +134).

Test plan

Tier-1 — mocked dispatch tests (devbox)

python -m unittest cvs.tests.health.install.unittests.test_install_rvs_dispatch -v:

test_dispatches_via_orch_exec_only (TestInstallRvsBaremetalDispatch) ... ok
test_dispatches_via_orch_exec_only (TestInstallRvsContainerDispatch) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.003s
OK

Tier-1 proof of detection (mutate-revert per incremental-pr-work.mdc §2)

M1 — change install_rvs.py:217 `orch.exec('which rvs', timeout=30)` → `orch.exec_on_head('which rvs', timeout=30)`. Both tests FAIL:

AssertionError: mock('which rvs', timeout=30) call not found

Revert → 2 PASS.

M2 — change install_rvs.py:217 `timeout=30` → `timeout=99`. Both tests FAIL on the same assertion (the actual call shows timeout=99). Revert → 2 PASS.

Tier-2 — real cluster (single MI300X node 10.245.128.47 / a06u43, ROCm 7.2.0, RVS pre-installed)

Baremetal orchestrator:

PYTHONPATH=/data/atnair/repos/cvs-install-rvs-port \
  ~/repos/cvs/.cvs_venv/bin/cvs run install_rvs \
  --cluster_file=/tmp/cvs_install_rvs_aimvt/cluster_baremetal.json \
  --config_file=/tmp/cvs_install_rvs_aimvt/mi300_health_config.json
# 1 passed in 20.30s
# "RVS installation and verification completed successfully"

Container orchestrator (cvs-harness/rocm-dev:rvs-installed):

PYTHONPATH=/data/atnair/repos/cvs-install-rvs-port \
  ~/repos/cvs/.cvs_venv/bin/cvs run install_rvs \
  --cluster_file=/tmp/cvs_install_rvs_aimvt/cluster_container.json \
  --config_file=/tmp/cvs_install_rvs_aimvt/mi300_health_config.json
# 1 passed in 6.03s
# Container launched: atnair_install_rvs_aimvt
# Commands routed via: sudo docker exec atnair_install_rvs_aimvt ...
# Container torn down: sudo docker rm -f atnair_install_rvs_aimvt

PYTHONPATH=... is a devbox-local workaround: the editable install in ~/repos/cvs/.cvs_venv/ points at the main cvs repo rather than this worktree, so the override ensures the cvs CLI loads the ported install_rvs.py.

Notes

  • The first commit's body mentions container Tier-2 was blocked by MultiProcessPssh.exec dropping the detailed= kwarg. That bug was fixed independently in Add upload_file and download_file support #161 (commit d666287, "Add upload_file and download_file support"), which now sits between this branch's base and HEAD. The branch was rebased onto current main to pick it up; container Tier-2 has since been exercised end-to-end and passes (see above).
  • The nfs_install head-only branch was deliberately dropped. In the canonical mi300_health_config.json the JSON value was "nfs_install": "True" (a string) while the check compared it with `is True`, so head-only mode was never engaged. Hand-rolled configs that relied on a JSON boolean true will now install on all nodes in parallel instead of just the head node; this matches what cvs run install_rvs already did via parallel SSH in shipped configs.

Out of scope / Future additions

  • Tier-2 cluster verification with the nfs_install pre/post matrix (8 stdout diffs across baremetal/container × nfs_install=true/false × pre/post) — covered in the orch_refactor-suite-install-rvs-test.md plan; queued as a separate PR.
  • Tier-1 dispatch tests for install_transferbench / install_agfhc / install_rocblas / install_babelstream — queued per-suite as those migrations land.

Commit eb43701 — full message:

Switch cvs/tests/health/install/install_rvs.py from the legacy
phdl/shdl Pssh handles to the shared orch fixture in cvs/tests/conftest.py,
mirroring rvs_cvs.py. 21 phdl.exec / hdl.exec callsites become orch.exec.

Drop the head-only-on-NFS dispatch (config_dict['nfs_install'] is True).
On the canonical mi300_health_config.json the JSON value is the string
"True" while the test compared with Python `is True`, so the branch was
dead in shipped configs and head-only behavior was never engaged. Custom
configs that relied on JSON boolean true to install from a single head
node will now install on all nodes in parallel; this matches what
cvs run install_rvs already did via parallel SSH.

Layer-3 doc updates so install_rvs is no longer omitted from the
"which suites consume the orchestrator" claims:

- docs/_includes/orchestrator-scope.rst: scope note now names rvs_cvs
  and install_rvs.
- docs/reference/configuration-files/cluster-file.rst: L42 (container
  backend scope) and L107 (orchestrator key scope) updated.
- cvs/tests/health/README.md: Container-mode example pins both suites.
- docs/how-to/run-cvs-tests.rst: container-mode note pins both suites.

Verified via cvs run install_rvs in baremetal Tier-2 on a single MI300X
node (a06u01): pytest exit 0, "RVS installation and verification
completed successfully" emitted, auto-bundle written.

Container Tier-2 not exercised in this commit because the orch fixture's
container backend has two pre-existing regressions independent of this
port: MultiProcessPssh.exec drops the `detailed=` kwarg (PR #136
backward-compat regression) which blocks orch.setup_containers(), and
cvs/core/runtimes/docker.py exec does not wrap commands in `bash -c`
so shell chains like `cd X; cmake ...` get split between host and
container. Both will be fixed in follow-up PRs.
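The `bash -c` regression described above can be sketched in isolation. `docker_exec_cmdline` is a hypothetical helper for illustration, not the actual cvs/core/runtimes/docker.py API:

```python
import shlex


def docker_exec_cmdline(container, cmd):
    """Build the host command line for running `cmd` inside `container`.

    `cmd` is wrapped in a single quoted `bash -c` argument, so a shell
    chain like `cd /build; cmake ..` runs inside one in-container shell.
    Interpolating `cmd` unquoted (the pre-fix behavior described above)
    would let the host shell split the chain at the `;`, so only
    `cd /build` would reach the container.
    """
    return f"sudo docker exec {container} bash -c {shlex.quote(cmd)}"


print(docker_exec_cmdline("atnair_install_rvs_aimvt", "cd /build; cmake .."))
# sudo docker exec atnair_install_rvs_aimvt bash -c 'cd /build; cmake ..'
```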
Commit cf34cbf — full message:

Pin three properties of the migrated install_rvs.py against both
BaremetalOrchestrator and ContainerOrchestrator orch instances:

1. dispatch flows through orch.exec (not orch.exec_on_head);
2. orch.exec(cmd, timeout=...) is accepted by both backends;
3. the legacy nfs_install head-only branch is gone (regression canary
   if someone re-introduces it).

Tests stub orch.exec / orch.exec_on_head directly on the constructed
orchestrator instance, sidestepping the signature drift between
BaremetalOrchestrator.exec(cmd, hosts=None, timeout=None) and
ContainerOrchestrator.exec(cmd, hosts=None, timeout=None, detailed=False).

Run from the repo root:

    python -m unittest \
      cvs.tests.health.install.unittests.test_install_rvs_dispatch -v
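Under the stubbing approach described above, each dispatch test reduces to roughly this sketch; `install_rvs_dispatch` and `FakeOrchestrator` are hypothetical stand-ins for the real module entry point and orchestrator classes:

```python
import unittest
from unittest import mock


def install_rvs_dispatch(orch):
    # Stand-in for the migrated code path: every call site goes through
    # orch.exec, never orch.exec_on_head.
    orch.exec('which rvs', timeout=30)


class FakeOrchestrator:
    # Minimal surface shared by both real orchestrators; the real tests
    # stub exec/exec_on_head on a constructed instance instead.
    def exec(self, cmd, hosts=None, timeout=None):
        raise NotImplementedError

    def exec_on_head(self, cmd, timeout=None):
        raise NotImplementedError


class TestDispatchSketch(unittest.TestCase):
    def test_dispatches_via_orch_exec_only(self):
        orch = FakeOrchestrator()
        orch.exec = mock.Mock(return_value=0)
        orch.exec_on_head = mock.Mock(return_value=0)

        install_rvs_dispatch(orch)

        # M1/M2-style mutations flip exactly these two assertions.
        orch.exec.assert_any_call('which rvs', timeout=30)
        orch.exec_on_head.assert_not_called()


if __name__ == '__main__':
    unittest.main()
```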
@atnair-amd atnair-amd closed this May 14, 2026
@atnair-amd atnair-amd deleted the port-install-rvs-to-orch branch May 14, 2026 23:20