[release/cvs-0.2.0] feat(aorta): multi-node disaggregated launch via single cluster.json (AIMVT-173)#172

Open
speriaswamy-amd wants to merge 2 commits into release/cvs-0.2.0 from surya/aorta-multinode-disaggregated-release-0.2.0

@speriaswamy-amd

Summary

Cherry-pick of #171 onto release/cvs-0.2.0 for AIMVT-173. Lets the CVS Aorta benchmark run across N nodes from a single cluster.json (mirrors the pattern used by the existing PyTorch xDiT and SGLang multi-node suites).

Single-node behavior is unchanged: multi_node.master_launch_mode='auto' keeps the legacy single-node path on 1-node clusters, and pre-existing YAML files without a multi_node: block still validate.
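The backward-compatibility rule above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, the returned labels, and the assumption that a missing multi_node block defaults to 'auto' are all hypothetical.

```python
def resolve_launch_mode(master_launch_mode: str, num_nodes: int) -> str:
    """Hypothetical helper: map the config setting and cluster size to a launch path."""
    if num_nodes < 1:
        raise ValueError("cluster must contain at least one node")
    if master_launch_mode == "auto":
        # Per the PR summary, 'auto' keeps the legacy path on 1-node clusters.
        return "single_node" if num_nodes == 1 else "multi_node"
    # The PR text only names 'auto'; any other value is out of scope for this sketch.
    raise ValueError(f"unsupported master_launch_mode: {master_launch_mode!r}")


def launch_mode_from_config(config: dict, num_nodes: int) -> str:
    """Hypothetical helper: tolerate configs that have no multi_node block."""
    # Assumption: a missing block behaves like master_launch_mode='auto'.
    multi_node = config.get("multi_node") or {}
    return resolve_launch_mode(multi_node.get("master_launch_mode", "auto"), num_nodes)
```

Under these assumptions, an older config with no multi_node block on a 1-node cluster resolves to the single-node path, while the same config on a 2-node cluster resolves to the multi-node path.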

Cherry-pick provenance

| Commit on main | PR commit here |
| --- | --- |
| 7dacef3 feat(aorta): multi-node disaggregated launch ... | 709cd0a |
| ea1c7ba test(aorta): unit tests for multi-node launch helpers | 7b987c1 |

One trivial conflict during cherry-pick on docs/reference/configuration-files/aorta.rst (release branch had a slightly different wording for the unrelated analysis.skip_if_exists row); resolved by keeping the release-branch wording and appending the new multi_node.* rows below it.

See #171 for the full description, code review, validation log, and the four runtime bugs surfaced by the 2-node end-to-end run.

Test plan

Made with Cursor

speriaswamy-amd and others added 2 commits May 14, 2026 14:03
…(AIMVT-173)

Run the CVS Aorta pipeline across N nodes from one cluster.json by orchestrating
torchrun on every node in parallel, rendezvousing on the head node, and consolidating
per-node torch_profiler trees into <aorta_path>/combined_traces/node_<rank>/ for
the host parser. Single-node behavior is unchanged: master_launch_mode='auto'
keeps the legacy script path for 1-node clusters.
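As a rough illustration of the per-node launch described above, the sketch below builds one torchrun command per node, all pointing at the head node for rendezvous. The helper name, host names, port, and script path are hypothetical; only the torchrun flags themselves (--nnodes, --nproc_per_node, --node_rank, --master_addr, --master_port) are standard.

```python
import shlex


def build_torchrun_cmd(node_rank, nnodes, nproc_per_node,
                       master_addr, master_port, train_script, script_args=()):
    """Hypothetical helper: build the torchrun argv for one node."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        # Every node rendezvouses on the head node's address and port.
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        train_script,
        *script_args,
    ]


# Hypothetical 2-node cluster; the first host acts as the head.
hosts = ["head-node", "worker-node"]
commands = [
    build_torchrun_cmd(rank, len(hosts), 8, hosts[0], 29500, "train.py")
    for rank in range(len(hosts))
]
for host, cmd in zip(hosts, commands):
    print(host, "->", shlex.join(cmd))
```

Only node_rank differs between the commands, which is what lets an orchestrator start them on every node in parallel.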

Notable runtime fixes shaken out by the 2-node validation on g17u19+f16u13:
- launch container as root (+render group) so /dev/kfd is accessible
- pull head-node traces over SSH when orchestrator != head physical host
- pack all training_overrides behind a single --override (aorta argparse uses
  nargs="*" and silently drops earlier groups otherwise)
- initialise trace_mtime before the freshest-trace comparison
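
The --override fix in the list above relies on standard argparse behavior: with nargs="*" and the default store action, each repeated occurrence of a flag replaces the previous value. A minimal reproduction, assuming a key=value override format and a hypothetical pack_overrides helper (neither is taken from the aorta source):

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Stand-in parser with the same nargs="*" shape as described in the commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--override", nargs="*", default=[])
    return parser


def pack_overrides(training_overrides: dict) -> list:
    """Hypothetical helper: emit every override behind one --override flag."""
    if not training_overrides:
        return []
    return ["--override", *[f"{key}={value}" for key, value in training_overrides.items()]]


parser = make_parser()

# Two separate groups: the second occurrence overwrites the first.
split = parser.parse_args(["--override", "a=1", "--override", "b=2"])
assert split.override == ["b=2"]

# One packed group: every override survives.
packed = parser.parse_args(pack_overrides({"a": 1, "b": 2}))
assert packed.override == ["a=1", "b=2"]
```

Switching the parser to action="extend" would also avoid the drop, but that means changing the receiving script rather than the caller.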

Adds AortaMultiNodeConfig + Pydantic schema, refactors AortaRunner.run(),
documents the new block in docs/reference/configuration-files/aorta.rst.

Co-authored-by: Cursor <cursoragent@cursor.com>
test(aorta): unit tests for multi-node launch helpers

24 unittest cases covering the new launch-mode resolution, master-port picking,
torchrun command construction, base-env merging, combined_traces helper,
local trace-tree copy, train_script existence check, and the Pydantic
AortaMultiNodeConfigFile schema. Also pins the "single --override group"
invariant in two places to prevent the argparse(nargs="*") regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
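
For readers unfamiliar with the trace layout, the combined_traces helper and local trace-tree copy mentioned in the commits could look roughly like the sketch below. The function names and the replace-on-rerun behavior are assumptions; only the <aorta_path>/combined_traces/node_<rank>/ layout comes from the commit message.

```python
import shutil
from pathlib import Path


def combined_trace_dir(aorta_path, node_rank: int) -> Path:
    """Destination layout from the commit message: combined_traces/node_<rank>/."""
    return Path(aorta_path) / "combined_traces" / f"node_{node_rank}"


def copy_trace_tree(src, aorta_path, node_rank: int) -> Path:
    """Hypothetical helper: copy one node's torch_profiler tree into the combined layout."""
    dest = combined_trace_dir(aorta_path, node_rank)
    # Assumption: a rerun replaces stale traces instead of merging with them.
    if dest.exists():
        shutil.rmtree(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest)
    return dest
```

This covers only the local copy; pulling the head node's traces over SSH when the orchestrator runs on a different host, as the first commit describes, would need a separate transfer step.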