[release/cvs-0.2.0] feat(aorta): multi-node disaggregated launch via single cluster.json (AIMVT-173) #172
Open
speriaswamy-amd wants to merge 2 commits into
…(AIMVT-173)

Run the CVS Aorta pipeline across N nodes from one cluster.json by orchestrating torchrun on every node in parallel, rendezvousing on the head, and consolidating per-node torch_profiler trees into <aorta_path>/combined_traces/node_<rank>/ for the host parser.

Single-node behavior is unchanged: master_launch_mode='auto' keeps the legacy script path for 1-node clusters.

Notable runtime fixes shaken out by the 2-node validation on g17u19+f16u13:
- launch container as root (+render group) so /dev/kfd is accessible
- pull head-node traces over SSH when orchestrator != head physical host
- pack all training_overrides behind a single --override (aorta argparse uses nargs="*" and silently drops earlier groups otherwise)
- initialise trace_mtime before the freshest-trace comparison

Adds AortaMultiNodeConfig + Pydantic schema, refactors AortaRunner.run(), documents the new block in docs/reference/configuration-files/aorta.rst.

Co-authored-by: Cursor <cursoragent@cursor.com>
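The --override fix is easiest to see in isolation. The sketch below is not aorta's actual parser; it is a stand-in that assumes only what the commit message states (an --override option declared with nargs="*"), and pack_overrides and the key names are hypothetical. With argparse's default store action, each repeated occurrence of the flag replaces the previous list, which is why all pairs have to follow a single flag.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser (assumption): only the nargs="*" declaration matters here.
    parser = argparse.ArgumentParser()
    parser.add_argument("--override", nargs="*", default=[])
    return parser


def pack_overrides(overrides: dict) -> list:
    """Emit one --override flag followed by every key=value pair (hypothetical helper)."""
    if not overrides:
        return []
    return ["--override", *[f"{key}={value}" for key, value in overrides.items()]]


parser = build_parser()

# Repeating the flag: the default "store" action keeps only the last group.
repeated = parser.parse_args(["--override", "a=1", "--override", "b=2"])
print(repeated.override)  # ['b=2']

# Packing every pair behind one flag keeps them all.
packed = parser.parse_args(pack_overrides({"a": 1, "b": 2}))
print(packed.override)  # ['a=1', 'b=2']
```

An alternative on the parser side would be action="extend", but that changes aorta itself; packing on the caller side leaves the downstream CLI untouched.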
24 unittest cases covering the new launch-mode resolution, master-port picking, torchrun command construction, base-env merging, combined_traces helper, local trace-tree copy, train_script existence check, and the Pydantic AortaMultiNodeConfigFile schema. Also pins the "single --override group" invariant in two places to prevent the argparse(nargs="*") regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
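For readers unfamiliar with what "master-port picking" and "torchrun command construction" involve, here is a rough sketch. The function names, the train.py path, the head address, and the override keys are all hypothetical and are not taken from the PR; only the torchrun flags (--nnodes, --node_rank, --nproc_per_node, --master_addr, --master_port) are standard. It also shows the kind of assertion that could pin the single --override invariant.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0 (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_torchrun_cmd(*, nnodes, node_rank, nproc_per_node,
                       master_addr, master_port, train_script, overrides):
    """Build the argv for one node; every node gets the same master address and port."""
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        train_script,
    ]
    if overrides:
        # One --override group only, so a nargs="*" parser keeps every pair.
        cmd += ["--override", *overrides]
    return cmd


port = pick_free_port()
cmds = [
    build_torchrun_cmd(
        nnodes=2, node_rank=rank, nproc_per_node=8,
        master_addr="10.0.0.1",          # assumed head-node address
        master_port=port,
        train_script="train.py",         # assumed script path
        overrides=["training.max_steps=10", "profiling.enabled=true"],
    )
    for rank in range(2)
]
for cmd in cmds:
    print(" ".join(cmd))
```

Note that a port found free on the orchestrator is not guaranteed to be free on the head node; a real implementation would need to probe on the head or accept a configured port.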
Summary
Cherry-pick of #171 onto release/cvs-0.2.0 for AIMVT-173. Lets the CVS Aorta benchmark run across N nodes from a single cluster.json (mirrors the pattern used by the existing PyTorch xDiT and SGLang multi-node suites).

Single-node behavior is unchanged (multi_node.master_launch_mode='auto' keeps the legacy single-node path on 1-node clusters), and pre-existing yamls without a multi_node: block still validate.

Cherry-pick provenance

| main | | PR |
|---|---|---|
| 7dacef3 | feat(aorta): multi-node disaggregated launch ... | 709cd0a |
| ea1c7ba | test(aorta): unit tests for multi-node launch helpers | 7b987c1 |

One trivial conflict during cherry-pick on docs/reference/configuration-files/aorta.rst (release branch had a slightly different wording for the unrelated analysis.skip_if_exists row); resolved by keeping the release-branch wording and appending the new multi_node.* rows below it.

See #171 for the full description, code review, validation log, and the four runtime bugs surfaced by the 2-node end-to-end run.
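One of the four runtime bugs, per the commit message, was a trace_mtime that was not initialised before the freshest-trace comparison. The sketch below shows the general pattern only; find_freshest_trace, the *.json glob, and the node_<rank> layout used in the demo are assumptions, not the runner's actual code.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_freshest_trace(root: Path, pattern: str = "*.json") -> Optional[Path]:
    """Return the most recently modified file under root, or None if none match."""
    freshest = None
    trace_mtime = float("-inf")  # initialised before the first comparison
    for candidate in root.rglob(pattern):
        mtime = candidate.stat().st_mtime
        if mtime > trace_mtime:
            freshest, trace_mtime = candidate, mtime
    return freshest


# Demo on a throwaway tree with explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old = root / "node_0" / "old_trace.json"
    new = root / "node_1" / "new_trace.json"
    for path, mtime in ((old, 1_000), (new, 2_000)):
        path.parent.mkdir(parents=True)
        path.write_text("{}")
        os.utime(path, (mtime, mtime))
    print(find_freshest_trace(root).name)  # new_trace.json
```

Starting from negative infinity means the first candidate always wins the comparison, and an empty tree returns None instead of raising on an unbound name.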
Test plan
- ruff check . --exclude .venv: clean on this branch
- python -m unittest cvs.runners.unittests.test_aorta_multinode: 24/24 pass
- main first, then this PR onto release/cvs-0.2.0

Made with Cursor