feat(aorta): multi-node disaggregated launch via single cluster.json (AIMVT-173) #171

Open
speriaswamy-amd wants to merge 2 commits into main from surya/aorta-multinode-disaggregated

@speriaswamy-amd
Contributor

Summary

Implements AIMVT-173: run the CVS Aorta benchmark across N nodes from a single cluster.json, mirroring the disaggregated launch pattern used by the existing PyTorch xDiT and SGLang multi-node test suites.

The AortaRunner now orchestrates torchrun on every node in parallel (rendezvousing on the head), and consolidates per-node torch_profiler trees into <aorta_path>/combined_traces/node_<rank>/ so the host parser sees a single unified set. Single-node behavior is unchanged: multi_node.master_launch_mode='auto' keeps the legacy experiment_script path for 1-node clusters, and pre-existing yamls without a multi_node: block still validate (Pydantic supplies sensible defaults).
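A rough sketch of the launch flow described above, for reviewers who want the shape without reading the diff. The function names echo _resolve_launch_mode and _build_torchrun_command, but the signatures, the concrete mode names "script" and "torchrun", and the port value are assumptions for illustration, not the runner's actual code. Only the torchrun flags themselves are standard.

```python
from typing import List


def resolve_launch_mode(mode: str, num_nodes: int) -> str:
    """Map the configured mode to a concrete launch path.

    Assumption: 'auto' keeps the legacy script path on a 1-node cluster and
    switches to torchrun otherwise. The names 'script' and 'torchrun' are
    hypothetical placeholders for whatever the runner uses internally.
    """
    if mode != "auto":
        return mode
    return "script" if num_nodes == 1 else "torchrun"


def build_torchrun_command(
    node_rank: int,
    num_nodes: int,
    gpus_per_node: int,
    master_addr: str,
    master_port: int,
    train_script: str,
) -> List[str]:
    """Build the per-node torchrun argv; every node rendezvouses on the head."""
    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        train_script,
    ]


# One command per node; only --node_rank differs between them.
commands = [
    build_torchrun_command(rank, 2, 8, "g17u19", 29500, "train.py")
    for rank in range(2)
]
```

Each node receives the same rendezvous address and port, so the head's hostname is the only piece of cluster.json every worker needs in order to join.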

What's in the diff

  • cvs/runners/aorta.py: AortaMultiNodeConfig dataclass; _resolve_launch_mode, _pick_master_port, _build_torchrun_command, _run_single_node, _collect_multi_node_traces + local/remote copy helpers; refactored run()
  • cvs/parsers/schemas.py: AortaMultiNodeConfigFile Pydantic schema + train_script existence check
  • cvs/input/config_file/aorta/aorta_benchmark.yaml — new multi_node: block with inline docs
  • docs/reference/configuration-files/aorta.rst — new "Multi-node disaggregated launch" section + parameter table
  • cvs/runners/unittests/test_aorta_multinode.py — 24 unit tests (launch-mode resolution, port selection, command construction, env merging, trace-tree copy, schema validation, single --override group invariant)
  • cvs/tests/benchmark/test_aorta.py — wire multi_node block through the runner-config fixture
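For the local half of the trace-tree copy, a minimal sketch of how one node's torch_profiler tree could land under combined_traces/node_<rank>/. The helper name copy_node_traces and its parameters are hypothetical; the real helpers in cvs/runners/aorta.py may differ, and the remote (SSH/rsync) path is not shown.

```python
import shutil
from pathlib import Path


def copy_node_traces(profiler_dir: Path, aorta_path: Path, node_rank: int) -> Path:
    """Copy one node's profiler tree into <aorta_path>/combined_traces/node_<rank>/.

    Hypothetical helper: it mirrors the directory layout named in the PR
    description, not the actual implementation.
    """
    dest = aorta_path / "combined_traces" / f"node_{node_rank}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) lets a re-run refresh an existing tree.
    shutil.copytree(profiler_dir, dest, dirs_exist_ok=True)
    return dest
```

Keeping one subdirectory per node rank means identically named trace files from different nodes cannot overwrite each other, and the host parser can walk a single root.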

Validation

End-to-end cvs run test_aorta against a real 2-node cluster (g17u19 head + f16u13 worker, 16xMI300X total) — 5/5 pytest cases pass in 148s, traces collected from both nodes, host parser produced metrics for all 16 ranks. Four runtime bugs surfaced and were fixed during this validation:

  1. Container ran as jenkins UID and couldn't open /dev/kfd despite --privileged → now passes user="root" and group_add=["video","render"]
  2. UnboundLocalError: trace_mtime in the freshest-trace selector → initialised upfront
  3. Head-node traces never collected when orchestrator is a separate login host → falls back to SSH/rsync
  4. --override key=val --override key=val … collapsed to last group only (aorta train.py uses argparse(nargs="*")) → packed behind a single --override
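Item 4 is easy to reproduce in isolation. The sketch below assumes only that the training script declares --override with nargs="*" and the default store action, as stated above. The parser is a stand-in rather than aorta's actual train.py, and the override keys and the pack_overrides helper are made up for illustration.

```python
import argparse
from typing import List


def make_parser() -> argparse.ArgumentParser:
    """Stand-in for a parser that declares --override with nargs='*'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--override", nargs="*", default=[])
    return parser


def pack_overrides(overrides: List[str]) -> List[str]:
    """Emit all key=val pairs behind a single --override flag."""
    return ["--override", *overrides] if overrides else []


parser = make_parser()

# Repeating the flag: the store action overwrites, so only the last group survives.
repeated = parser.parse_args(["--override", "key_a=1", "--override", "key_b=2"])
print(repeated.override)

# Packing behind one flag keeps every pair.
packed = parser.parse_args(pack_overrides(["key_a=1", "key_b=2"]))
print(packed.override)
```

The first parse keeps only key_b=2, while the packed form keeps both pairs, which is why the runner emits a single --override group.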

Test plan

  • ruff check . --exclude .venv — clean
  • ruff format --check — clean
  • python -m unittest discover -s cvs: 288/288 pass (existing 264 + 24 new)
  • cvs run test_aorta on real 2-node cluster — 5/5 pass
  • Backward compat: yaml without multi_node: block still validates and runs single-node
  • CI on PR

Made with Cursor

speriaswamy-amd and others added 2 commits May 14, 2026 14:01
…(AIMVT-173)

Run the CVS Aorta pipeline across N nodes from one cluster.json by orchestrating
torchrun on every node in parallel, rendezvous-ing on the head, and consolidating
per-node torch_profiler trees into <aorta_path>/combined_traces/node_<rank>/ for
the host parser. Single-node behavior is unchanged: master_launch_mode='auto'
keeps the legacy script path for 1-node clusters.

Notable runtime fixes shaken out by the 2-node validation on g17u19+f16u13:
- launch container as root (+render group) so /dev/kfd is accessible
- pull head-node traces over SSH when orchestrator != head physical host
- pack all training_overrides behind a single --override (aorta argparse uses
  nargs="*" and silently drops earlier groups otherwise)
- initialise trace_mtime before the freshest-trace comparison

Adds AortaMultiNodeConfig + Pydantic schema, refactors AortaRunner.run(),
documents the new block in docs/reference/configuration-files/aorta.rst.

Co-authored-by: Cursor <cursoragent@cursor.com>

24 unittest cases covering the new launch-mode resolution, master-port picking,
torchrun command construction, base-env merging, combined_traces helper,
local trace-tree copy, train_script existence check, and the Pydantic
AortaMultiNodeConfigFile schema. Also pins the "single --override group"
invariant in two places to prevent the argparse(nargs="*") regression.

Co-authored-by: Cursor <cursoragent@cursor.com>