add: ModelOpt Launcher for Slurm job submission #1031
Open: ChenhanYu wants to merge 10 commits into `main` from `chenhan/modelopt-launcher`.
+4,088 −12
10 commits:
- `d28acd3` add: ModelOpt Launcher for Slurm job submission
- `f3d3020` add: shared core.py, slurm_config, services, and Qwen3-8B example
- `f7f9878` fix: add factory registry for task_configs YAML resolution
- `ad1f0d8` chg: remove task param from launch.py, update YAML format and README
- `8e08365` add: common/ scripts, EAGLE3 pipeline, ADVANCED.md
- `22b5267` add: unit tests for launcher (64 tests, all passing)
- `59cdede` fix: replace Model-Optimizer submodule with symlink to parent
- `bf91e2b` chg: docs, gitignore, hf_local global_vars, symlink auto-creation
- `4a05a1d` fix: skip launcher tests when nemo_run not installed, add docstrings
- `edaaab0` chg: move launcher tests to launcher/tests/, add CI workflow
`.gitmodules` (new file, +3 lines):

```text
[submodule "launcher/modules/Megatron-LM"]
	path = launcher/modules/Megatron-LM
	url = https://github.com/AAnoosheh/Megatron-LM.git
```
`.gitignore` (new file, +22 lines):

```text
# Virtual environment
.venv/

# nemo-run state
.slurm_jobs
.docker_jobs.json
.local_jobs.json

# Experiment artifacts (generated at runtime)
experiments/
local_experiments/

# uv lock (generated, not portable)
uv.lock

# Python cache
__pycache__/

# Editor swap files
*.swp
*.swo
*~
```
`ADVANCED.md` (new file, +254 lines):
# Advanced Guide

## Architecture

### Shared Core

The launcher is built on a shared `core.py` module used by both:

- **`launch.py`** — public-facing launcher (this repo)
- **`slurm.py`** — internal CI orchestrator ([nmm-sandbox](https://gitlab-master.nvidia.com/omniml/integration/nmm-sandbox))
```text
core.py (shared)
├── Dataclasses: SandboxTask, SandboxPipeline, GlobalVariables
├── Executor builders: build_slurm_executor(), build_docker_executor()
├── Job runner: run_jobs()
├── Version reporter: report_versions()
├── Factory registry: register_factory(), set_slurm_config_type()
└── Default env: get_default_env()

launch.py                                 slurm.py (nmm-sandbox)
├── imports core.py                       ├── imports core.py (via sys.path)
├── slurm_config.py (env-var driven)      ├── tools/slurm_config.py (cluster-specific)
├── registers: slurm_factory              ├── registers: oci_hsg, cw_dfw, computelab, ...
├── packager (LAUNCHER_DIR relative)      ├── packager (repo root relative)
└── launch() entrypoint                   └── cicd() entrypoint
```
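The dataclass layer above can be sketched roughly as follows. This is a hypothetical reconstruction: only the class names and the `script`/`environment`/`global_vars` fields are implied by this document, and the actual signatures in `core.py` will differ.

```python
# Illustrative sketch of the core.py dataclass layer; field names beyond
# script/environment/global_vars are assumptions, not the real API.
from dataclasses import dataclass, field

@dataclass
class SandboxTask:
    script: str                                       # service script to run
    environment: dict = field(default_factory=dict)   # per-task env vars

@dataclass
class SandboxPipeline:
    tasks: list = field(default_factory=list)         # ordered SandboxTask list
    global_vars: dict = field(default_factory=dict)   # shared <<global_vars.X>> values

pipe = SandboxPipeline(tasks=[SandboxTask(script="common/vllm/query.sh")])
print(len(pipe.tasks))  # → 1
```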
### Code Packaging

When a job is submitted, `PatternPackager` creates a tar.gz of the source code and rsyncs it to the cluster. The `code/` directory on the remote mirrors the launcher structure:

```text
code/
├── modules/
│   ├── Megatron-LM/megatron/...       # Training framework
│   └── Model-Optimizer/modelopt/...   # ModelOpt library (mounted over container install)
└── common/
    ├── megatron-lm/quantize/
    │   └── quantize.sh                # PTQ quantization + MMLU
    ├── tensorrt-llm/query.sh          # TRT-LLM server + query
    ├── vllm/query.sh                  # vLLM server + query
    ├── eagle3/                        # EAGLE3 pipeline scripts
    └── query.py                       # OpenAI-compatible query client
```

The `modelopt/` directory is bind-mounted over the container's installed ModelOpt, so your local changes take effect without rebuilding the container.
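The tar-and-ship step can be sketched as below. This is a simplified stand-in, not `PatternPackager` itself: the extension list is illustrative, and the real packager's pattern matching and the rsync step are omitted.

```python
# Minimal sketch of packaging: tar up matching source files (following
# symlinks, as the packager must for modules/Model-Optimizer). Patterns
# here are illustrative assumptions.
import os
import tarfile
import tempfile

def package_code(src_dir: str, out_tar: str, patterns=(".py", ".sh", ".yaml")):
    """Create a tar.gz containing only files with the given extensions."""
    with tarfile.open(out_tar, "w:gz") as tar:
        for root, _dirs, files in os.walk(src_dir, followlinks=True):
            for name in files:
                if name.endswith(patterns):
                    full = os.path.join(root, name)
                    tar.add(full, arcname=os.path.relpath(full, src_dir))

# Demo on a throwaway directory: only launch.py matches the patterns.
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "launch.py"), "w").close()
open(os.path.join(tmp, "README.md"), "w").close()
out = os.path.join(tmp, "code.tar.gz")
package_code(tmp, out)
with tarfile.open(out) as tar:
    names = sorted(tar.getnames())
print(names)  # → ['launch.py']
```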
### Model-Optimizer Symlink

`launcher/modules/Model-Optimizer` is a **symlink** to `../..` (the Model-Optimizer root), not a git submodule. This avoids recursive nesting — the launcher lives inside Model-Optimizer and references its own parent.

- Git tracks the symlink natively (`git clone` preserves it)
- `launch.py` auto-creates the symlink on first run if it's missing
- The packager's `find` follows symlinks, so `modules/Model-Optimizer/modelopt/*` resolves correctly
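The auto-creation step can be sketched like this. `ensure_modelopt_symlink` is a hypothetical helper name; the actual logic lives inside `launch.py`.

```python
# Sketch of symlink auto-creation, assuming the launcher lives at
# <repo>/launcher and links modules/Model-Optimizer -> ../.. (repo root).
import os
import tempfile

def ensure_modelopt_symlink(launcher_dir: str) -> str:
    link = os.path.join(launcher_dir, "modules", "Model-Optimizer")
    if not os.path.lexists(link):                      # missing -> create
        os.makedirs(os.path.dirname(link), exist_ok=True)
        # Relative target, so the link survives clones and moves.
        os.symlink(os.path.join("..", ".."), link)
    return link

repo = tempfile.mkdtemp()
launcher = os.path.join(repo, "launcher")
os.makedirs(launcher)
link = ensure_modelopt_symlink(launcher)
print(os.path.islink(link))  # → True
```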
### Factory System

Slurm cluster configs use a factory pattern. YAMLs reference a factory by name:

```yaml
slurm_config:
  _factory_: "slurm_factory"
  nodes: 1
```

Factories are registered at import time via `register_factory()`. In `launch.py`, `slurm_factory` reads from environment variables (`SLURM_HOST`, `SLURM_ACCOUNT`, etc.). In `slurm.py`, `slurm_factory` resolves to a cluster-specific factory based on `SLURM_CLUSTER`:

```bash
# Default (OCI-HSG)
uv run slurm.py --yaml config.yaml --yes

# Switch cluster
SLURM_CLUSTER=cw_dfw uv run slurm.py --yaml config.yaml --yes
```
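The factory-registry pattern can be sketched as below. The config fields and `resolve` helper are illustrative; only `register_factory()` and the `_factory_` key come from this document.

```python
# Minimal sketch of a name -> factory registry with YAML-level overrides.
import os

_FACTORIES = {}

def register_factory(name, fn):
    _FACTORIES[name] = fn

def resolve(config: dict) -> dict:
    """Replace a {'_factory_': name, ...} node with the factory's output,
    letting explicit YAML keys override the factory's defaults."""
    factory = _FACTORIES[config.pop("_factory_")]
    return {**factory(), **config}

def slurm_factory():
    # Env-var driven defaults, in the spirit of launch.py's factory.
    return {"host": os.environ.get("SLURM_HOST", "localhost"),
            "account": os.environ.get("SLURM_ACCOUNT", "default"),
            "nodes": 2}

register_factory("slurm_factory", slurm_factory)
cfg = resolve({"_factory_": "slurm_factory", "nodes": 1})
print(cfg["nodes"])  # → 1 (YAML override wins over the factory default of 2)
```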
### YAML Formats

**`--yaml` format** (recommended) — maps top-level keys to function args:

```yaml
job_name: Qwen3-8B_NVFP4
pipeline:
  task_0:
    script: common/megatron-lm/quantize/quantize.sh
slurm_config:
  _factory_: "slurm_factory"
```

**`pipeline=@` format** — bare pipeline without wrapper:

```yaml
task_0:
  script: common/megatron-lm/quantize/quantize.sh
slurm_config:
  _factory_: "slurm_factory"
```

**Test YAML format** — list of jobs with `_target_` and overrides, used by nmm-sandbox's `tools/run_test_yaml.sh` for CI:

```yaml
- _target_: Qwen/Qwen3-8B/megatron_lm_ptq.yaml
  pipeline:
    allow_to_fail: true
  skip: false
  note: "known flaky"
```

Overrides are flattened to dot-notation and passed as nemo-run CLI args (e.g., `pipeline.allow_to_fail=True`).
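The flattening step can be sketched as follows (illustrative, not the launcher's exact code):

```python
# Flatten nested override dicts into dot-notation CLI args,
# e.g. {"pipeline": {"allow_to_fail": True}} -> ["pipeline.allow_to_fail=True"].
def flatten(overrides: dict, prefix: str = "") -> list:
    args = []
    for key, value in overrides.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            args.extend(flatten(value, prefix=f"{dotted}."))
        else:
            args.append(f"{dotted}={value}")
    return args

print(flatten({"pipeline": {"allow_to_fail": True, "skip": False}}))
# → ['pipeline.allow_to_fail=True', 'pipeline.skip=False']
```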
### Global Variables

Pipeline YAMLs support `<<global_vars.X>>` interpolation for sharing values across tasks:

```yaml
pipeline:
  global_vars:
    hf_model: /hf-local/Qwen/Qwen3-8B

  task_0:
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>

  task_1:
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
```

This is resolved in `SandboxPipeline.__post_init__` using regex substitution, not OmegaConf (which fails on isolated sub-configs in nemo-run).
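The substitution can be sketched with `re.sub`; the real resolution lives in `SandboxPipeline.__post_init__` and may differ in detail (e.g., how it walks the config tree):

```python
# Replace each <<global_vars.name>> token with its value from the
# pipeline-level global_vars mapping.
import re

def resolve_global_vars(text: str, global_vars: dict) -> str:
    return re.sub(r"<<global_vars\.(\w+)>>",
                  lambda m: str(global_vars[m.group(1)]),
                  text)

gv = {"hf_model": "/hf-local/Qwen/Qwen3-8B"}
print(resolve_global_vars("HF_MODEL_CKPT=<<global_vars.hf_model>>", gv))
# → HF_MODEL_CKPT=/hf-local/Qwen/Qwen3-8B
```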
### Metadata

Each experiment writes `metadata.json` to `experiments/<title>/<id>/`:

```json
{
  "experiment_id": "cicd_1773420387",
  "job_name": "Qwen3-8B_NVFP4_DEFAULT_CFG",
  "allow_to_fail": false,
  "note": ""
}
```

This is used by:

- `tools/wait_for_experiments.sh` — skip blocking on `allow_to_fail` failures
- `tools/post_review_to_gitlab.sh` — create/update GitLab issues for allowed failures
- Claude Code's `review-logs` skill — emit `<system-out>` instead of `<failure>` in JUnit XML
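How a consumer reads this file can be sketched as below; `blocks_pipeline` is a hypothetical helper mirroring the `wait_for_experiments.sh` behavior described above, not code from this PR.

```python
# A failed job only blocks the pipeline when allow_to_fail is false.
import io
import json

def blocks_pipeline(metadata_file, job_failed: bool) -> bool:
    meta = json.load(metadata_file)
    return job_failed and not meta.get("allow_to_fail", False)

meta = io.StringIO(json.dumps({
    "experiment_id": "cicd_1773420387",
    "job_name": "Qwen3-8B_NVFP4_DEFAULT_CFG",
    "allow_to_fail": True,
    "note": "known flaky",
}))
print(blocks_pipeline(meta, job_failed=True))  # → False (allowed to fail)
```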
## Using Claude Code with the Launcher

Claude Code can create a tight feedback loop for model quantization experiments: configure → submit → monitor → diagnose → fix → resubmit — all from the CLI.

### Setup

Install Claude Code and ensure the launcher is ready:

```bash
npm install -g @anthropic-ai/claude-code
cd Model-Optimizer/launcher
git submodule update --init --recursive
```
### Workflow: Submit and Monitor

Ask Claude Code to launch a job and wait for results:

```text
> Run Qwen3-8B quantization on OCI-HSG and wait for it to finish

Claude will:
1. Run: uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --yes
2. Monitor with: NEMORUN_HOME=$(pwd) uv run nemo experiment status <id>
3. Fetch logs when done: NEMORUN_HOME=$(pwd) uv run nemo experiment logs <id> 0
4. Report the MMLU score and pass/fail status
```
### Workflow: Diagnose Failures

When a job fails, ask Claude Code to analyze the logs:

```text
> /review-logs

Claude will:
1. Find all experiments in experiments/
2. Fetch logs via nemo experiment logs
3. Read and analyze error tracebacks
4. Produce a structured report with root cause and suggested fix
5. Write a JUnit XML for CI integration
```
### Workflow: Add a New Model

Ask Claude Code to set up a new model config:

```text
> Add Llama-3.1-70B quantization config. It needs 2 nodes with 4 GPUs each.

Claude will:
1. Create Meta/Llama-3.1-70B/megatron_lm_ptq.yaml
2. Set appropriate TP/EP based on model size
3. Reference the correct service script
4. Test with --dryrun to verify the config
```
### Workflow: Iterate on Failures

Claude Code can fix issues and resubmit in a loop:

```text
> The job failed with CUDA OOM. Try reducing the sequence length to 4096 and resubmit.

Claude will:
1. Edit the YAML config
2. Resubmit with uv run launch.py --yaml <config> --yes
3. Monitor and report results
```
### Workflow: Reproduce and Compare

Use `--to-yaml` to capture configs and compare runs:

```text
> Dump the resolved config for Qwen3-8B, then run it on both OCI-HSG and CW-DFW

Claude will:
1. Dump: uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --to-yaml resolved.yaml
2. Run on OCI-HSG: SLURM_CLUSTER=oci_hsg uv run slurm.py --yaml resolved.yaml --yes
3. Run on CW-DFW: SLURM_CLUSTER=cw_dfw uv run slurm.py --yaml resolved.yaml --yes
4. Compare MMLU results
```
### Skills

The following Claude Code skills are available in the nmm-sandbox project:

| Skill | Trigger | Description |
|---|---|---|
| `/review-logs` | After job completion or failure | Analyze experiment logs, diagnose failures, produce JUnit XML |
| `/wait-for-jobs` | After detached submission | Poll experiment status until all jobs finish |
| `/eagle3-new-model` | Adding a new EAGLE3 model | Generate pipeline YAML for a new model |
### CI Integration

In CI, Claude Code runs automatically after each test job to:

1. Fetch and analyze all experiment logs
2. Generate `claude_analysis.md` with structured findings
3. Write `claude_review_rspec.xml` for GitLab test reporting
4. Post failure summaries as MR comments (via `tools/post_review_to_gitlab.sh`)
5. Create/update GitLab issues for `allow_to_fail` jobs that are consistently failing

If the main script crashes before the review runs, an `after_script` fallback posts the captured job output to the MR so failures are always visible.