
add: ModelOpt Launcher for Slurm job submission #1031

Open
ChenhanYu wants to merge 4 commits into main from chenhan/modelopt-launcher

Conversation

ChenhanYu (Collaborator) commented Mar 12, 2026

Add launcher/ module with launch.py that submits quantization, training, and evaluation jobs to Slurm clusters via nemo-run. Produces the same code/ layout as nmm-sandbox's slurm.py so that the same YAML configs work in both. Includes Megatron-LM and Model-Optimizer as submodules.

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this
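A usage sketch based on the module docstring in launcher/launch.py (quoted later in this page); the config paths are illustrative:

```bash
# Submit a single task or a multi-task pipeline defined in YAML
uv run launch.py task=@configs/quantize/Qwen3-8B.yaml --yes
uv run launch.py pipeline=@configs/pipeline/eagle3.yaml --yes

# Point hf_local at a local HuggingFace cache to run with Docker instead of Slurm
uv run launch.py task=@configs/quantize/Qwen3-8B.yaml hf_local=/mnt/hf-local --yes
```

Per the same docstring, SLURM_HOST, SLURM_ACCOUNT, SLURM_JOB_DIR, SLURM_HF_LOCAL, HF_TOKEN, and NEMORUN_HOME control the remote-job defaults.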

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • ModelOpt Launcher: submit quantization, training, and evaluation jobs via Slurm or local Docker; CLI entrypoint, YAML-driven multi-task pipelines, environment defaults, job orchestration, and experiment metadata.
  • Documentation

    • Added comprehensive launcher README with quick start, config format, examples, and usage guidance.
  • Chores

    • Added project metadata, two submodule pointers, Slurm config factory, shared launcher core, service utilities, and a Megatron-LM quantization workflow script.

Add launcher/ module with launch.py that submits quantization, training,
and evaluation jobs to Slurm clusters via nemo-run. Produces identical
code/ layout as nmm-sandbox's slurm.py so the same YAML configs work
in both. Includes Megatron-LM and Model-Optimizer as submodules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu requested a review from a team as a code owner March 12, 2026 19:46
coderabbitai bot (Contributor) commented Mar 12, 2026

📝 Walkthrough

Adds a ModelOpt Launcher: new launcher package with Slurm config factory, pipeline/task dataclasses, Slurm and local Docker executors, YAML-driven orchestration, packaging integration, service scripts, docs, pyproject, and two git submodules (Megatron‑LM, Model‑Optimizer).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Repository config / submodules**<br>`.gitmodules` | Adds two git submodules under `launcher/modules/`: Megatron-LM and Model-Optimizer. |
| **Package metadata**<br>`launcher/pyproject.toml` | New pyproject declaring modelopt-launcher, Python >=3.10, and dependencies (nemo-run git ref, pyyaml). |
| **Launcher package init & entry**<br>`launcher/__init__.py`, `launcher/launch.py` | Adds Apache-2.0 header and module docstring; new CLI entrypoint `launch`, Slurm wiring (register factory), packager setup, env defaults, and orchestration choosing Slurm vs local Docker. |
| **Core runtime & orchestration**<br>`launcher/core.py` | Large new core: SandboxTask/SandboxPipeline dataclasses, global-variable interpolation, YAML task creation, get_default_env, dynamic SlurmConfig typing, build_slurm_executor/build_docker_executor, version reporting, and run_jobs orchestration with experiment metadata output. |
| **Slurm config factory**<br>`launcher/slurm_config.py` | New SlurmConfig dataclass and slurm_factory (CLI/autoconvert) producing Slurm configuration from args and env with sensible defaults. |
| **Service utilities & scripts**<br>`launcher/services/service_utils.sh`, `launcher/services/megatron-lm/quantize/quantize.sh` | Adds service utilities (MPI/PMIx rank handling, error/report helpers, conditional dep install, version stamping) and a Megatron-LM quantization workflow script chaining quantize→eval→convert/export. |
| **Docs**<br>`launcher/README.md` | New comprehensive README documenting usage, YAML format, examples, flags, and internals. |
| **Submodule pointers**<br>`launcher/modules/Megatron-LM`, `launcher/modules/Model-Optimizer` | Adds single-line submodule commit references for each submodule. |
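For orientation, a single-task YAML consistent with the launcher code quoted later in this review would look roughly like the sketch below: create_task_from_yaml reads a script, a slurm_config block whose _factory_ key names the factory, plus optional args and environment. All values here are illustrative.

```yaml
# Illustrative task config; field names follow create_task_from_yaml in launcher/launch.py.
script: services/megatron-lm/quantize/quantize.sh
slurm_config:
  _factory_: slurm_factory      # name of the registered Slurm config factory
  account: my-slurm-account     # placeholder; usually supplied via SLURM_ACCOUNT
  partition: batch
  nodes: 1
  gpus_per_node: 8
args:
  - "--some-flag=value"         # hypothetical script argument
environment:
  - HF_MODEL_CKPT: /hf-local/models/Qwen3-8B   # placeholder variable name and path
```

Within a pipeline YAML, environment and args values may also use `<<global_vars.X>>` placeholders, which SandboxPipeline interpolates across tasks.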

Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as User/CLI
    participant Factory as SlurmConfig\nFactory
    participant Pipeline as SandboxPipeline
    participant Packager as PatternPackager
    participant Executor as Executor\n(Slurm/Docker)
    participant Run as NemoRun\nExperiment
    participant Metadata as Metadata\n(files)

    CLI->>Factory: invoke factory (env & overrides)
    Factory-->>CLI: SlurmConfig

    CLI->>Pipeline: load task or pipeline YAML
    Pipeline->>Pipeline: interpolate globals, build job_table
    Pipeline-->>CLI: job_table

    CLI->>Packager: prepare code tarball(s)
    Packager-->>CLI: package artifacts

    loop per task
        CLI->>Executor: build_slurm_executor() or build_docker_executor()
        Executor-->>CLI: configured executor

        CLI->>Run: create Script task with env/args/executor
        Run->>Run: attach to Experiment (dependencies)
    end

    CLI->>Run: run experiment (detach?)
    Run-->>Metadata: write experiment metadata (id, job_dir, note)
    Run-->>CLI: experiment results / status
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'add: ModelOpt Launcher for Slurm job submission' directly describes the main change: adding a new launcher module with Slurm job submission capabilities, which aligns with the core objective and primary files added (launcher/launch.py, launcher/core.py, launcher/slurm_config.py). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%. |
| Security Anti-Patterns | ✅ Passed | Pull request introduces launcher functionality without critical security anti-patterns; no unsafe torch.load, numpy.load, trust_remote_code, eval/exec, or nosec comments detected. |



A collaborator commented:

Is there any way to use ModelOpt from the repo root instead of adding additional submodule?


codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.09%. Comparing base (a56b6f3) to head (ad1f0d8).
⚠️ Report is 23 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1031      +/-   ##
==========================================
- Coverage   71.73%   70.09%   -1.65%
==========================================
  Files         211      221      +10
  Lines       23949    25541    +1592
==========================================
+ Hits        17181    17902     +721
- Misses       6768     7639     +871
```

☔ View full report in Codecov by Sentry.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitmodules:
- Around line 4-6: Remove the self-referential submodule entry for
"launcher/modules/Model-Optimizer" in .gitmodules (the block with path =
launcher/modules/Model-Optimizer and url =
https://github.com/NVIDIA/Model-Optimizer.git); instead rely on the top-level
Model-Optimizer working tree for packaging/mounting and remove any code that
updates the nested gitlink for launcher/modules/Model-Optimizer so you no longer
create or expect a nested checkout.
- Around line 1-3: The .gitmodules entry for the submodule at path
"launcher/modules/Megatron-LM" currently points to the personal fork URL
"https://github.com/AAnoosheh/Megatron-LM.git"; update that submodule URL to the
canonical upstream "https://github.com/NVIDIA/Megatron-LM.git" (and then run git
submodule sync && git submodule update --init --recursive) so the
launcher/modules/Megatron-LM submodule references NVIDIA/Megatron-LM instead of
the personal fork.

In `@launcher/launch.py`:
- Around line 87-89: The launcher sets container_mounts to mount the host cache
at /hf-local (variable container_mounts, local hf_local), but the code still
defaults HF_HOME to /hf-cache so the mounted cache is ignored; update places
that set the HF_HOME default (where container_mounts is None and the other
occurrences at the blocks referenced around lines 111-122 and 330-336) to align
HF_HOME with hf_local (i.e., use the hf_local value or "/hf-local" as the
HF_HOME default) so the injected environment points HuggingFace to the actual
mounted path.
- Around line 377-379: The code sets NEMORUN_HOME via
run.config.set_nemorun_home but still writes/reads metadata.json relative to the
process CWD; update all metadata file operations to use the configured NeMo Run
home instead of os.getcwd(): replace os.getcwd()/bare relative paths with
run.config.get_nemorun_home() (or run.config.nemorun_home) when constructing
paths for metadata.json and any related experiment directories (e.g., where
metadata is written around run.config.set_nemorun_home and the similar block at
the other location), ensuring both reads and writes use the same configured
home.
- Around line 407-425: The issue is that DEFAULT_LOCAL_ENV and DEFAULT_SLURM_ENV
are applied after task_env, causing task-level environment keys to be
overwritten; in the block that builds task_env (around task.environment,
hf_local, get_docker_executor and get_slurm_executor) apply the defaults before
merging task values or merge them using defaults as base (e.g., start with a
copy of DEFAULT_LOCAL_ENV/DEFAULT_SLURM_ENV then update with task.environment)
or use setdefault semantics so that explicit task.environment entries (HF_HOME,
HF_TOKEN, MLM_SKIP_INSTALL, LAUNCH_SCRIPT) override the launcher defaults;
adjust the update order where task_env.update(DEFAULT_LOCAL_ENV) /
task_env.update(DEFAULT_SLURM_ENV) currently occur so task values take
precedence.
- Around line 171-173: The code currently looks up a callable via globals()
using function_name and invokes it—replace this with an explicit allowlist
factory map (e.g., a dict mapping allowed factory names to their functions) and
use that map to resolve the factory instead of globals(); in the block that sets
function_name and slurm_config, validate that function_name is present in the
factory map, raise a clear error if not allowed, then call the resolved factory
with slurm_config (preserving the existing use of config_from_yaml["script"],
function_name, and slurm_config variables) to produce the slurm config.
A sketch of this allowlist approach appears after these inline comments.
- Around line 250-270: packager is using LAUNCHER_DIR for all include_pattern
entries, causing "services/*" and "tests/*" (which live at the repository root /
MODELOPT_ROOT) to be skipped; update the packager call so the relative_path list
maps the first five patterns to LAUNCHER_DIR and the last two patterns
("services/*", "tests/*") to MODELOPT_ROOT (i.e., make relative_path =
[LAUNCHER_DIR]*5 + [MODELOPT_ROOT]*2) so the include_pattern entries resolve to
the correct directories; keep the include_pattern list unchanged and only adjust
the relative_path mapping in the packager invocation.
- Around line 417-424: Replace direct uses of the internal Experiment._id:
implement a small helper get_experiment_id(exp) that obtains the experiment id
via supported public APIs (try exp.id, then exp.get_id(), then
exp.metadata.get("id") or exp.to_dict().get("id")) and raise a clear error
instructing to update nemo_run if none exist; then call get_experiment_id(exp)
instead of exp._id in get_docker_executor(...) and get_slurm_executor(...) (and
the other occurrences referenced) so the launcher no longer depends on the
private _id attribute.

In `@launcher/modules/Megatron-LM`:
- Line 1: Document why the launcher submodule uses the personal fork
https://github.com/AAnoosheh/Megatron-LM.git (branch
aanoosheh/modelopt-example-import-order) by adding a short rationale in the repo
(e.g., README or launcher packager docs): list the exact diffs/changes required
from the fork that the launcher depends on (identify files/functions/patches),
explain why those changes cannot live in NVIDIA/Megatron-LM as-is, state whether
and when the changes will be upstreamed or removed (upstream PR IDs or
timeline), and describe the maintenance plan for this fork (how it will be
rebased/merged with upstream security fixes and who owns it). Ensure references
to the submodule URL, branch name, and the launcher packager are included so
reviewers can locate the dependency and the specific modifications.
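For the globals() lookup flagged above (create_task_from_yaml in launcher/launch.py), a minimal sketch of the allowlist approach; the map contents and the helper name are illustrative, and slurm_factory is assumed to be the factory already defined in launch.py:

```python
# Sketch: resolve the YAML's "_factory_" name through an explicit allowlist
# instead of globals(). ALLOWED_SLURM_FACTORIES and resolve_slurm_config are
# illustrative names; slurm_factory is the factory defined in launcher/launch.py.
ALLOWED_SLURM_FACTORIES = {
    "slurm_factory": slurm_factory,
}


def resolve_slurm_config(config_from_yaml: dict):
    """Build a Slurm config from a task YAML using only allowlisted factories."""
    slurm_kwargs = dict(config_from_yaml["slurm_config"])
    function_name = slurm_kwargs.pop("_factory_")
    factory = ALLOWED_SLURM_FACTORIES.get(function_name)
    if factory is None:
        raise ValueError(
            f"Unknown slurm_config factory {function_name!r}; "
            f"allowed: {sorted(ALLOWED_SLURM_FACTORIES)}"
        )
    return factory(**slurm_kwargs)
```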

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: af7e3654-7a2f-40c2-9b19-f85cc6e8f625

📥 Commits

Reviewing files that changed from the base of the PR and between 69c0d47 and d28acd3.

📒 Files selected for processing (6)
  • .gitmodules
  • launcher/__init__.py
  • launcher/launch.py
  • launcher/modules/Megatron-LM
  • launcher/modules/Model-Optimizer
  • launcher/pyproject.toml

Comment on lines +1 to +3
[submodule "launcher/modules/Megatron-LM"]
path = launcher/modules/Megatron-LM
url = https://github.com/AAnoosheh/Megatron-LM.git

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the canonical GitHub repository for NVIDIA Megatron-LM, and is AAnoosheh/Megatron-LM an official upstream or a personal fork?

💡 Result:

  • Canonical (official) NVIDIA Megatron-LM repo: NVIDIA/Megatron-LM on GitHub. [1] (also linked from NVIDIA’s Megatron Core developer page under “Megatron-LM”). [3]
  • Is AAnoosheh/Megatron-LM official upstream? No—GitHub shows it is “forked from NVIDIA/Megatron-LM”, so it’s a personal fork, not the official upstream repository. [2]

Sources:
[1] NVIDIA/Megatron-LM GitHub repo page (github.com)
[2] AAnoosheh/Megatron-LM GitHub repo page (shows “forked from NVIDIA/Megatron-LM”) (github.com)
[3] NVIDIA Developer “Megatron Core” page (links to Megatron-LM on GitHub) (developer.nvidia.com)



Replace personal fork submodule with canonical upstream.

launcher/modules/Megatron-LM points to AAnoosheh/Megatron-LM, which is a personal fork of the official NVIDIA/Megatron-LM repository. Use the canonical upstream NVIDIA/Megatron-LM instead to ensure clear provenance and maintainability of this core runtime dependency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitmodules around lines 1 - 3, The .gitmodules entry for the submodule at
path "launcher/modules/Megatron-LM" currently points to the personal fork URL
"https://github.com/AAnoosheh/Megatron-LM.git"; update that submodule URL to the
canonical upstream "https://github.com/NVIDIA/Megatron-LM.git" (and then run git
submodule sync && git submodule update --init --recursive) so the
launcher/modules/Megatron-LM submodule references NVIDIA/Megatron-LM instead of
the personal fork.
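If the launcher can track upstream directly, the corrected entry is just a URL change; a sketch (whether fork-specific patches are still needed is the open question raised in the follow-up comment further down):

```
[submodule "launcher/modules/Megatron-LM"]
	path = launcher/modules/Megatron-LM
	url = https://github.com/NVIDIA/Megatron-LM.git
```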

Comment on lines +4 to +6
[submodule "launcher/modules/Model-Optimizer"]
path = launcher/modules/Model-Optimizer
url = https://github.com/NVIDIA/Model-Optimizer.git

⚠️ Potential issue | 🔴 Critical

Don't add this repository as a submodule of itself.

Pointing launcher/modules/Model-Optimizer back at NVIDIA/Model-Optimizer creates a self-referential submodule. Recursive checkouts can keep nesting launcher/modules/Model-Optimizer/..., and the launcher ends up with two competing copies of ModelOpt checked out at once.

🧭 Minimal direction for the fix
```diff
-[submodule "launcher/modules/Model-Optimizer"]
-	path = launcher/modules/Model-Optimizer
-	url = https://github.com/NVIDIA/Model-Optimizer.git
```

Use the top-level working tree for ModelOpt packaging/mounting instead of a nested checkout, and drop the matching gitlink update.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitmodules around lines 4 - 6, Remove the self-referential submodule entry
for "launcher/modules/Model-Optimizer" in .gitmodules (the block with path =
launcher/modules/Model-Optimizer and url =
https://github.com/NVIDIA/Model-Optimizer.git); instead rely on the top-level
Model-Optimizer working tree for packaging/mounting and remove any code that
updates the nested gitlink for launcher/modules/Model-Optimizer so you no longer
create or expect a nested checkout.

Comment on lines +250 to +270
# Resolve paths relative to Model-Optimizer root (parent of launcher/)
LAUNCHER_DIR = os.path.dirname(os.path.abspath(__file__))
MODELOPT_ROOT = os.path.dirname(LAUNCHER_DIR)

# All paths relative to LAUNCHER_DIR so code/ mirrors the launcher directory.
# This produces the same layout as nmm-sandbox's slurm.py:
# code/modules/Megatron-LM/megatron/...
# code/modules/Model-Optimizer/modelopt/...
# code/services/...
packager = run.PatternPackager(
    include_pattern=[
        "modules/Megatron-LM/megatron/*",
        "modules/Megatron-LM/examples/*",
        "modules/Megatron-LM/*.py",
        "modules/Model-Optimizer/modelopt/*",
        "modules/Model-Optimizer/examples/*",
        "services/*",
        "tests/*",
    ],
    relative_path=[LAUNCHER_DIR] * 7,
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Packager block:"
sed -n '250,270p' launcher/launch.py

echo
echo "Matching directories anywhere in the repo:"
fd -HI -td '^services$|^tests$|^modules$' .

echo
echo "Matching directories specifically under launcher/:"
fd -HI -td '^services$|^tests$|^modules$' launcher

Repository: NVIDIA/Model-Optimizer

Length of output: 1163


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check what's under launcher/
echo "=== Contents of launcher/ (first 2 levels) ==="
tree -L 2 launcher/ 2>/dev/null || find launcher/ -maxdepth 2 -type d | head -20

echo
echo "=== Root-level directories ==="
ls -ld */ 2>/dev/null | grep -E 'services|tests|modules' || echo "None matching pattern"

echo
echo "=== Checking if these paths exist ==="
for path in launcher/services launcher/tests launcher/modules modelopt/torch services tests; do
  if [ -d "$path" ]; then
    echo "$path EXISTS"
  else
    echo "$path DOES NOT EXIST"
  fi
done

echo
echo "=== Root level tests/ directory (first few entries) ==="
ls -la tests/ 2>/dev/null | head -10 || echo "tests/ does not exist at root"

echo
echo "=== Checking services/ anywhere ==="
find . -maxdepth 2 -type d -name services 2>/dev/null || echo "No services/ directory found"

Repository: NVIDIA/Model-Optimizer

Length of output: 1159


Critical: packager will silently miss tests and services directories.

The packager is configured to resolve services/* and tests/* under launcher/, but neither directory exists there:

  • launcher/services/ does not exist anywhere in the repository
  • launcher/tests/ does not exist; the actual tests/ directory is at the repository root
  • Only launcher/modules/ exists and contains the expected subdirectories

Since MODELOPT_ROOT is computed but unused, the patterns for services/* and tests/* should resolve relative to MODELOPT_ROOT instead of LAUNCHER_DIR to match the actual directory layout. The current configuration will silently omit these files from the remote package without error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` around lines 250 - 270, packager is using LAUNCHER_DIR
for all include_pattern entries, causing "services/*" and "tests/*" (which live
at the repository root / MODELOPT_ROOT) to be skipped; update the packager call
so the relative_path list maps the first five patterns to LAUNCHER_DIR and the
last two patterns ("services/*", "tests/*") to MODELOPT_ROOT (i.e., make
relative_path = [LAUNCHER_DIR]*5 + [MODELOPT_ROOT]*2) so the include_pattern
entries resolve to the correct directories; keep the include_pattern list
unchanged and only adjust the relative_path mapping in the packager invocation.

@@ -0,0 +1 @@
Subproject commit 1e064f361256f34bf179c0cb808fd6287538f85a

⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

Script executed:

#!/bin/bash
# Description: Verify submodule configuration and check commit provenance

# Check .gitmodules configuration
echo "=== Submodule Configuration ==="
cat .gitmodules | grep -A 2 "Megatron-LM"

# Verify the commit exists in the personal fork
echo -e "\n=== Checking commit in personal fork ==="
git ls-remote https://github.com/AAnoosheh/Megatron-LM.git | grep "1e064f361256f34bf179c0cb808fd6287538f85a" || echo "Commit not found in default refs"

# Check if this commit exists in the official NVIDIA repository
echo -e "\n=== Checking if commit exists in official NVIDIA repo ==="
git ls-remote https://github.com/NVIDIA/Megatron-LM.git | head -20

Repository: NVIDIA/Model-Optimizer
Exit code: 0

stdout:

=== Submodule Configuration ===
[submodule "launcher/modules/Megatron-LM"]
	path = launcher/modules/Megatron-LM
	url = https://github.com/AAnoosheh/Megatron-LM.git
[submodule "launcher/modules/Model-Optimizer"]
	path = launcher/modules/Model-Optimizer

=== Checking commit in personal fork ===
1e064f361256f34bf179c0cb808fd6287538f85a	refs/heads/aanoosheh/modelopt-example-import-order

=== Checking if commit exists in official NVIDIA repo ===
29e798a48da0f3e091c9364c8735f4021ea65811	HEAD
e6e6d4c79de3e40babf16d88d28dae5356ca0ae3	refs/heads/23.06_pt_23.09
375395c187ff64b8d56a1cd40572bc779864b1bd	refs/heads/23.08
dd182053afc66a2195e018571537c8ce9f890df4	refs/heads/24.09-alpha
ad7d1695e00b0edd62990fe96277db0d30edf273	refs/heads/24.09-beta
53596ecae8b6e85890768f2b169fbf7911e4ed6c	refs/heads/25.04-alpha
d33919011c7010e026d7c260d9d3016852cedcd2	refs/heads/25.09-alpha
06c70f44d56298a6d19dd30ff1ebeb0e85770518	refs/heads/bump-ci-container-2026-02-20-core_r0.16.0
71247489edcee7d4ca9c3e944eb82683d96c9a7e	refs/heads/bump-ci-container-2026-02-23-main
7d2aeda817a4855d7a6aef1d7d65df615a018a1c	refs/heads/bump-ci-container-2026-02-25-core_r0.16.0
a63f83247324f395383788301714816fb059c2b2	refs/heads/bump-ci-container-2026-02-25-main
84295f3771fc28a8626ef9735915cef21a010817	refs/heads/chcui/gpt_oss
cba6b0ed0f1ee2694ae9f7059f6d4b8a7cc995fb	refs/heads/cherry-pick-2568-core_dev_r0.15.0
cba6b0ed0f1ee2694ae9f7059f6d4b8a7cc995fb	refs/heads/cherry-pick-2640-core_dev_r0.15.0
082fa76a89a376c049a8998f82d87cae2bea5009	refs/heads/cherry-pick-3319-core_r0.16.0
082fa76a89a376c049a8998f82d87cae2bea5009	refs/heads/cherry-pick-3371-core_r0.16.0
5a1bf638162875f9cd296d6785e12df53fa1ad11	refs/heads/cherry-pick-3472-core_r0.16.0
955aff043676ecab128a103b9827935b0e245972	refs/heads/cherry-pick-3483-core_r0.16.0
55df528944e0bd5de6ea259cab9776ceac961801	refs/heads/cherry-pick-3583-core_r0.16.0
1b644f55f6e0ea9ed3e26c0c9c94649a59a6dd98	refs/heads/cherry-pick-3615-core_r0.16.0

Script executed:

#!/bin/bash
# Search for documentation about the Megatron-LM fork usage
echo "=== Checking for documentation or comments ==="
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "*.rst" \) | xargs grep -l "Megatron" 2>/dev/null | head -10

echo -e "\n=== Checking git history for submodule addition ==="
git log --oneline --all -- .gitmodules | head -5

echo -e "\n=== Checking commit message that added Megatron-LM submodule ==="
git log --all -p -- .gitmodules | grep -A 20 "Megatron-LM" | head -30

Repository: NVIDIA/Model-Optimizer
Exit code: 0

stdout:

=== Checking for documentation or comments ===
./CHANGELOG.rst
./CLAUDE.md
./CONTRIBUTING.md
./README.md
./docs/source/deployment/3_unified_hf.rst
./docs/source/getting_started/1_overview.rst
./docs/source/guides/3_pruning.rst
./docs/source/guides/7_nas.rst
./docs/source/guides/_pytorch_quantization.rst
./examples/llm_distill/README.md

=== Checking git history for submodule addition ===
d28acd3 add: ModelOpt Launcher for Slurm job submission

=== Checking commit message that added Megatron-LM submodule ===
    in both. Includes Megatron-LM and Model-Optimizer as submodules.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..23a5af2
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,6 @@
+[submodule "launcher/modules/Megatron-LM"]
+	path = launcher/modules/Megatron-LM
+	url = https://github.com/AAnoosheh/Megatron-LM.git
+[submodule "launcher/modules/Model-Optimizer"]
+	path = launcher/modules/Model-Optimizer
+	url = https://github.com/NVIDIA/Model-Optimizer.git

Script executed:

#!/bin/bash
# Check documentation files for context
echo "=== README.md (first 100 lines) ==="
head -100 ./README.md

echo -e "\n=== CHANGELOG.rst (entries mentioning Megatron or launcher) ==="
grep -A 3 -B 1 -i "megatron\|launcher" ./CHANGELOG.rst | head -40

echo -e "\n=== Check launcher directory structure ==="
find ./launcher -type f -name "*.md" -o -name "*.txt" -o -name "*.py" | head -20

Repository: NVIDIA/Model-Optimizer
Exit code: 0

stdout:

=== README.md (first 100 lines) ===
<div align="center">

![Banner image](docs/source/assets/model-optimizer-banner.png)

# NVIDIA Model Optimizer

[![Documentation](https://img.shields.io/badge/Documentation-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/Model-Optimizer)
[![version](https://img.shields.io/pypi/v/nvidia-modelopt?label=Release)](https://pypi.org/project/nvidia-modelopt/)
[![license](https://img.shields.io/badge/License-Apache%202.0-blue)](./LICENSE)

[Documentation](https://nvidia.github.io/Model-Optimizer) |
[Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)

</div>

______________________________________________________________________

**NVIDIA Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.

**[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.

**[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.

**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.

## Latest News

- [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
- [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
- [2025/10/07] [BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
- [2025/09/17] [BLOG: An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
- [2025/09/11] [BLOG: How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
- [2025/08/29] [BLOG: Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
- [2025/08/01] [BLOG: Optimizing LLMs for Performance and Accuracy with Post-Training Quantization](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)
- [2025/06/24] [BLOG: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
- [2025/05/14] [NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/)
- [2025/04/21] [Adobe optimized deployment using Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)
- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).
- [2025/01/28] Model Optimizer is now open source!

<details close>
<summary>Previous News</summary>

- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: [8B](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8), [70B](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8), [405B](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8).
- [2024/09/10] [Post-Training Quantization of LLMs with NVIDIA NeMo and Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/).
- [2024/08/28] [Boosting Llama 3.1 405B Performance up to 44% with Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
- [2024/08/28] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
- [2024/08/15] New features in recent releases: [Cache Diffusion](./examples/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/24.09/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)
- [2024/05/08] [Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
- [2024/03/27] [Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
- [2024/03/18] [GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/)
- [2024/03/07] [Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster](https://developer.nvidia.com/blog/tensorrt-accelerates-stable-diffusion-nearly-2x-faster-with-8-bit-post-training-quantization/)
- [2024/02/01] [Speed up inference with Model Optimizer quantization techniques in TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md)

</details>

## Install

To install stable release packages for Model Optimizer with `pip` from [PyPI](https://pypi.org/project/nvidia-modelopt/):

```bash
pip install -U nvidia-modelopt[all]
```

To install from source in editable mode with all development dependencies or to use the latest features, run:

```bash
# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer

pip install -e .[dev]
```

You can also directly use the [TensorRT-LLM docker images](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)
(e.g., `nvcr.io/nvidia/tensorrt-llm/release:<version>`), which have Model Optimizer pre-installed.
Make sure to upgrade Model Optimizer to the latest version as described above.
Visit our [installation guide](https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html) for
more fine-grained control on installed dependencies or for alternative docker images and environment variables to setup.

## Techniques

<div align="center">

| **Technique** | **Description** | **Examples** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/vlm_ptq/)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Quantization Aware Training | Refine accuracy even further with a few training steps! | \[[NeMo](./examples/llm_qat#nemo-qatqad-simplified-flow-example)\] \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \[[PyTorch](./examples/pruning/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html)\] |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \[[NeMo](./examples/llm_distill#knowledge-distillation-kd-for-nvidia-nemo-models)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | \[[Megatron](./examples/speculative_decoding#mlm-example)\] \[[Hugging Face](./examples/speculative_decoding/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/5_speculative_decoding.html)\] |
| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | \[[PyTorch](./examples/llm_sparsity/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/6_sparsity.html)\] |

</div>


=== CHANGELOG.rst (entries mentioning Megatron or launcher) ===
- Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Also add example for distillation with Megatron-Bridge framework. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
- Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
- Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
- Add support for context parallelism in Eagle speculative decoding for huggingface and megatron core models.
- Add unified Hugging Face export support for diffusers pipelines/components.
- Add LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow.
- Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
--

- Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).

**New Features**

- Add support for Transformer Engine quantization for Megatron Core models.
- Add support for Qwen3-Next model quantization.
- Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
- Add support for KV Cache Quantization for vLLM FakeQuant PTQ script. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#Calibrate-and-serve-fake-quant-model-in-vLLM>`__ for more details.
--
- Add support to enable custom emulated quantization backend. See :meth:`register_quant_backend <modelopt.torch.quantization.nn.modules.tensor_quantizer.register_quant_backend>`` for more details. See an example in ``tests/unit/torch/quantization/test_custom_backend.py``.
- Add ``examples/llm_qad`` for QAD training with Megatron-LM.

**Deprecations**

--
- Add ``specdec_bench`` example to benchmark speculative decoding performance. See `examples/specdec_bench/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench#speculative-decoding-benchmark>`_ for more details.
- Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
- Add KL Divergence loss based auto_quantize method. See `auto_quantize API docs <https://nvidia.github.io/Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize>`_ for more details.
- Add support for saving and resuming auto_quantize search state. This speeds up the auto_quantize process by skipping the score estimation step if the search state is provided.
- Add flag ``trt_plugins_precision`` in ONNX autocast to indicate custom ops precision. This is similar to the flag already existing in the quantization workflow.
--

- Deprecate ``examples/megatron-lm`` in favor of more detailed documentation in `Megatron-LM/examples/post_training/modelopt <https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt>`_.

**Misc**

--
- Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
- Add Minitron pruning example for Megatron-LM framework. See `Megatron-LM/examples/post_training/modelopt <https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt>`_ for more details.


=== Check launcher directory structure ===
./launcher/__init__.py
./launcher/launch.py

Script executed:

#!/bin/bash
# Check launcher code to understand dependencies on Megatron-LM fork
echo "=== Launcher launch.py content ==="
cat ./launcher/launch.py

echo -e "\n=== Check if launcher references Megatron-LM ==="
grep -r "Megatron" ./launcher/ 2>/dev/null || echo "No direct Megatron references found"

echo -e "\n=== Check launcher __init__.py ==="
cat ./launcher/__init__.py

echo -e "\n=== Check if Megatron-LM submodule is actually imported/used ==="
rg "launcher/modules/Megatron-LM" --type py --type md --type rst 2>/dev/null || echo "No references to submodule path found"

echo -e "\n=== Check recent commit on the fork branch ==="
git ls-remote https://github.com/AAnoosheh/Megatron-LM.git refs/heads/aanoosheh/modelopt-example-import-order

Repository: NVIDIA/Model-Optimizer
Exit code: 0

stdout:

=== Launcher launch.py content ===
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ModelOpt Launcher — submit quantization, training, and evaluation jobs to Slurm clusters.

Usage:
    uv run launch.py task=@configs/quantize/Qwen3-8B.yaml --yes
    uv run launch.py pipeline=@configs/pipeline/eagle3.yaml --yes
    uv run launch.py task=@configs/quantize/Qwen3-8B.yaml hf_local=/mnt/hf-local --yes

Environment variables:
    SLURM_HOST          Slurm login node hostname (required for remote jobs)
    SLURM_ACCOUNT       Slurm account/partition billing (default: from YAML)
    SLURM_JOB_DIR       Remote directory for job artifacts
    SLURM_HF_LOCAL      Path to HuggingFace model cache on the cluster
    HF_TOKEN            HuggingFace API token
    NEMORUN_HOME        NeMo Run home directory (default: current working directory)
"""

import dataclasses
import getpass
import json
import os
import re
import warnings
from dataclasses import dataclass

import nemo_run as run
import yaml

# ---------------------------------------------------------------------------
# Slurm configuration
# ---------------------------------------------------------------------------


@dataclass
class SlurmConfig:
    """Cluster-agnostic Slurm configuration.

    Users define cluster details in their YAML configs or override via CLI.
    No internal cluster defaults are embedded here.
    """

    host: str | None = None
    port: int = 22
    account: str | None = None
    partition: str = "batch"
    container: str | None = None
    modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt"
    container_mounts: list[str] | None = None
    srun_args: list[str] | None = None
    array: str | None = None
    nodes: int = 1
    ntasks_per_node: int = 1
    gpus_per_node: int = 1
    local: bool = False


@run.cli.factory
@run.autoconvert
def slurm_factory(
    host: str = os.environ.get("SLURM_HOST", ""),
    account: str = os.environ.get("SLURM_ACCOUNT", ""),
    partition: str = "batch",
    nodes: int = 1,
    ntasks_per_node: int = 1,
    gpus_per_node: int = 1,
    container: str = "nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5",
    modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt",
    container_mounts: list[str] | None = None,
    srun_args: list[str] | None = None,
    array: str | None = None,
) -> SlurmConfig:
    """Generic Slurm factory — configure via environment variables or CLI overrides."""
    if container_mounts is None:
        hf_local = os.environ.get("SLURM_HF_LOCAL", "/hf-local")
        container_mounts = ["{}:/hf-local".format(hf_local)]
    if srun_args is None:
        srun_args = ["--no-container-mount-home"]
    return SlurmConfig(
        host=host,
        account=account,
        partition=partition,
        nodes=nodes,
        ntasks_per_node=ntasks_per_node,
        gpus_per_node=gpus_per_node,
        container=container,
        modelopt_install_path=modelopt_install_path,
        container_mounts=container_mounts,
        srun_args=srun_args,
        array=array,
    )


# ---------------------------------------------------------------------------
# Default environment variables injected into every job
# ---------------------------------------------------------------------------

DEFAULT_SLURM_ENV = {
    "HF_HOME": "/hf-cache",
    "HF_TOKEN": os.getenv("HF_TOKEN", ""),
    "MLM_SKIP_INSTALL": "1",
    "LAUNCH_SCRIPT": "python",
}

DEFAULT_LOCAL_ENV = {
    "HF_HOME": "/hf-cache",
    "HF_TOKEN": os.getenv("HF_TOKEN", ""),
    "MLM_SKIP_INSTALL": "1",
}


# ---------------------------------------------------------------------------
# Task and pipeline dataclasses
# ---------------------------------------------------------------------------


@dataclass
class SandboxTask:
    """A single task with a script, slurm config, args, and environment."""

    script: str = None
    slurm_config: SlurmConfig = None
    args: list[str] = None
    environment: list[dict] = None
    yaml_file: str = None


@dataclass
class SandboxTask0(SandboxTask):
    """Task slot 0 in a pipeline."""


@dataclass
class SandboxTask1(SandboxTask):
    """Task slot 1 in a pipeline."""


@dataclass
class SandboxTask2(SandboxTask):
    """Task slot 2 in a pipeline."""


@dataclass
class SandboxTask3(SandboxTask):
    """Task slot 3 in a pipeline."""


@dataclass
class SandboxTask4(SandboxTask):
    """Task slot 4 in a pipeline."""


def create_task_from_yaml(yaml_file: str) -> SandboxTask:
    """Create a SandboxTask from a YAML config file."""
    with open(yaml_file) as file:
        config_from_yaml = yaml.safe_load(file)

    script = config_from_yaml["script"]
    function_name = config_from_yaml["slurm_config"].pop("_factory_")
    slurm_config = globals()[function_name](**config_from_yaml["slurm_config"])
    args = config_from_yaml.get("args", None)
    environment = config_from_yaml.get("environment", None)

    return SandboxTask(script=script, slurm_config=slurm_config, args=args, environment=environment)


@dataclass
class GlobalVariables:
    """Shared variables for <<global_vars.X>> interpolation in pipeline YAMLs."""

    hf_model: str = None
    hf_data: str = None


@dataclass
class SandboxPipeline:
    """A multi-task pipeline with shared global variables and task dependencies."""

    global_vars: GlobalVariables = None

    task_0: SandboxTask0 = None
    task_1: SandboxTask1 = None
    task_2: SandboxTask2 = None
    task_3: SandboxTask3 = None
    task_4: SandboxTask4 = None
    tasks: list[SandboxTask] = None

    test_level: int = 0
    allow_to_fail: bool = False
    skip: bool = False
    note: str = ""
    task_configs: list[str] = None
    experiment = None

    def __post_init__(self):
        if self.tasks is None:
            self.tasks = []
            for i in range(5):
                task = getattr(self, "task_{}".format(i), None)
                if task is not None:
                    self.tasks += [task]
        if self.task_configs is not None:
            self.tasks += [
                create_task_from_yaml(yaml_file=yaml_file) for yaml_file in self.task_configs
            ]

        if self.global_vars is not None:
            global_vars_dict = {
                k: v for k, v in dataclasses.asdict(self.global_vars).items() if v is not None
            }

            def _resolve(s):
                if not isinstance(s, str):
                    return s
                return re.sub(
                    r"<<global_vars\.(\w+)>>",
                    lambda m: global_vars_dict.get(m.group(1), m.group(0)),
                    s,
                )

            for task in self.tasks:
                if task.environment:
                    if isinstance(task.environment, list):
                        task.environment = [
                            {k: _resolve(v) for k, v in item.items()} for item in task.environment
                        ]
                    else:
                        task.environment = {k: _resolve(v) for k, v in task.environment.items()}
                if task.args:
                    task.args = [_resolve(a) for a in task.args]


# ---------------------------------------------------------------------------
# Code packager — sync only the necessary source trees to the cluster
# ---------------------------------------------------------------------------

# Resolve paths relative to Model-Optimizer root (parent of launcher/)
LAUNCHER_DIR = os.path.dirname(os.path.abspath(__file__))
MODELOPT_ROOT = os.path.dirname(LAUNCHER_DIR)

# All paths relative to LAUNCHER_DIR so code/ mirrors the launcher directory.
# This produces the same layout as nmm-sandbox's slurm.py:
#   code/modules/Megatron-LM/megatron/...
#   code/modules/Model-Optimizer/modelopt/...
#   code/services/...
packager = run.PatternPackager(
    include_pattern=[
        "modules/Megatron-LM/megatron/*",
        "modules/Megatron-LM/examples/*",
        "modules/Megatron-LM/*.py",
        "modules/Model-Optimizer/modelopt/*",
        "modules/Model-Optimizer/examples/*",
        "services/*",
        "tests/*",
    ],
    relative_path=[LAUNCHER_DIR] * 7,
)


# ---------------------------------------------------------------------------
# Executor builders
# ---------------------------------------------------------------------------


def get_slurm_executor(user, identity, slurm_config, experiment_id, job_dir, task_name):
    """Build a SlurmExecutor for remote job submission."""
    container_mounts = slurm_config.container_mounts or []

    scratch_dst = "/scratchspace"
    scratch_src = job_dir + "/cicd/" + experiment_id
    modelopt_dst = slurm_config.modelopt_install_path
    modelopt_src = (
        job_dir
        + "/cicd/"
        + experiment_id
        + "/{}/code/modules/Model-Optimizer/modelopt".format(task_name)
    )
    container_mounts = [
        *container_mounts,
        scratch_src + ":" + scratch_dst,
        modelopt_src + ":" + modelopt_dst,
    ]

    tunnel = run.SSHTunnel(
        host=slurm_config.host,
        user=getpass.getuser() if user is None else user,
        port=slurm_config.port,
        job_dir=job_dir,
        identity=identity,
    )

    executor = run.SlurmExecutor(
        account=slurm_config.account,
        partition=slurm_config.partition,
        ntasks_per_node=slurm_config.ntasks_per_node,
        gpus_per_node=slurm_config.gpus_per_node,
        nodes=slurm_config.nodes,
        tunnel=tunnel,
        container_image=slurm_config.container,
        container_mounts=container_mounts,
        array=slurm_config.array,
        time="04:00:00",
        mem="0",
        retries=0,
        packager=packager,
        srun_args=slurm_config.srun_args,
    )
    return executor


def get_docker_executor(hf_local, slurm_config, experiment_id, job_dir, task_name):
    """Build a DockerExecutor for local GPU jobs."""
    if slurm_config.local:
        container_mounts = list(slurm_config.container_mounts or [])
    else:
        container_mounts = []
    container_mounts += [hf_local + ":/hf-local", job_dir + "/cicd:/cicd"]

    scratch_dst = "/scratchspace"
    scratch_src = job_dir + "/cicd/" + experiment_id + "/" + task_name
    modelopt_dst = slurm_config.modelopt_install_path
    modelopt_src = os.path.join(LAUNCHER_DIR, "modules/Model-Optimizer/modelopt")
    container_mounts += [scratch_src + ":" + scratch_dst, modelopt_src + ":" + modelopt_dst]

    executor = run.DockerExecutor(
        num_gpus=-1,
        runtime="nvidia",
        ipc_mode="host",
        container_image=slurm_config.container,
        volumes=container_mounts,
        additional_kwargs={"user": "{}:{}".format(os.getuid(), os.getgid())},
        packager=packager,
    )
    return executor


# ---------------------------------------------------------------------------
# Main entrypoint
# ---------------------------------------------------------------------------


@run.cli.entrypoint
def launch(
    ctx: run.cli.RunContext,
    job_name: str = "01_job",
    job_dir: str = os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")),
    task: SandboxTask | None = None,
    pipeline: SandboxPipeline | None = None,
    hf_local: str | None = None,
    user: str = getpass.getuser(),
    identity: str | None = None,
) -> None:
    """Launch ModelOpt jobs on Slurm or locally with Docker.

    Args:
        job_name: Name of the job.
        job_dir: Remote directory for job artifacts.
        task: Single task config (from YAML).
        pipeline: Multi-task pipeline config (from YAML).
        hf_local: Path to local HF cache (enables local Docker execution).
        user: SSH user for Slurm tunnel.
        identity: SSH identity file for Slurm tunnel.
    """
    if "NEMORUN_HOME" not in os.environ:
        warnings.warn("NEMORUN_HOME is not set. Defaulting to current working directory.")
    run.config.set_nemorun_home(os.environ.get("NEMORUN_HOME", os.getcwd()))

    if hf_local is not None:
        job_dir = os.getcwd() + "/experiments"

    job_table = {}

    if task is not None:
        job_table[job_name] = SandboxPipeline(tasks=[task])
    elif pipeline is not None:
        job_table[job_name] = pipeline
    else:
        print("No task or pipeline provided. Use task=@<yaml> or pipeline=@<yaml>.")
        return

    for job_name, job in job_table.items():  # noqa: PLR1704
        if job.skip:
            continue

        dependency = None
        exp = run.Experiment("modelopt", log_level="INFO")
        job.experiment = exp

        with exp:
            for task_id, task in enumerate(job.tasks):  # noqa: PLR1704
                task_name = job_name + "_" + str(task_id)
                task_args = [] if task.args is None else task.args

                task_env = {}
                if task.environment is not None:
                    if isinstance(task.environment, list):
                        for item in task.environment:
                            task_env.update(item.items())
                    else:
                        task_env = task.environment
                for k, v in task_env.items():
                    task_env[k] = "" if v is None else str(v)
                if hf_local is not None:
                    executor = get_docker_executor(
                        hf_local, task.slurm_config, exp._id, job_dir, task_name
                    )
                    task_env.update(DEFAULT_LOCAL_ENV)
                else:
                    executor = get_slurm_executor(
                        user, identity, task.slurm_config, exp._id, job_dir, task_name
                    )
                    task_env.update(DEFAULT_SLURM_ENV)

                task_instance = run.Script(task.script, args=task_args, env=task_env)
                print(
                    "job {} task {} slurm_config: {}".format(job_name, task_id, task.slurm_config)
                )

                if dependency is None:
                    dependency = exp.add(
                        task_instance, tail_logs=True, name=task_name, executor=executor
                    )
                else:
                    dependency = exp.add(
                        task_instance,
                        tail_logs=True,
                        name=task_name,
                        executor=executor,
                        dependencies=[dependency],
                    )

            exp.run(detach=ctx.detach)

        # Write metadata for downstream tools
        metadata = {
            "experiment_id": exp._id,
            "job_name": job_name,
            "allow_to_fail": job.allow_to_fail,
            "note": job.note,
        }
        metadata_path = os.path.join("experiments", "modelopt", exp._id, "metadata.json")
        os.makedirs(os.path.dirname(metadata_path), exist_ok=True)
        with open(metadata_path, "w") as f:
            json.dump(metadata, f)


if __name__ == "__main__":
    run.cli.main(launch)

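For orientation, a minimal sketch of the objects this entrypoint consumes, mirroring the single-task path above. Field and parameter names are inferred from their use in this PR (SandboxTask/SandboxPipeline, slurm_factory), and the imports assume the refactored core/slurm_config modules introduced later in this PR; this is illustrative only and not part of the file.

from core import SandboxPipeline, SandboxTask
from slurm_config import slurm_factory

# One quantization task, following the Qwen3-8B example config in this PR.
task = SandboxTask(
    script="services/megatron-lm/quantize/quantize.sh",
    args=["--calib-dataset-path-or-name /hf-local/abisee/cnn_dailymail"],
    environment={"HF_HOME": "/hf-local"},            # optional per-task overrides
    slurm_config=slurm_factory(nodes=1, gpus_per_node=8),
)

# launch() wraps a bare task into a one-element pipeline in exactly this way.
pipeline = SandboxPipeline(tasks=[task])
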
=== Check if launcher references Megatron-LM ===
./launcher/launch.py:#   code/modules/Megatron-LM/megatron/...
./launcher/launch.py:        "modules/Megatron-LM/megatron/*",
./launcher/launch.py:        "modules/Megatron-LM/examples/*",
./launcher/launch.py:        "modules/Megatron-LM/*.py",

=== Check launcher __init__.py ===
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ModelOpt Launcher — submit quantization, training, and evaluation jobs to Slurm clusters."""

=== Check if Megatron-LM submodule is actually imported/used ===
No references to submodule path found

=== Check recent commit on the fork branch ===
1e064f361256f34bf179c0cb808fd6287538f85a	refs/heads/aanoosheh/modelopt-example-import-order

Clarify the rationale for using a personal fork of Megatron-LM in the launcher submodule.

The submodule points to https://github.com/AAnoosheh/Megatron-LM.git (a personal fork) rather than the official NVIDIA repository. The launcher's packager includes this submodule to sync code to clusters for job execution. While the fork branch (aanoosheh/modelopt-example-import-order) suggests intentional, scoped changes, the rationale is undocumented.

Please provide:

  1. What specific changes in the fork are necessary for the launcher, and why they cannot be in the official NVIDIA/Megatron-LM?
  2. Is there a plan to upstream these changes or transition to the official repository?
  3. How is this fork maintained and kept in sync with upstream security patches?

Without this documentation, this dependency represents a maintainability and supply chain risk. If these changes are general improvements, they should be contributed back to the official repository.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/modules/Megatron-LM` at line 1, Document why the launcher submodule
uses the personal fork https://github.com/AAnoosheh/Megatron-LM.git (branch
aanuosh…/modelopt-example-import-order) by adding a short rationale in the repo
(e.g., README or launcher packager docs): list the exact diffs/changes required
from the fork that the launcher depends on (identify files/functions/patches),
explain why those changes cannot live in NVIDIA/Megatron-LM as-is, state whether
and when the changes will be upstreamed or removed (upstream PR IDs or
timeline), and describe the maintenance plan for this fork (how it will be
rebased/merged with upstream security fixes and who owns it). Ensure references
to the submodule URL, branch name, and the launcher packager are included so
reviewers can locate the dependency and the specific modifications.

Extract shared logic (dataclasses, executor builders, run loop, version
reporting) into core.py. Both launch.py and nmm-sandbox's slurm.py
import from core.py to avoid divergence. Add slurm_config.py with
generic env-var-driven factory, service scripts, Qwen3-8B PTQ example,
and README with usage, flags, and bug reporting instructions.

Verified: the same YAML produces identical MMLU scores through both slurm.py and launch.py (0.736 on OCI-HSG, 0.719 locally).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (7)
launcher/services/megatron-lm/quantize/quantize.sh (2)

46-47: exit_handler is called with an argument it doesn't accept.

exit_handler in service_utils.sh takes no parameters, but it's called with $0. This is harmless but inconsistent with its definition.

🔧 Suggested fix
 # This function handles the exit status (fails the CI).
-exit_handler $0
+exit_handler
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 46 - 47, The
call to exit_handler in quantize.sh passes an unused argument ($0) even though
exit_handler (defined in service_utils.sh) accepts no parameters; update the
call in quantize.sh to invoke exit_handler with no arguments (replace
"exit_handler $0" with "exit_handler") so it matches the function signature and
removes the inconsistency while keeping behavior unchanged.

33-36: Remove or document unused CONVERT_EXE and EXPORT_EXE.

These variables are defined but never used. If they're placeholders for future workflow steps, add a comment; otherwise remove them to reduce confusion.

🔧 Suggested fix
 QUANTIZE_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/quantize.sh"
 MMLU_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/mmlu.sh"
-CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
-EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
+# TODO: Add convert and export steps
+# CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
+# EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 33 - 36,
CONVERT_EXE and EXPORT_EXE are defined but never used; either remove their
definitions or document them as placeholders. Edit the file to either delete the
lines defining CONVERT_EXE and EXPORT_EXE, or add a brief comment above them
stating they are intentional placeholders for future steps (e.g., "placeholder
for future convert/export scripts") and keep the variables if needed; leave
QUANTIZE_EXE and MMLU_EXE unchanged. Ensure the chosen change keeps the script
consistent (no unused vars unless explicitly documented).
launcher/services/service_utils.sh (2)

24-25: Remove or use the FAIL variable.

FAIL is set in error_handler (line 34) but never read—only FAIL_EXIT is used. Either remove FAIL or export it if external scripts rely on it.

🔧 Suggested fix
-FAIL=0
 FAIL_EXIT=0

Or if needed externally:

-FAIL=0
+export FAIL=0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/service_utils.sh` around lines 24 - 25, The variable FAIL
is assigned but never read; update the code path around the error_handler and
variable declarations to either remove the unused FAIL variable or
export/consume it where intended: if other scripts rely on it, export FAIL
(e.g., export FAIL) and ensure error_handler sets it; otherwise delete FAIL and
only keep FAIL_EXIT. Locate the declarations FAIL and FAIL_EXIT and the
error_handler function to make the change.

50-54: Pin the diskcache version for reproducibility.

Installing diskcache without a version specifier can lead to non-deterministic builds and potential compatibility issues.

🔧 Suggested fix
 function util_install_extra_dep {
     if [[ "$mpi_local_rank" -eq 0 ]]; then
-        pip install diskcache
+        pip install 'diskcache>=5.6,<6.0'
     fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/service_utils.sh` around lines 50 - 54, The pip install
call in function util_install_extra_dep currently installs diskcache without a
version; update util_install_extra_dep to pin diskcache to a specific, tested
version (e.g., replace the pip install diskcache invocation with a pinned
install like pip install diskcache==<VERSION>) or read the version from a single
constant/variable (e.g., DISKCACHE_VERSION) so builds are deterministic; keep
the mpi_local_rank guard unchanged.
launcher/pyproject.toml (1)

6-9: Consider pinning pyyaml version for reproducibility.

nemo-run is pinned to a specific commit hash, but pyyaml is unpinned. For consistent, reproducible builds, consider pinning to a specific version (e.g., pyyaml>=6.0,<7.0 or an exact version).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/pyproject.toml` around lines 6 - 9, The dependency list in
pyproject.toml pins "nemo-run" to a commit but leaves "pyyaml" unpinned, risking
non-reproducible builds; update the dependencies array to pin "pyyaml" to a
specific range or exact version (for example change the "pyyaml" entry in the
dependencies list to something like "pyyaml>=6.0,<7.0" or an exact
"pyyaml==6.0.1") so that the dependencies block in pyproject.toml
deterministically resolves.
launcher/slurm_config.py (2)

58-61: Mutable default arguments in function signature.

container_mounts and srun_args use list literals as default values. In Python, mutable defaults are evaluated once at function definition time and shared across all calls, which can lead to unexpected behavior if the lists are modified.

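For illustration, a tiny standalone snippet of the shared-state pitfall (hypothetical names, unrelated to the launcher code):

def add_mount(mount, mounts=[]):      # the default list is created once, at definition time
    mounts.append(mount)
    return mounts

print(add_mount("a:/a"))  # ['a:/a']
print(add_mount("b:/b"))  # ['a:/a', 'b:/b']  <- state leaks from the first call

def add_mount_safe(mount, mounts=None):
    if mounts is None:                # a None sentinel gives every call its own fresh list
        mounts = []
    mounts.append(mount)
    return mounts
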
🔧 Suggested fix
 def slurm_factory(
     host: str = os.environ.get("SLURM_HOST", ""),
     account: str = os.environ.get("SLURM_ACCOUNT", ""),
     partition: str = "batch",
     nodes: int = 1,
     ntasks_per_node: int = 1,
     gpus_per_node: int = 1,
     container: str = "nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5",
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt",
-    container_mounts: list[str] = [
-        "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
-    ],
-    srun_args: list[str] = ["--no-container-mount-home"],
+    container_mounts: list[str] | None = None,
+    srun_args: list[str] | None = None,
     array: str = None,  # noqa: RUF013
 ) -> SlurmConfig:
     """Generic Slurm factory — configure via environment variables or CLI overrides."""
+    if container_mounts is None:
+        container_mounts = [
+            "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
+        ]
+    if srun_args is None:
+        srun_args = ["--no-container-mount-home"]
     return SlurmConfig(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/slurm_config.py` around lines 58 - 61, The defaults for
container_mounts and srun_args are currently mutable list literals; change their
default values to None in the function or constructor signature (e.g.,
container_mounts: Optional[list[str]] = None, srun_args: Optional[list[str]] =
None) and inside the function initialize them to fresh lists if None (create the
container_mounts list using
"{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")) and set
srun_args to ["--no-container-mount-home"]) so each call gets its own list and
avoids shared mutable state; update any references to these parameters
accordingly (look for container_mounts and srun_args in the function/class where
they are defined).

32-44: Type hints don't match default values.

Several fields have type hints (e.g., str, list[str]) but default to None. Consider using union types for clarity:

🔧 Suggested fix
 @dataclass
 class SlurmConfig:
-    host: str = None
+    host: str | None = None
     port: int = 22
-    account: str = None
+    account: str | None = None
     partition: str = "batch"
-    container: str = None
+    container: str | None = None
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt"
-    container_mounts: list[str] = None
-    srun_args: list[str] = None
-    array: str = None
+    container_mounts: list[str] | None = None
+    srun_args: list[str] | None = None
+    array: str | None = None
     nodes: int = 1
     ntasks_per_node: int = 1
     gpus_per_node: int = 1
     local: bool = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/slurm_config.py` around lines 32 - 44, The annotated types for
several dataclass/config fields (host, account, container, container_mounts,
srun_args, array) conflict with their None defaults; update their type hints to
allow None (e.g., use Optional[str] for host/account/container/array and
Optional[list[str]] for container_mounts/srun_args) and add the necessary import
from typing (Optional) or use union syntax (str | None, list[str] | None) so the
annotation matches the default values; keep the existing defaults (None) and
leave other fields (port, nodes, ntasks_per_node, gpus_per_node,
modelopt_install_path, local, partition) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@launcher/core.py`:
- Around line 313-329: Remove the inline "# nosec" comments: delete the "# nosec
B404" on the "import subprocess" line and the "# nosec B603 B607" comments on
the subprocess.run calls that set the commit and branch variables (the lines
invoking git rev-parse for commit and branch). Keep the subprocess calls as-is
(list args, no shell=True, capture_output/text/timeout) and instead add to your
PR description or a nearby code comment a request for formal review by
`@NVIDIA/modelopt-setup-codeowners` with documented justification for why these
subprocess usages are safe.

In `@launcher/services/megatron-lm/quantize/quantize.sh`:
- Line 38: The assignment to MLM_EXTRA_ARGS incorrectly expands "${@}" into a
scalar; change it to either capture all args as a single string using "$*" (if
you want one string) or make MLM_EXTRA_ARGS an array by declaring
MLM_EXTRA_ARGS=("$@") (if you need separate elements), and update subsequent
uses of MLM_EXTRA_ARGS accordingly so they treat it as a string or an array as
intended.

In `@launcher/services/service_utils.sh`:
- Around line 59-62: Replace the runtime append of __version__ from the script:
stop modifying modules/Model-Optimizer/modelopt/__init__.py at runtime and
instead write/read a dedicated version file (e.g.,
modules/Model-Optimizer/VERSION) or inject the version at build-time; remove the
block that checks mpi_local_rank and echoes "__version__ = '1.0.0'"; implement a
safe creation/update step that checks for an existing VERSION file (or existing
__version__ entry) before writing and ensure the write happens in a
build/pre-deploy step or by a single global coordinator (not per-node
mpi_local_rank==0) to avoid race conditions and git dirty state.

---

Nitpick comments:
In `@launcher/pyproject.toml`:
- Around line 6-9: The dependency list in pyproject.toml pins "nemo-run" to a
commit but leaves "pyyaml" unpinned, risking non-reproducible builds; update the
dependencies array to pin "pyyaml" to a specific range or exact version (for
example change the "pyyaml" entry in the dependencies list to something like
"pyyaml>=6.0,<7.0" or an exact "pyyaml==6.0.1") so that the dependencies block
in pyproject.toml deterministically resolves.

In `@launcher/services/megatron-lm/quantize/quantize.sh`:
- Around line 46-47: The call to exit_handler in quantize.sh passes an unused
argument ($0) even though exit_handler (defined in service_utils.sh) accepts no
parameters; update the call in quantize.sh to invoke exit_handler with no
arguments (replace "exit_handler $0" with "exit_handler") so it matches the
function signature and removes the inconsistency while keeping behavior
unchanged.
- Around line 33-36: CONVERT_EXE and EXPORT_EXE are defined but never used;
either remove their definitions or document them as placeholders. Edit the file
to either delete the lines defining CONVERT_EXE and EXPORT_EXE, or add a brief
comment above them stating they are intentional placeholders for future steps
(e.g., "placeholder for future convert/export scripts") and keep the variables
if needed; leave QUANTIZE_EXE and MMLU_EXE unchanged. Ensure the chosen change
keeps the script consistent (no unused vars unless explicitly documented).

In `@launcher/services/service_utils.sh`:
- Around line 24-25: The variable FAIL is assigned but never read; update the
code path around the error_handler and variable declarations to either remove
the unused FAIL variable or export/consume it where intended: if other scripts
rely on it, export FAIL (e.g., export FAIL) and ensure error_handler sets it;
otherwise delete FAIL and only keep FAIL_EXIT. Locate the declarations FAIL and
FAIL_EXIT and the error_handler function to make the change.
- Around line 50-54: The pip install call in function util_install_extra_dep
currently installs diskcache without a version; update util_install_extra_dep to
pin diskcache to a specific, tested version (e.g., replace the pip install
diskcache invocation with a pinned install like pip install
diskcache==<VERSION>) or read the version from a single constant/variable (e.g.,
DISKCACHE_VERSION) so builds are deterministic; keep the mpi_local_rank guard
unchanged.

In `@launcher/slurm_config.py`:
- Around line 58-61: The defaults for container_mounts and srun_args are
currently mutable list literals; change their default values to None in the
function or constructor signature (e.g., container_mounts: Optional[list[str]] =
None, srun_args: Optional[list[str]] = None) and inside the function initialize
them to fresh lists if None (create the container_mounts list using
"{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")) and set
srun_args to ["--no-container-mount-home"]) so each call gets its own list and
avoids shared mutable state; update any references to these parameters
accordingly (look for container_mounts and srun_args in the function/class where
they are defined).
- Around line 32-44: The annotated types for several dataclass/config fields
(host, account, container, container_mounts, srun_args, array) conflict with
their None defaults; update their type hints to allow None (e.g., use
Optional[str] for host/account/container/array and Optional[list[str]] for
container_mounts/srun_args) and add the necessary import from typing (Optional)
or use union syntax (str | None, list[str] | None) so the annotation matches the
default values; keep the existing defaults (None) and leave other fields (port,
nodes, ntasks_per_node, gpus_per_node, modelopt_install_path, local, partition)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56d56135-8c67-4a05-b578-e10120a5789f

📥 Commits

Reviewing files that changed from the base of the PR and between d28acd3 and f3d3020.

📒 Files selected for processing (8)
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
  • launcher/README.md
  • launcher/core.py
  • launcher/launch.py
  • launcher/pyproject.toml
  • launcher/services/megatron-lm/quantize/quantize.sh
  • launcher/services/service_utils.sh
  • launcher/slurm_config.py
✅ Files skipped from review due to trivial changes (1)
  • launcher/README.md

Comment on lines +313 to +329
    import subprocess  # nosec B404

    try:
        commit = subprocess.run(  # nosec B603 B607
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=path,
            capture_output=True,
            text=True,
            timeout=5,
        ).stdout.strip()
        branch = subprocess.run(  # nosec B603 B607
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=path,
            capture_output=True,
            text=True,
            timeout=5,
        ).stdout.strip()
Contributor

⚠️ Potential issue | 🟠 Major

Remove # nosec comments per coding guidelines.

The coding guidelines state: "# nosec comments are not allowed as bypasses for security checks; contributors must request formal review from @NVIDIA/modelopt-setup-codeowners with documented justification instead."

The subprocess calls here are safe (list arguments, no shell=True, hardcoded commands), but the bypass comments should be removed and the pattern formally reviewed.

🔧 Suggested fix
 def _git_info(path):
     """Get git commit hash and branch for a directory."""
-    import subprocess  # nosec B404
+    import subprocess

     try:
-        commit = subprocess.run(  # nosec B603 B607
+        commit = subprocess.run(
             ["git", "rev-parse", "--short", "HEAD"],
             cwd=path,
             capture_output=True,
             text=True,
             timeout=5,
         ).stdout.strip()
-        branch = subprocess.run(  # nosec B603 B607
+        branch = subprocess.run(
             ["git", "rev-parse", "--abbrev-ref", "HEAD"],
             cwd=path,
             capture_output=True,
             text=True,
             timeout=5,
         ).stdout.strip()

As per coding guidelines: "Disallow use of '# nosec' comments to bypass Bandit security checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 313 - 329, Remove the inline "# nosec"
comments: delete the "# nosec B404" on the "import subprocess" line and the "#
nosec B603 B607" comments on the subprocess.run calls that set the commit and
branch variables (the lines invoking git rev-parse for commit and branch). Keep
the subprocess calls as-is (list args, no shell=True,
capture_output/text/timeout) and instead add to your PR description or a nearby
code comment a request for formal review by `@NVIDIA/modelopt-setup-codeowners`
with documented justification for why these subprocess usages are safe.

CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"

export MLM_EXTRA_ARGS=${@}
Contributor

⚠️ Potential issue | 🟡 Minor

Fix array-to-string assignment.

${@} expands to separate words but is assigned to a scalar. Use "$*" if a single string is intended, or declare MLM_EXTRA_ARGS as an array.

🔧 Suggested fix (if string is intended)
-export MLM_EXTRA_ARGS=${@}
+export MLM_EXTRA_ARGS="$*"
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 38-38: Assigning an array to a string! Assign as array, or use * instead of @ to concatenate.

(SC2124)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/megatron-lm/quantize/quantize.sh` at line 38, The
assignment to MLM_EXTRA_ARGS incorrectly expands "${@}" into a scalar; change it
to either capture all args as a single string using "$*" (if you want one
string) or make MLM_EXTRA_ARGS an array by declaring MLM_EXTRA_ARGS=("$@") (if
you need separate elements), and update subsequent uses of MLM_EXTRA_ARGS
accordingly so they treat it as a string or an array as intended.

Comment on lines +59 to +62
# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
fi
Contributor

⚠️ Potential issue | 🟡 Minor

Runtime modification of __init__.py is fragile and may cause race conditions.

Appending __version__ to the modelopt __init__.py at runtime:

  1. Modifies source files within a submodule, which can cause git dirty state
  2. May race with other MPI ranks despite the mpi_local_rank == 0 guard (multiple nodes each have a local rank 0)
  3. Doesn't check if __version__ already exists, causing duplicate entries on repeated runs

Consider moving version management to build-time or using a separate version file that's explicitly managed.

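A minimal sketch of the file-based alternative, assuming a dedicated VERSION file is acceptable (the path and version value here are illustrative, not part of this PR):

from pathlib import Path

version_file = Path("modules/Model-Optimizer/VERSION")
version_file.parent.mkdir(parents=True, exist_ok=True)
if not version_file.exists():          # idempotent: repeated runs never duplicate entries
    version_file.write_text("1.0.0\n")
print(version_file.read_text().strip())
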
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/service_utils.sh` around lines 59 - 62, Replace the runtime
append of __version__ from the script: stop modifying
modules/Model-Optimizer/modelopt/__init__.py at runtime and instead write/read a
dedicated version file (e.g., modules/Model-Optimizer/VERSION) or inject the
version at build-time; remove the block that checks mpi_local_rank and echoes
"__version__ = '1.0.0'"; implement a safe creation/update step that checks for
an existing VERSION file (or existing __version__ entry) before writing and
ensure the write happens in a build/pre-deploy step or by a single global
coordinator (not per-node mpi_local_rank==0) to avoid race conditions and git
dirty state.

Collaborator

Thoughts on making the top-level folder name something generic like tools, inside which we'd have launcher?

@@ -0,0 +1,13 @@
script: services/megatron-lm/quantize/quantize.sh
args:
- --calib-dataset-path-or-name /hf-local/abisee/cnn_dailymail
Collaborator

Can we mention in the README what hf-local is? Or, for simplicity, we could just use the HF path abisee/cnn_dailymail.


# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
    echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
Collaborator

Why do we need to modify the ModelOpt version number?

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (4)
launcher/core.py (4)

432-455: ⚠️ Potential issue | 🟠 Major

Preserve task-level env overrides when applying defaults.

On Line 443 and Line 455, defaults overwrite task-specific values. Reverse the merge precedence so YAML task env wins.

Suggested change
-                    task_env.update(default_local_env)
+                    task_env = {**default_local_env, **task_env}
@@
-                    task_env.update(default_slurm_env)
+                    task_env = {**default_slurm_env, **task_env}
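
A quick standalone illustration of why the merge order matters (values are hypothetical):

default_slurm_env = {"HF_HOME": "/hf-cache", "NCCL_DEBUG": "WARN"}
task_env = {"HF_HOME": "/my-task-cache"}

merged = {**default_slurm_env, **task_env}
print(merged["HF_HOME"])     # /my-task-cache: the task-level (YAML) value wins
print(merged["NCCL_DEBUG"])  # WARN: defaults still fill in keys the task did not set
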
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 432 - 455, The current merges call
task_env.update(default_local_env) / task_env.update(default_slurm_env) which
lets defaults overwrite task-specific YAML values; instead, apply defaults first
then layer task-level env on top so task YAML wins—e.g., obtain the appropriate
default dict for the branch (for docker: default_local_env; for slurm:
default_slurm_env), copy/merge defaults first and then update with task_env (or
update a copy of defaults with task_env) before passing to build_docker_executor
/ build_slurm_executor; target the hf_local branch and the calls around
build_docker_executor, build_slurm_executor and the task_env.update(...) lines.

482-485: ⚠️ Potential issue | 🟠 Major

Write metadata under the configured NeMo Run home, not CWD-relative.

metadata.json is currently written to a relative path on Line 482, which can diverge from the configured NEMORUN_HOME set by the launcher.

Suggested change
-def run_jobs(
+def run_jobs(
     job_table,
@@
     modelopt_src_path=None,
     base_dir=None,
+    nemorun_home=None,
 ):
@@
-        metadata_path = os.path.join("experiments", experiment_title, exp._id, "metadata.json")
+        root = nemorun_home or os.environ.get("NEMORUN_HOME", os.getcwd())
+        metadata_path = os.path.join(root, "experiments", experiment_title, exp._id, "metadata.json")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 482 - 485, The code writes metadata.json to a
CWD-relative path using metadata_path built from "experiments", experiment_title
and exp._id; change it to write under the configured NeMo Run home instead—build
metadata_path by joining the launcher’s configured NeMo Run home variable (e.g.,
NEMORUN_HOME or the instance/home attribute used by the launcher) with
"experiments", experiment_title and exp._id, ensure os.makedirs uses
os.path.dirname(metadata_path) and then json.dump metadata to that path so files
are created under the configured NeMo Run home rather than the current working
directory.

436-450: ⚠️ Potential issue | 🟠 Major

Avoid relying on run.Experiment._id private API.

Using _id couples this code to an internal nemo_run attribute and risks breakage on dependency updates.

In the currently used nemo_run version, what is the supported public API (if any) for retrieving an Experiment identifier after creation, and is `_id` considered internal-only?

Also applies to: 477-482

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 436 - 450, Replace uses of the private
attribute run.Experiment._id with the official public Experiment identifier
accessor from the nemo_run API (for example exp.id or exp.get_id(), whichever
the library exposes) everywhere it's passed (e.g., to build_slurm_executor and
the local executor call sites referenced around build_slurm_executor and the
earlier local executor block). Locate all occurrences of exp._id (including the
other block noted) and switch them to the public accessor; if the library
requires calling a method at creation time, capture and reuse that returned id
instead of reading a private field. Ensure the code compiles and tests that rely
on the experiment id are updated to use the new accessor.

321-332: ⚠️ Potential issue | 🟠 Major

Remove Bandit bypass comments from subprocess calls.

# nosec on Line 321, Line 324, and Line 331 violates repo policy even when the subprocess usage is otherwise safe.

Suggested change
-    import subprocess  # nosec B404
+    import subprocess
@@
-        commit = subprocess.run(  # nosec B603 B607
+        commit = subprocess.run(
@@
-        branch = subprocess.run(  # nosec B603 B607
+        branch = subprocess.run(

As per coding guidelines, "Disallow use of '# nosec' comments to bypass Bandit security checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 321 - 332, Remove the "# nosec" comments on
the subprocess import and the subprocess.run calls (the lines that set commit
and branch), and instead make the calls explicitly safe: keep the argument lists
(["git", "rev-parse", ...]) to avoid shell=True, add check=True to both
subprocess.run calls and handle subprocess.CalledProcessError (or
subprocess.SubprocessError) to log/raise a clear error on failure, and
optionally validate git exists with shutil.which("git") before invoking; update
the code around variables commit and branch to use this checked/run-and-handle
pattern so Bandit suppression comments are no longer needed.
🧹 Nitpick comments (1)
launcher/core.py (1)

133-135: Fail fast with a clear error for unknown/missing YAML factory names.

Raw KeyError from factory_lookup[function_name] is hard to diagnose for config authors.

Suggested change
-    function_name = config_from_yaml["slurm_config"].pop("_factory_")
-    slurm_config = factory_lookup[function_name](**config_from_yaml["slurm_config"])
+    function_name = config_from_yaml["slurm_config"].pop("_factory_", None)
+    if not function_name:
+        raise ValueError(f"Missing '_factory_' in slurm_config for YAML: {yaml_file}")
+    try:
+        factory = factory_lookup[function_name]
+    except KeyError as exc:
+        raise ValueError(f"Unsupported slurm_config factory '{function_name}' in {yaml_file}") from exc
+    slurm_config = factory(**config_from_yaml["slurm_config"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 133 - 135, The code currently assumes
config_from_yaml["slurm_config"]["_factory_"] exists and that
factory_lookup[function_name] will succeed, which yields an opaque KeyError for
config authors; update the block around function_name, factory_lookup, and
slurm_config to explicitly check for the "_factory_" key on
config_from_yaml["slurm_config"] and that function_name is present in
factory_lookup, and if not raise a clear ValueError (or custom exception) that
states the missing or unknown factory name and lists available keys from
factory_lookup so the user can correct their YAML; keep subsequent behavior of
calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and
preserving args = config_from_yaml.get("args", None).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@launcher/core.py`:
- Around line 432-455: The current merges call
task_env.update(default_local_env) / task_env.update(default_slurm_env) which
lets defaults overwrite task-specific YAML values; instead, apply defaults first
then layer task-level env on top so task YAML wins—e.g., obtain the appropriate
default dict for the branch (for docker: default_local_env; for slurm:
default_slurm_env), copy/merge defaults first and then update with task_env (or
update a copy of defaults with task_env) before passing to build_docker_executor
/ build_slurm_executor; target the hf_local branch and the calls around
build_docker_executor, build_slurm_executor and the task_env.update(...) lines.
- Around line 482-485: The code writes metadata.json to a CWD-relative path
using metadata_path built from "experiments", experiment_title and exp._id;
change it to write under the configured NeMo Run home instead—build
metadata_path by joining the launcher’s configured NeMo Run home variable (e.g.,
NEMORUN_HOME or the instance/home attribute used by the launcher) with
"experiments", experiment_title and exp._id, ensure os.makedirs uses
os.path.dirname(metadata_path) and then json.dump metadata to that path so files
are created under the configured NeMo Run home rather than the current working
directory.
- Around line 436-450: Replace uses of the private attribute run.Experiment._id
with the official public Experiment identifier accessor from the nemo_run API
(for example exp.id or exp.get_id(), whichever the library exposes) everywhere
it's passed (e.g., to build_slurm_executor and the local executor call sites
referenced around build_slurm_executor and the earlier local executor block).
Locate all occurrences of exp._id (including the other block noted) and switch
them to the public accessor; if the library requires calling a method at
creation time, capture and reuse that returned id instead of reading a private
field. Ensure the code compiles and tests that rely on the experiment id are
updated to use the new accessor.
- Around line 321-332: Remove the "# nosec" comments on the subprocess import
and the subprocess.run calls (the lines that set commit and branch), and instead
make the calls explicitly safe: keep the argument lists (["git", "rev-parse",
...]) to avoid shell=True, add check=True to both subprocess.run calls and
handle subprocess.CalledProcessError (or subprocess.SubprocessError) to
log/raise a clear error on failure, and optionally validate git exists with
shutil.which("git") before invoking; update the code around variables commit and
branch to use this checked/run-and-handle pattern so Bandit suppression comments
are no longer needed.

---

Nitpick comments:
In `@launcher/core.py`:
- Around line 133-135: The code currently assumes
config_from_yaml["slurm_config"]["_factory_"] exists and that
factory_lookup[function_name] will succeed, which yields an opaque KeyError for
config authors; update the block around function_name, factory_lookup, and
slurm_config to explicitly check for the "_factory_" key on
config_from_yaml["slurm_config"] and that function_name is present in
factory_lookup, and if not raise a clear ValueError (or custom exception) that
states the missing or unknown factory name and lists available keys from
factory_lookup so the user can correct their YAML; keep subsequent behavior of
calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and
preserving args = config_from_yaml.get("args", None).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ce5b1f59-c0be-4a83-9853-3cff952f9271

📥 Commits

Reviewing files that changed from the base of the PR and between f3d3020 and f7f9878.

📒 Files selected for processing (2)
  • launcher/core.py
  • launcher/launch.py

launch.py now only accepts pipeline=@ or --yaml. Update README with
--yaml vs pipeline=@ docs, useful flags, and bug reporting. Update
Qwen3-8B config to new --yaml format with job_name + pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
launcher/launch.py (2)

50-50: ⚠️ Potential issue | 🟠 Major

Default HF_HOME still appears inconsistent with /hf-local mounts.

get_default_env(EXPERIMENT_TITLE) yields HF_HOME=/<title>/hf-cache, while the Slurm factory mount target is /hf-local. That means cache reuse can be missed unless every task overrides HF_HOME.

🔧 Proposed fix
 DEFAULT_SLURM_ENV, DEFAULT_LOCAL_ENV = get_default_env(EXPERIMENT_TITLE)
+DEFAULT_SLURM_ENV["HF_HOME"] = "/hf-local"
+DEFAULT_LOCAL_ENV["HF_HOME"] = "/hf-local"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` at line 50, The default HF_HOME produced by
get_default_env(EXPERIMENT_TITLE) is set to /<title>/hf-cache which does not
match the Slurm factory mount target /hf-local, causing cache reuse to be
missed; update get_default_env (or the DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV
initialization) to set HF_HOME to the mounted path /hf-local (or derive HF_HOME
from the Slurm mount constant) so that DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV
use the same HF_HOME as the Slurm mounts and tasks inherit the consistent cache
location.

52-62: ⚠️ Potential issue | 🟠 Major

services/* packaging may resolve from the wrong base directory.

include_pattern includes services/*, but relative_path is fully anchored to LAUNCHER_DIR. If services/ is at repo root, this silently excludes it from packaging.

🔧 Proposed fix
 packager = run.PatternPackager(
@@
-    relative_path=[LAUNCHER_DIR] * 6,
+    relative_path=[LAUNCHER_DIR] * 5 + [MODELOPT_ROOT],
 )
#!/bin/bash
set -euo pipefail

echo "== services directories in repository =="
fd -HI -td '^services$' .

echo
echo "== Does launcher/services exist? =="
if [ -d launcher/services ]; then
  echo "launcher/services exists"
else
  echo "launcher/services missing"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` around lines 52 - 62, The packaging include_pattern lists
"services/*" while relative_path is set to [LAUNCHER_DIR] * 6, which will
exclude repo-root services if they live outside LAUNCHER_DIR; update the
run.PatternPackager invocation so the relative_path entry that pairs with the
"services/*" pattern points to the correct base (e.g. repo root) instead of
LAUNCHER_DIR — locate the run.PatternPackager call and change the corresponding
element in relative_path (or compute relative_path dynamically) so the
"services/*" pattern resolves from the actual services directory.
🧹 Nitpick comments (2)
launcher/launch.py (2)

74-76: Avoid unconditionally overriding caller-provided job_dir in local mode.

Line 87-Line 89 always rewrites job_dir when hf_local is set, which drops explicit user input.

🔧 Proposed refactor
 def launch(
     job_name: str = "01_job",
-    job_dir: str = os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")),
+    job_dir: str | None = None,
@@
 ) -> None:
@@
-    if hf_local is not None:
-        job_dir = os.path.join(os.getcwd(), "local_experiments")
+    if hf_local is not None:
+        job_dir = job_dir or os.path.join(os.getcwd(), "local_experiments")
+    else:
+        job_dir = job_dir or os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments"))

Also applies to: 87-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` around lines 74 - 76, The current behavior
unconditionally replaces the caller-provided job_dir when hf_local is true;
change this so job_dir is only auto-set when the caller didn't provide one: make
job_dir default to None in the function signature (job_dir: Optional[str] =
None) and initialize it from os.environ.get("SLURM_JOB_DIR",
os.path.expanduser("~/experiments")) only if job_dir is None, and likewise in
the hf_local branch update job_dir only when job_dir is None or empty; update
references to the job_dir variable accordingly (look for the job_dir parameter
and the hf_local handling block).

36-37: Consider using absolute imports for better package reusability.

Lines 36-37 use relative imports (core, slurm_config), which work for the documented script usage (uv run launch.py), but are fragile if the code is ever imported as a module or if entry points are added to pyproject.toml. Since launcher is a proper Python package with package infrastructure, absolute imports with fallback improve maintainability.

The proposed try-except approach is reasonable:

🔧 Suggested approach
-from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
-from slurm_config import SlurmConfig, slurm_factory
+try:
+    from launcher.core import (
+        SandboxPipeline,
+        get_default_env,
+        register_factory,
+        run_jobs,
+        set_slurm_config_type,
+    )
+    from launcher.slurm_config import SlurmConfig, slurm_factory
+except ImportError:
+    # Script-mode fallback (e.g., running from launcher/ directly)
+    from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
+    from slurm_config import SlurmConfig, slurm_factory
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` around lines 36 - 37, Replace the fragile relative
imports in launch.py with absolute-package imports and a fallback to the
relative names: attempt to import SandboxPipeline, get_default_env,
register_factory, run_jobs, set_slurm_config_type from the top-level package
(e.g., launcher.core or package.core) and SlurmConfig, slurm_factory from the
package slurm_config, and if that fails in an ImportError, fall back to the
current relative imports; locate the import line that currently references core
and slurm_config and update it to a try/except ImportError that assigns the same
symbols (SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the
file keep working regardless of whether the module is run as a script or
imported as a package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@launcher/launch.py`:
- Line 50: The default HF_HOME produced by get_default_env(EXPERIMENT_TITLE) is
set to /<title>/hf-cache which does not match the Slurm factory mount target
/hf-local, causing cache reuse to be missed; update get_default_env (or the
DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV initialization) to set HF_HOME to the
mounted path /hf-local (or derive HF_HOME from the Slurm mount constant) so that
DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV use the same HF_HOME as the Slurm mounts
and tasks inherit the consistent cache location.
- Around line 52-62: The packaging include_pattern lists "services/*" while
relative_path is set to [LAUNCHER_DIR] * 6, which will exclude repo-root
services if they live outside LAUNCHER_DIR; update the run.PatternPackager
invocation so the relative_path entry that pairs with the "services/*" pattern
points to the correct base (e.g. repo root) instead of LAUNCHER_DIR — locate the
run.PatternPackager call and change the corresponding element in relative_path
(or compute relative_path dynamically) so the "services/*" pattern resolves from
the actual services directory.

---

Nitpick comments:
In `@launcher/launch.py`:
- Around line 74-76: The current behavior unconditionally replaces the
caller-provided job_dir when hf_local is true; change this so job_dir is only
auto-set when the caller didn't provide one: make job_dir default to None in the
function signature (job_dir: Optional[str] = None) and initialize it from
os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")) only if
job_dir is None, and likewise in the hf_local branch update job_dir only when
job_dir is None or empty; update references to the job_dir variable accordingly
(look for the job_dir parameter and the hf_local handling block).
- Around line 36-37: Replace the fragile relative imports in launch.py with
absolute-package imports and a fallback to the relative names: attempt to import
SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type from the top-level package (e.g., launcher.core or
package.core) and SlurmConfig, slurm_factory from the package slurm_config, and
if that fails in an ImportError, fall back to the current relative imports;
locate the import line that currently references core and slurm_config and
update it to a try/except ImportError that assigns the same symbols
(SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the
file keep working regardless of whether the module is run as a script or
imported as a package.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 07f18aab-eab7-4f3b-812c-8122649ca802

📥 Commits

Reviewing files that changed from the base of the PR and between f7f9878 and ad1f0d8.

📒 Files selected for processing (3)
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
  • launcher/README.md
  • launcher/launch.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • launcher/README.md
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
