add: ModelOpt Launcher for Slurm job submission#1031
Conversation
Add a launcher/ module with launch.py that submits quantization, training, and evaluation jobs to Slurm clusters via nemo-run. Produces the same code/ layout as nmm-sandbox's slurm.py, so the same YAML configs work in both. Includes Megatron-LM and Model-Optimizer as submodules. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
📝 Walkthrough: Adds a ModelOpt Launcher: new launcher package with Slurm config factory, pipeline/task dataclasses, Slurm and local Docker executors, YAML-driven orchestration, packaging integration, service scripts, docs, pyproject, and two git submodules (Megatron‑LM, Model‑Optimizer).
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as User/CLI
    participant Factory as SlurmConfig\nFactory
    participant Pipeline as SandboxPipeline
    participant Packager as PatternPackager
    participant Executor as Executor\n(Slurm/Docker)
    participant Run as NemoRun\nExperiment
    participant Metadata as Metadata\n(files)
    CLI->>Factory: invoke factory (env & overrides)
    Factory-->>CLI: SlurmConfig
    CLI->>Pipeline: load task or pipeline YAML
    Pipeline->>Pipeline: interpolate globals, build job_table
    Pipeline-->>CLI: job_table
    CLI->>Packager: prepare code tarball(s)
    Packager-->>CLI: package artifacts
    loop per task
        CLI->>Executor: build_slurm_executor() or build_docker_executor()
        Executor-->>CLI: configured executor
        CLI->>Run: create Script task with env/args/executor
        Run->>Run: attach to Experiment (dependencies)
    end
    CLI->>Run: run experiment (detach?)
    Run-->>Metadata: write experiment metadata (id, job_dir, note)
    Run-->>CLI: experiment results / status
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes. 🚥 Pre-merge checks: ✅ Passed checks (4 passed).
Comment:
Is there any way to use ModelOpt from the repo root instead of adding an additional submodule?
Codecov Report: ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #1031    +/-   ##
========================================
- Coverage   71.73%   70.09%   -1.65%
========================================
  Files         211      221      +10
  Lines       23949    25541    +1592
========================================
+ Hits        17181    17902     +721
- Misses       6768     7639     +871
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitmodules:
- Around line 4-6: Remove the self-referential submodule entry for
"launcher/modules/Model-Optimizer" in .gitmodules (the block with path =
launcher/modules/Model-Optimizer and url =
https://github.com/NVIDIA/Model-Optimizer.git); instead rely on the top-level
Model-Optimizer working tree for packaging/mounting and remove any code that
updates the nested gitlink for launcher/modules/Model-Optimizer so you no longer
create or expect a nested checkout.
- Around line 1-3: The .gitmodules entry for the submodule at path
"launcher/modules/Megatron-LM" currently points to the personal fork URL
"https://github.com/AAnoosheh/Megatron-LM.git"; update that submodule URL to the
canonical upstream "https://github.com/NVIDIA/Megatron-LM.git" (and then run git
submodule sync && git submodule update --init --recursive) so the
launcher/modules/Megatron-LM submodule references NVIDIA/Megatron-LM instead of
the personal fork.
In `@launcher/launch.py`:
- Around line 87-89: The launcher sets container_mounts to mount the host cache
at /hf-local (variable container_mounts, local hf_local), but the code still
defaults HF_HOME to /hf-cache so the mounted cache is ignored; update places
that set the HF_HOME default (where container_mounts is None and the other
occurrences at the blocks referenced around lines 111-122 and 330-336) to align
HF_HOME with hf_local (i.e., use the hf_local value or "/hf-local" as the
HF_HOME default) so the injected environment points HuggingFace to the actual
mounted path.
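The alignment described above can be sketched as a small helper that derives the HF_HOME default from whether the cache mount is present. The function name, the mount-list shape, and the `/hf-cache` fallback are illustrative assumptions, not the actual launch.py code:

```python
HF_LOCAL_MOUNT = "/hf-local"  # container-side path where the host HF cache is mounted

def hf_home_default(container_mounts):
    """Pick the HF_HOME default so it matches the mounted cache path.

    When the host cache is mounted at /hf-local, HF_HOME must point there;
    otherwise HuggingFace ignores the mount and writes into /hf-cache.
    (Sketch only: the real launcher wires this through its env defaults.)
    """
    if container_mounts:  # e.g. ["/home/user/.cache/huggingface:/hf-local"]
        return HF_LOCAL_MOUNT
    return "/hf-cache"  # no mount configured: keep the in-container default
```

With this shape, every place that injects HF_HOME reads from one helper, so the mount path and the environment default cannot drift apart again.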
- Around line 377-379: The code sets NEMORUN_HOME via
run.config.set_nemorun_home but still writes/reads metadata.json relative to the
process CWD; update all metadata file operations to use the configured NeMo Run
home instead of os.getcwd(): replace os.getcwd()/bare relative paths with
run.config.get_nemorun_home() (or run.config.nemorun_home) when constructing
paths for metadata.json and any related experiment directories (e.g., where
metadata is written around run.config.set_nemorun_home and the similar block at
the other location), ensuring both reads and writes use the same configured
home.
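A minimal sketch of the path fix: anchor metadata.json under the configured NeMo Run home instead of the process CWD. Here `nemorun_home` stands in for the value returned by the `run.config` accessor mentioned above; the helper names are illustrative:

```python
import json
import os

def metadata_path(nemorun_home):
    # Anchor metadata.json under the configured NeMo Run home rather than
    # os.getcwd(), so reads and writes resolve to the same location no matter
    # where the launcher process was started.
    return os.path.join(nemorun_home, "metadata.json")

def write_metadata(nemorun_home, payload):
    path = metadata_path(nemorun_home)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(payload, f)
    return path
```

Both the writer and any reader call `metadata_path()` with the same configured home, which is the invariant the review asks for.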
- Around line 407-425: The issue is that DEFAULT_LOCAL_ENV and DEFAULT_SLURM_ENV
are applied after task_env, causing task-level environment keys to be
overwritten; in the block that builds task_env (around task.environment,
hf_local, get_docker_executor and get_slurm_executor) apply the defaults before
merging task values or merge them using defaults as base (e.g., start with a
copy of DEFAULT_LOCAL_ENV/DEFAULT_SLURM_ENV then update with task.environment)
or use setdefault semantics so that explicit task.environment entries (HF_HOME,
HF_TOKEN, MLM_SKIP_INSTALL, LAUNCH_SCRIPT) override the launcher defaults;
adjust the update order where task_env.update(DEFAULT_LOCAL_ENV) /
task_env.update(DEFAULT_SLURM_ENV) currently occur so task values take
precedence.
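The precedence fix above amounts to a plain dict merge with the launcher defaults as the base layer. The contents of `DEFAULT_SLURM_ENV` here are placeholder values, not the launcher's actual defaults:

```python
# Illustrative defaults; the real DEFAULT_SLURM_ENV lives in launch.py.
DEFAULT_SLURM_ENV = {"HF_HOME": "/hf-cache", "MLM_SKIP_INSTALL": "1"}

def build_task_env(task_environment):
    """Merge so explicit task-level entries win over launcher defaults."""
    env = dict(DEFAULT_SLURM_ENV)  # defaults first, as the base layer
    env.update(task_environment)   # task.environment layered on top, so it overrides
    return env
```

A task that pins HF_HOME now keeps its value instead of having the default clobber it, which is the opposite of what `task_env.update(DEFAULT_SLURM_ENV)` after the task values produces.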
- Around line 171-173: The code currently looks up a callable via globals()
using function_name and invokes it—replace this with an explicit allowlist
factory map (e.g., a dict mapping allowed factory names to their functions) and
use that map to resolve the factory instead of globals(); in the block that sets
function_name and slurm_config, validate that function_name is present in the
factory map, raise a clear error if not allowed, then call the resolved factory
with slurm_config (preserving the existing use of config_from_yaml["script"],
function_name, and slurm_config variables) to produce the slurm config.
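An allowlist factory map replacing the `globals()` lookup could look like the sketch below. The factory names and their bodies are hypothetical stand-ins; only the lookup-and-validate pattern is the point:

```python
# Hypothetical factories standing in for the real Slurm config factories in launch.py.
def draco_config(cfg):
    return {"cluster": "draco", **cfg}

def eos_config(cfg):
    return {"cluster": "eos", **cfg}

# Explicit allowlist: only these names can be resolved from YAML.
SLURM_CONFIG_FACTORIES = {
    "draco_config": draco_config,
    "eos_config": eos_config,
}

def resolve_factory(function_name):
    try:
        return SLURM_CONFIG_FACTORIES[function_name]
    except KeyError:
        raise ValueError(
            f"Unknown slurm config factory {function_name!r}; "
            f"allowed: {sorted(SLURM_CONFIG_FACTORIES)}"
        ) from None

# As in the existing flow: resolve by name, then call with the config overrides.
slurm_config = resolve_factory("eos_config")({"account": "modelopt"})
```

Unlike `globals()[function_name]`, an unlisted name fails fast with a clear error instead of invoking whatever happens to share that name in module scope.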
- Around line 250-270: packager is using LAUNCHER_DIR for all include_pattern
entries, causing "services/*" and "tests/*" (which live at the repository root /
MODELOPT_ROOT) to be skipped; update the packager call so the relative_path list
maps the first five patterns to LAUNCHER_DIR and the last two patterns
("services/*", "tests/*") to MODELOPT_ROOT (i.e., make relative_path =
[LAUNCHER_DIR]*5 + [MODELOPT_ROOT]*2) so the include_pattern entries resolve to
the correct directories; keep the include_pattern list unchanged and only adjust
the relative_path mapping in the packager invocation.
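The corrected mapping can be sketched as below. `LAUNCHER_DIR` is hard-coded here as a stand-in for the `dirname(__file__)` computation in launch.py, and the `run.PatternPackager` call is left as a comment since nemo_run is not imported in this sketch:

```python
import os

# Stand-ins for the paths computed at the top of launch.py.
LAUNCHER_DIR = "/repo/launcher"
MODELOPT_ROOT = os.path.dirname(LAUNCHER_DIR)  # the repository root

include_pattern = [
    "modules/Megatron-LM/megatron/*",
    "modules/Megatron-LM/examples/*",
    "modules/Megatron-LM/*.py",
    "modules/Model-Optimizer/modelopt/*",
    "modules/Model-Optimizer/examples/*",
    "services/*",  # lives at the repository root, not under launcher/
    "tests/*",     # lives at the repository root, not under launcher/
]
# First five patterns resolve under launcher/, the last two under the repo root.
relative_path = [LAUNCHER_DIR] * 5 + [MODELOPT_ROOT] * 2
# packager = run.PatternPackager(include_pattern=include_pattern,
#                                relative_path=relative_path)
```

Keeping the two lists index-aligned is what makes the last two patterns resolve correctly instead of being silently skipped.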
- Around line 417-424: Replace direct uses of the internal Experiment._id:
implement a small helper get_experiment_id(exp) that obtains the experiment id
via supported public APIs (try exp.id, then exp.get_id(), then
exp.metadata.get("id") or exp.to_dict().get("id")) and raise a clear error
instructing to update nemo_run if none exist; then call get_experiment_id(exp)
instead of exp._id in get_docker_executor(...) and get_slurm_executor(...) (and
the other occurrences referenced) so the launcher no longer depends on the
private _id attribute.
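A sketch of that helper, trying the public accessors in the order the review suggests. Whether each attribute or method exists on a given nemo_run Experiment version is an assumption; the helper simply skips any that are absent:

```python
def get_experiment_id(exp):
    """Resolve the experiment id through public APIs only (order per the review)."""
    exp_id = getattr(exp, "id", None)           # 1) plain `id` attribute
    if exp_id:
        return exp_id
    get_id = getattr(exp, "get_id", None)       # 2) `get_id()` accessor
    if callable(get_id):
        exp_id = get_id()
        if exp_id:
            return exp_id
    metadata = getattr(exp, "metadata", None)   # 3) metadata mapping
    if isinstance(metadata, dict) and metadata.get("id"):
        return metadata["id"]
    to_dict = getattr(exp, "to_dict", None)     # 4) dict serialization
    if callable(to_dict):
        exp_id = to_dict().get("id")
        if exp_id:
            return exp_id
    raise RuntimeError(
        "Could not resolve the experiment id via a public API; "
        "update nemo_run rather than relying on the private _id attribute."
    )
```

`get_docker_executor(...)` and `get_slurm_executor(...)` would then call `get_experiment_id(exp)` wherever they currently read `exp._id`.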
In `@launcher/modules/Megatron-LM`:
- Line 1: Document why the launcher submodule uses the personal fork
https://github.com/AAnoosheh/Megatron-LM.git (branch
aanuosh…/modelopt-example-import-order) by adding a short rationale in the repo
(e.g., README or launcher packager docs): list the exact diffs/changes required
from the fork that the launcher depends on (identify files/functions/patches),
explain why those changes cannot live in NVIDIA/Megatron-LM as-is, state whether
and when the changes will be upstreamed or removed (upstream PR IDs or
timeline), and describe the maintenance plan for this fork (how it will be
rebased/merged with upstream security fixes and who owns it). Ensure references
to the submodule URL, branch name, and the launcher packager are included so
reviewers can locate the dependency and the specific modifications.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: af7e3654-7a2f-40c2-9b19-f85cc6e8f625
📒 Files selected for processing (6)
- .gitmodules
- launcher/__init__.py
- launcher/launch.py
- launcher/modules/Megatron-LM
- launcher/modules/Model-Optimizer
- launcher/pyproject.toml
```
[submodule "launcher/modules/Megatron-LM"]
	path = launcher/modules/Megatron-LM
	url = https://github.com/AAnoosheh/Megatron-LM.git
```
🧩 Analysis chain
🌐 Web query:
What is the canonical GitHub repository for NVIDIA Megatron-LM, and is AAnoosheh/Megatron-LM an official upstream or a personal fork?
💡 Result:
- Canonical (official) NVIDIA Megatron-LM repo: `NVIDIA/Megatron-LM` on GitHub. [1] (Also linked from NVIDIA's Megatron Core developer page under "Megatron-LM".) [3]
- Is `AAnoosheh/Megatron-LM` official upstream? No; GitHub shows it is "forked from NVIDIA/Megatron-LM", so it is a personal fork, not the official upstream repository. [2]
Sources:
[1] NVIDIA/Megatron-LM GitHub repo page (github.com)
[2] AAnoosheh/Megatron-LM GitHub repo page (shows “forked from NVIDIA/Megatron-LM”) (github.com)
[3] NVIDIA Developer “Megatron Core” page (links to Megatron-LM on GitHub) (developer.nvidia.com)
Citations:
- 1: https://github.com/NVIDIA/Megatron-LM
- 2: https://github.com/AAnoosheh/Megatron-LM
- 3: https://developer.nvidia.com/megatron-core
Replace personal fork submodule with canonical upstream.
launcher/modules/Megatron-LM points to AAnoosheh/Megatron-LM, which is a personal fork of the official NVIDIA/Megatron-LM repository. Use the canonical upstream NVIDIA/Megatron-LM instead to ensure clear provenance and maintainability of this core runtime dependency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.gitmodules around lines 1 - 3, The .gitmodules entry for the submodule at
path "launcher/modules/Megatron-LM" currently points to the personal fork URL
"https://github.com/AAnoosheh/Megatron-LM.git"; update that submodule URL to the
canonical upstream "https://github.com/NVIDIA/Megatron-LM.git" (and then run git
submodule sync && git submodule update --init --recursive) so the
launcher/modules/Megatron-LM submodule references NVIDIA/Megatron-LM instead of
the personal fork.
```
[submodule "launcher/modules/Model-Optimizer"]
	path = launcher/modules/Model-Optimizer
	url = https://github.com/NVIDIA/Model-Optimizer.git
```
Don't add this repository as a submodule of itself.
Pointing launcher/modules/Model-Optimizer back at NVIDIA/Model-Optimizer creates a self-referential submodule. Recursive checkouts can keep nesting launcher/modules/Model-Optimizer/..., and the launcher ends up with two competing copies of ModelOpt checked out at once.
🧭 Minimal direction for the fix

```diff
-[submodule "launcher/modules/Model-Optimizer"]
-	path = launcher/modules/Model-Optimizer
-	url = https://github.com/NVIDIA/Model-Optimizer.git
```

Use the top-level working tree for ModelOpt packaging/mounting instead of a nested checkout, and drop the matching gitlink update.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-[submodule "launcher/modules/Model-Optimizer"]
-	path = launcher/modules/Model-Optimizer
-	url = https://github.com/NVIDIA/Model-Optimizer.git
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.gitmodules around lines 4 - 6, Remove the self-referential submodule entry
for "launcher/modules/Model-Optimizer" in .gitmodules (the block with path =
launcher/modules/Model-Optimizer and url =
https://github.com/NVIDIA/Model-Optimizer.git); instead rely on the top-level
Model-Optimizer working tree for packaging/mounting and remove any code that
updates the nested gitlink for launcher/modules/Model-Optimizer so you no longer
create or expect a nested checkout.
launcher/launch.py (Outdated)
```python
# Resolve paths relative to Model-Optimizer root (parent of launcher/)
LAUNCHER_DIR = os.path.dirname(os.path.abspath(__file__))
MODELOPT_ROOT = os.path.dirname(LAUNCHER_DIR)


# All paths relative to LAUNCHER_DIR so code/ mirrors the launcher directory.
# This produces the same layout as nmm-sandbox's slurm.py:
#   code/modules/Megatron-LM/megatron/...
#   code/modules/Model-Optimizer/modelopt/...
#   code/services/...
packager = run.PatternPackager(
    include_pattern=[
        "modules/Megatron-LM/megatron/*",
        "modules/Megatron-LM/examples/*",
        "modules/Megatron-LM/*.py",
        "modules/Model-Optimizer/modelopt/*",
        "modules/Model-Optimizer/examples/*",
        "services/*",
        "tests/*",
    ],
    relative_path=[LAUNCHER_DIR] * 7,
)
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "Packager block:"
sed -n '250,270p' launcher/launch.py
echo
echo "Matching directories anywhere in the repo:"
fd -HI -td '^services$|^tests$|^modules$' .
echo
echo "Matching directories specifically under launcher/:"
fd -HI -td '^services$|^tests$|^modules$' launcher
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1163
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
# Check what's under launcher/
echo "=== Contents of launcher/ (first 2 levels) ==="
tree -L 2 launcher/ 2>/dev/null || find launcher/ -maxdepth 2 -type d | head -20
echo
echo "=== Root-level directories ==="
ls -ld */ 2>/dev/null | grep -E 'services|tests|modules' || echo "None matching pattern"
echo
echo "=== Checking if these paths exist ==="
for path in launcher/services launcher/tests launcher/modules modelopt/torch services tests; do
  if [ -d "$path" ]; then
    echo "✓ $path EXISTS"
  else
    echo "✗ $path DOES NOT EXIST"
  fi
done
echo
echo "=== Root level tests/ directory (first few entries) ==="
ls -la tests/ 2>/dev/null | head -10 || echo "tests/ does not exist at root"
echo
echo "=== Checking services/ anywhere ==="
find . -maxdepth 2 -type d -name services 2>/dev/null || echo "No services/ directory found"
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1159
Critical: packager will silently miss tests and services directories.
The packager is configured to resolve services/* and tests/* under launcher/, but neither directory exists there:
- `launcher/services/` does not exist anywhere in the repository
- `launcher/tests/` does not exist; the actual `tests/` directory is at the repository root
- Only `launcher/modules/` exists and contains the expected subdirectories
Since MODELOPT_ROOT is computed but unused, the patterns for services/* and tests/* should resolve relative to MODELOPT_ROOT instead of LAUNCHER_DIR to match the actual directory layout. The current configuration will silently omit these files from the remote package without error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@launcher/launch.py` around lines 250 - 270, packager is using LAUNCHER_DIR
for all include_pattern entries, causing "services/*" and "tests/*" (which live
at the repository root / MODELOPT_ROOT) to be skipped; update the packager call
so the relative_path list maps the first five patterns to LAUNCHER_DIR and the
last two patterns ("services/*", "tests/*") to MODELOPT_ROOT (i.e., make
relative_path = [LAUNCHER_DIR]*5 + [MODELOPT_ROOT]*2) so the include_pattern
entries resolve to the correct directories; keep the include_pattern list
unchanged and only adjust the relative_path mapping in the packager invocation.
```diff
@@ -0,0 +1 @@
+Subproject commit 1e064f361256f34bf179c0cb808fd6287538f85a
```
❓ Verification inconclusive
Script executed:
```bash
#!/bin/bash
# Description: Verify submodule configuration and check commit provenance
# Check .gitmodules configuration
echo "=== Submodule Configuration ==="
cat .gitmodules | grep -A 2 "Megatron-LM"
# Verify the commit exists in the personal fork
echo -e "\n=== Checking commit in personal fork ==="
git ls-remote https://github.com/AAnoosheh/Megatron-LM.git | grep "1e064f361256f34bf179c0cb808fd6287538f85a" || echo "Commit not found in default refs"
# Check if this commit exists in the official NVIDIA repository
echo -e "\n=== Checking if commit exists in official NVIDIA repo ==="
git ls-remote https://github.com/NVIDIA/Megatron-LM.git | head -20
```

Repository: NVIDIA/Model-Optimizer
Exit code: 0
stdout:
=== Submodule Configuration ===
[submodule "launcher/modules/Megatron-LM"]
path = launcher/modules/Megatron-LM
url = https://github.com/AAnoosheh/Megatron-LM.git
[submodule "launcher/modules/Model-Optimizer"]
path = launcher/modules/Model-Optimizer
=== Checking commit in personal fork ===
1e064f361256f34bf179c0cb808fd6287538f85a refs/heads/aanoosheh/modelopt-example-import-order
=== Checking if commit exists in official NVIDIA repo ===
29e798a48da0f3e091c9364c8735f4021ea65811 HEAD
e6e6d4c79de3e40babf16d88d28dae5356ca0ae3 refs/heads/23.06_pt_23.09
375395c187ff64b8d56a1cd40572bc779864b1bd refs/heads/23.08
dd182053afc66a2195e018571537c8ce9f890df4 refs/heads/24.09-alpha
ad7d1695e00b0edd62990fe96277db0d30edf273 refs/heads/24.09-beta
53596ecae8b6e85890768f2b169fbf7911e4ed6c refs/heads/25.04-alpha
d33919011c7010e026d7c260d9d3016852cedcd2 refs/heads/25.09-alpha
06c70f44d56298a6d19dd30ff1ebeb0e85770518 refs/heads/bump-ci-container-2026-02-20-core_r0.16.0
71247489edcee7d4ca9c3e944eb82683d96c9a7e refs/heads/bump-ci-container-2026-02-23-main
7d2aeda817a4855d7a6aef1d7d65df615a018a1c refs/heads/bump-ci-container-2026-02-25-core_r0.16.0
a63f83247324f395383788301714816fb059c2b2 refs/heads/bump-ci-container-2026-02-25-main
84295f3771fc28a8626ef9735915cef21a010817 refs/heads/chcui/gpt_oss
cba6b0ed0f1ee2694ae9f7059f6d4b8a7cc995fb refs/heads/cherry-pick-2568-core_dev_r0.15.0
cba6b0ed0f1ee2694ae9f7059f6d4b8a7cc995fb refs/heads/cherry-pick-2640-core_dev_r0.15.0
082fa76a89a376c049a8998f82d87cae2bea5009 refs/heads/cherry-pick-3319-core_r0.16.0
082fa76a89a376c049a8998f82d87cae2bea5009 refs/heads/cherry-pick-3371-core_r0.16.0
5a1bf638162875f9cd296d6785e12df53fa1ad11 refs/heads/cherry-pick-3472-core_r0.16.0
955aff043676ecab128a103b9827935b0e245972 refs/heads/cherry-pick-3483-core_r0.16.0
55df528944e0bd5de6ea259cab9776ceac961801 refs/heads/cherry-pick-3583-core_r0.16.0
1b644f55f6e0ea9ed3e26c0c9c94649a59a6dd98 refs/heads/cherry-pick-3615-core_r0.16.0
Script executed:
```bash
#!/bin/bash
# Search for documentation about the Megatron-LM fork usage
echo "=== Checking for documentation or comments ==="
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "*.rst" \) | xargs grep -l "Megatron" 2>/dev/null | head -10
echo -e "\n=== Checking git history for submodule addition ==="
git log --oneline --all -- .gitmodules | head -5
echo -e "\n=== Checking commit message that added Megatron-LM submodule ==="
git log --all -p -- .gitmodules | grep -A 20 "Megatron-LM" | head -30
```

Repository: NVIDIA/Model-Optimizer
Exit code: 0
stdout:
=== Checking for documentation or comments ===
./CHANGELOG.rst
./CLAUDE.md
./CONTRIBUTING.md
./README.md
./docs/source/deployment/3_unified_hf.rst
./docs/source/getting_started/1_overview.rst
./docs/source/guides/3_pruning.rst
./docs/source/guides/7_nas.rst
./docs/source/guides/_pytorch_quantization.rst
./examples/llm_distill/README.md
=== Checking git history for submodule addition ===
d28acd3 add: ModelOpt Launcher for Slurm job submission
=== Checking commit message that added Megatron-LM submodule ===
in both. Includes Megatron-LM and Model-Optimizer as submodules.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..23a5af2
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,6 @@
+[submodule "launcher/modules/Megatron-LM"]
+ path = launcher/modules/Megatron-LM
+ url = https://github.com/AAnoosheh/Megatron-LM.git
+[submodule "launcher/modules/Model-Optimizer"]
+ path = launcher/modules/Model-Optimizer
+ url = https://github.com/NVIDIA/Model-Optimizer.git
Script executed:
```bash
#!/bin/bash
# Check documentation files for context
echo "=== README.md (first 100 lines) ==="
head -100 ./README.md
echo -e "\n=== CHANGELOG.rst (entries mentioning Megatron or launcher) ==="
grep -A 3 -B 1 -i "megatron\|launcher" ./CHANGELOG.rst | head -40
echo -e "\n=== Check launcher directory structure ==="
find ./launcher -type f -name "*.md" -o -name "*.txt" -o -name "*.py" | head -20
```

Repository: NVIDIA/Model-Optimizer
Exit code: 0
stdout:
=== README.md (first 100 lines) ===
<div align="center">

# NVIDIA Model Optimizer
[](https://nvidia.github.io/Model-Optimizer)
[](https://pypi.org/project/nvidia-modelopt/)
[](./LICENSE)
[Documentation](https://nvidia.github.io/Model-Optimizer) |
[Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
</div>
______________________________________________________________________
**NVIDIA Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization [techniques](`#techniques`) including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.
**[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.
**[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.
**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.
## Latest News
- [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
- [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
- [2025/10/07] [BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
- [2025/09/17] [BLOG: An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
- [2025/09/11] [BLOG: How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
- [2025/08/29] [BLOG: Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
- [2025/08/01] [BLOG: Optimizing LLMs for Performance and Accuracy with Post-Training Quantization](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)
- [2025/06/24] [BLOG: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
- [2025/05/14] [NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/)
- [2025/04/21] [Adobe optimized deployment using Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)
- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).
- [2025/01/28] Model Optimizer is now open source!
<details close>
<summary>Previous News</summary>
- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: [8B](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8), [70B](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8), [405B](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8).
- [2024/09/10] [Post-Training Quantization of LLMs with NVIDIA NeMo and Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/).
- [2024/08/28] [Boosting Llama 3.1 405B Performance up to 44% with Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
- [2024/08/28] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
- [2024/08/15] New features in recent releases: [Cache Diffusion](./examples/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/24.09/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)
- [2024/05/08] [Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
- [2024/03/27] [Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
- [2024/03/18] [GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/)
- [2024/03/07] [Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster](https://developer.nvidia.com/blog/tensorrt-accelerates-stable-diffusion-nearly-2x-faster-with-8-bit-post-training-quantization/)
- [2024/02/01] [Speed up inference with Model Optimizer quantization techniques in TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md)
</details>
## Install
To install stable release packages for Model Optimizer with `pip` from [PyPI](https://pypi.org/project/nvidia-modelopt/):
```bash
pip install -U nvidia-modelopt[all]
```
To install from source in editable mode with all development dependencies or to use the latest features, run:
```bash
# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer
pip install -e .[dev]
```
You can also directly use the [TensorRT-LLM docker images](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)
(e.g., `nvcr.io/nvidia/tensorrt-llm/release:<version>`), which have Model Optimizer pre-installed.
Make sure to upgrade Model Optimizer to the latest version as described above.
Visit our [installation guide](https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html) for
more fine-grained control on installed dependencies or for alternative docker images and environment variables to setup.
## Techniques
<div align="center">
| **Technique** | **Description** | **Examples** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/vlm_ptq/)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Quantization Aware Training | Refine accuracy even further with a few training steps! | \[[NeMo](./examples/llm_qat#nemo-qatqad-simplified-flow-example)\] \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \[[PyTorch](./examples/pruning/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html)\] |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \[[NeMo](./examples/llm_distill#knowledge-distillation-kd-for-nvidia-nemo-models)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | \[[Megatron](./examples/speculative_decoding#mlm-example)\] \[[Hugging Face](./examples/speculative_decoding/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/5_speculative_decoding.html)\] |
| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | \[[PyTorch](./examples/llm_sparsity/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/6_sparsity.html)\] |
</div>
=== CHANGELOG.rst (entries mentioning Megatron or launcher) ===
- Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Also add example for distillation with Megatron-Bridge framework. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
- Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
- Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
- Add support for context parallelism in Eagle speculative decoding for huggingface and megatron core models.
- Add unified Hugging Face export support for diffusers pipelines/components.
- Add LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow.
- Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
--
- Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).
**New Features**
- Add support for Transformer Engine quantization for Megatron Core models.
- Add support for Qwen3-Next model quantization.
- Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
- Add support for KV Cache Quantization for vLLM FakeQuant PTQ script. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#Calibrate-and-serve-fake-quant-model-in-vLLM>`__ for more details.
--
- Add support to enable custom emulated quantization backend. See :meth:`register_quant_backend <modelopt.torch.quantization.nn.modules.tensor_quantizer.register_quant_backend>` for more details. See an example in ``tests/unit/torch/quantization/test_custom_backend.py``.
- Add ``examples/llm_qad`` for QAD training with Megatron-LM.
**Deprecations**
--
- Add ``specdec_bench`` example to benchmark speculative decoding performance. See `examples/specdec_bench/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench#speculative-decoding-benchmark>`_ for more details.
- Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
- Add KL Divergence loss based auto_quantize method. See `auto_quantize API docs <https://nvidia.github.io/Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize>`_ for more details.
- Add support for saving and resuming auto_quantize search state. This speeds up the auto_quantize process by skipping the score estimation step if the search state is provided.
- Add flag ``trt_plugins_precision`` in ONNX autocast to indicate custom ops precision. This is similar to the flag already existing in the quantization workflow.
--
- Deprecate ``examples/megatron-lm`` in favor of more detailed documentation in `Megatron-LM/examples/post_training/modelopt <https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt>`_.
**Misc**
--
- Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
- Add Minitron pruning example for Megatron-LM framework. See `Megatron-LM/examples/post_training/modelopt <https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt>`_ for more details.
=== Check launcher directory structure ===
./launcher/__init__.py
./launcher/launch.py
Script executed:
#!/bin/bash
# Check launcher code to understand dependencies on Megatron-LM fork
echo "=== Launcher launch.py content ==="
cat ./launcher/launch.py
echo -e "\n=== Check if launcher references Megatron-LM ==="
grep -r "Megatron" ./launcher/ 2>/dev/null || echo "No direct Megatron references found"
echo -e "\n=== Check launcher __init__.py ==="
cat ./launcher/__init__.py
echo -e "\n=== Check if Megatron-LM submodule is actually imported/used ==="
rg "launcher/modules/Megatron-LM" --type py --type md --type rst 2>/dev/null || echo "No references to submodule path found"
echo -e "\n=== Check recent commit on the fork branch ==="
git ls-remote https://github.com/AAnoosheh/Megatron-LM.git refs/heads/aanoosheh/modelopt-example-import-order
Repository: NVIDIA/Model-Optimizer
Exit code: 0
stdout:
=== Launcher launch.py content ===
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ModelOpt Launcher — submit quantization, training, and evaluation jobs to Slurm clusters.
Usage:
uv run launch.py task=@configs/quantize/Qwen3-8B.yaml --yes
uv run launch.py pipeline=@configs/pipeline/eagle3.yaml --yes
uv run launch.py task=@configs/quantize/Qwen3-8B.yaml hf_local=/mnt/hf-local --yes
Environment variables:
SLURM_HOST Slurm login node hostname (required for remote jobs)
SLURM_ACCOUNT Slurm account/partition billing (default: from YAML)
SLURM_JOB_DIR Remote directory for job artifacts
SLURM_HF_LOCAL Path to HuggingFace model cache on the cluster
HF_TOKEN HuggingFace API token
NEMORUN_HOME NeMo Run home directory (default: current working directory)
"""
import dataclasses
import getpass
import json
import os
import re
import warnings
from dataclasses import dataclass
import nemo_run as run
import yaml
# ---------------------------------------------------------------------------
# Slurm configuration
# ---------------------------------------------------------------------------
@dataclass
class SlurmConfig:
"""Cluster-agnostic Slurm configuration.
Users define cluster details in their YAML configs or override via CLI.
No internal cluster defaults are embedded here.
"""
host: str | None = None
port: int = 22
account: str | None = None
partition: str = "batch"
container: str | None = None
modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt"
container_mounts: list[str] | None = None
srun_args: list[str] | None = None
array: str | None = None
nodes: int = 1
ntasks_per_node: int = 1
gpus_per_node: int = 1
local: bool = False
@run.cli.factory
@run.autoconvert
def slurm_factory(
host: str = os.environ.get("SLURM_HOST", ""),
account: str = os.environ.get("SLURM_ACCOUNT", ""),
partition: str = "batch",
nodes: int = 1,
ntasks_per_node: int = 1,
gpus_per_node: int = 1,
container: str = "nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5",
modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt",
container_mounts: list[str] | None = None,
srun_args: list[str] | None = None,
array: str | None = None,
) -> SlurmConfig:
"""Generic Slurm factory — configure via environment variables or CLI overrides."""
if container_mounts is None:
hf_local = os.environ.get("SLURM_HF_LOCAL", "/hf-local")
container_mounts = ["{}:/hf-local".format(hf_local)]
if srun_args is None:
srun_args = ["--no-container-mount-home"]
return SlurmConfig(
host=host,
account=account,
partition=partition,
nodes=nodes,
ntasks_per_node=ntasks_per_node,
gpus_per_node=gpus_per_node,
container=container,
modelopt_install_path=modelopt_install_path,
container_mounts=container_mounts,
srun_args=srun_args,
array=array,
)
# ---------------------------------------------------------------------------
# Default environment variables injected into every job
# ---------------------------------------------------------------------------
DEFAULT_SLURM_ENV = {
"HF_HOME": "/hf-cache",
"HF_TOKEN": os.getenv("HF_TOKEN", ""),
"MLM_SKIP_INSTALL": "1",
"LAUNCH_SCRIPT": "python",
}
DEFAULT_LOCAL_ENV = {
"HF_HOME": "/hf-cache",
"HF_TOKEN": os.getenv("HF_TOKEN", ""),
"MLM_SKIP_INSTALL": "1",
}
# ---------------------------------------------------------------------------
# Task and pipeline dataclasses
# ---------------------------------------------------------------------------
@dataclass
class SandboxTask:
"""A single task with a script, slurm config, args, and environment."""
script: str = None
slurm_config: SlurmConfig = None
args: list[str] = None
environment: list[dict] = None
yaml_file: str = None
@dataclass
class SandboxTask0(SandboxTask):
"""Task slot 0 in a pipeline."""
@dataclass
class SandboxTask1(SandboxTask):
"""Task slot 1 in a pipeline."""
@dataclass
class SandboxTask2(SandboxTask):
"""Task slot 2 in a pipeline."""
@dataclass
class SandboxTask3(SandboxTask):
"""Task slot 3 in a pipeline."""
@dataclass
class SandboxTask4(SandboxTask):
"""Task slot 4 in a pipeline."""
def create_task_from_yaml(yaml_file: str) -> SandboxTask:
"""Create a SandboxTask from a YAML config file."""
with open(yaml_file) as file:
config_from_yaml = yaml.safe_load(file)
script = config_from_yaml["script"]
function_name = config_from_yaml["slurm_config"].pop("_factory_")
slurm_config = globals()[function_name](**config_from_yaml["slurm_config"])
args = config_from_yaml.get("args", None)
environment = config_from_yaml.get("environment", None)
return SandboxTask(script=script, slurm_config=slurm_config, args=args, environment=environment)
@dataclass
class GlobalVariables:
"""Shared variables for <<global_vars.X>> interpolation in pipeline YAMLs."""
hf_model: str = None
hf_data: str = None
@dataclass
class SandboxPipeline:
"""A multi-task pipeline with shared global variables and task dependencies."""
global_vars: GlobalVariables = None
task_0: SandboxTask0 = None
task_1: SandboxTask1 = None
task_2: SandboxTask2 = None
task_3: SandboxTask3 = None
task_4: SandboxTask4 = None
tasks: list[SandboxTask] = None
test_level: int = 0
allow_to_fail: bool = False
skip: bool = False
note: str = ""
task_configs: list[str] = None
experiment = None
def __post_init__(self):
if self.tasks is None:
self.tasks = []
for i in range(5):
task = getattr(self, "task_{}".format(i), None)
if task is not None:
self.tasks += [task]
if self.task_configs is not None:
self.tasks += [
create_task_from_yaml(yaml_file=yaml_file) for yaml_file in self.task_configs
]
if self.global_vars is not None:
global_vars_dict = {
k: v for k, v in dataclasses.asdict(self.global_vars).items() if v is not None
}
def _resolve(s):
if not isinstance(s, str):
return s
return re.sub(
r"<<global_vars\.(\w+)>>",
lambda m: global_vars_dict.get(m.group(1), m.group(0)),
s,
)
for task in self.tasks:
if task.environment:
if isinstance(task.environment, list):
task.environment = [
{k: _resolve(v) for k, v in item.items()} for item in task.environment
]
else:
task.environment = {k: _resolve(v) for k, v in task.environment.items()}
if task.args:
task.args = [_resolve(a) for a in task.args]
# ---------------------------------------------------------------------------
# Code packager — sync only the necessary source trees to the cluster
# ---------------------------------------------------------------------------
# Resolve paths relative to Model-Optimizer root (parent of launcher/)
LAUNCHER_DIR = os.path.dirname(os.path.abspath(__file__))
MODELOPT_ROOT = os.path.dirname(LAUNCHER_DIR)
# All paths relative to LAUNCHER_DIR so code/ mirrors the launcher directory.
# This produces the same layout as nmm-sandbox's slurm.py:
# code/modules/Megatron-LM/megatron/...
# code/modules/Model-Optimizer/modelopt/...
# code/services/...
packager = run.PatternPackager(
include_pattern=[
"modules/Megatron-LM/megatron/*",
"modules/Megatron-LM/examples/*",
"modules/Megatron-LM/*.py",
"modules/Model-Optimizer/modelopt/*",
"modules/Model-Optimizer/examples/*",
"services/*",
"tests/*",
],
relative_path=[LAUNCHER_DIR] * 7,
)
# ---------------------------------------------------------------------------
# Executor builders
# ---------------------------------------------------------------------------
def get_slurm_executor(user, identity, slurm_config, experiment_id, job_dir, task_name):
"""Build a SlurmExecutor for remote job submission."""
container_mounts = slurm_config.container_mounts or []
scratch_dst = "/scratchspace"
scratch_src = job_dir + "/cicd/" + experiment_id
modelopt_dst = slurm_config.modelopt_install_path
modelopt_src = (
job_dir
+ "/cicd/"
+ experiment_id
+ "/{}/code/modules/Model-Optimizer/modelopt".format(task_name)
)
container_mounts = [
*container_mounts,
scratch_src + ":" + scratch_dst,
modelopt_src + ":" + modelopt_dst,
]
tunnel = run.SSHTunnel(
host=slurm_config.host,
user=getpass.getuser() if user is None else user,
port=slurm_config.port,
job_dir=job_dir,
identity=identity,
)
executor = run.SlurmExecutor(
account=slurm_config.account,
partition=slurm_config.partition,
ntasks_per_node=slurm_config.ntasks_per_node,
gpus_per_node=slurm_config.gpus_per_node,
nodes=slurm_config.nodes,
tunnel=tunnel,
container_image=slurm_config.container,
container_mounts=container_mounts,
array=slurm_config.array,
time="04:00:00",
mem="0",
retries=0,
packager=packager,
srun_args=slurm_config.srun_args,
)
return executor
def get_docker_executor(hf_local, slurm_config, experiment_id, job_dir, task_name):
"""Build a DockerExecutor for local GPU jobs."""
if slurm_config.local:
container_mounts = list(slurm_config.container_mounts or [])
else:
container_mounts = []
container_mounts += [hf_local + ":/hf-local", job_dir + "/cicd:/cicd"]
scratch_dst = "/scratchspace"
scratch_src = job_dir + "/cicd/" + experiment_id + "/" + task_name
modelopt_dst = slurm_config.modelopt_install_path
modelopt_src = os.path.join(LAUNCHER_DIR, "modules/Model-Optimizer/modelopt")
container_mounts += [scratch_src + ":" + scratch_dst, modelopt_src + ":" + modelopt_dst]
executor = run.DockerExecutor(
num_gpus=-1,
runtime="nvidia",
ipc_mode="host",
container_image=slurm_config.container,
volumes=container_mounts,
additional_kwargs={"user": "{}:{}".format(os.getuid(), os.getgid())},
packager=packager,
)
return executor
# ---------------------------------------------------------------------------
# Main entrypoint
# ---------------------------------------------------------------------------
@run.cli.entrypoint
def launch(
ctx: run.cli.RunContext,
job_name: str = "01_job",
job_dir: str = os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")),
task: SandboxTask | None = None,
pipeline: SandboxPipeline | None = None,
hf_local: str | None = None,
user: str = getpass.getuser(),
identity: str | None = None,
) -> None:
"""Launch ModelOpt jobs on Slurm or locally with Docker.
Args:
job_name: Name of the job.
job_dir: Remote directory for job artifacts.
task: Single task config (from YAML).
pipeline: Multi-task pipeline config (from YAML).
hf_local: Path to local HF cache (enables local Docker execution).
user: SSH user for Slurm tunnel.
identity: SSH identity file for Slurm tunnel.
"""
if "NEMORUN_HOME" not in os.environ:
warnings.warn("NEMORUN_HOME is not set. Defaulting to current working directory.")
run.config.set_nemorun_home(os.environ.get("NEMORUN_HOME", os.getcwd()))
if hf_local is not None:
job_dir = os.getcwd() + "/experiments"
job_table = {}
if task is not None:
job_table[job_name] = SandboxPipeline(tasks=[task])
elif pipeline is not None:
job_table[job_name] = pipeline
else:
print("No task or pipeline provided. Use task=@<yaml> or pipeline=@<yaml>.")
return
for job_name, job in job_table.items(): # noqa: PLR1704
if job.skip:
continue
dependency = None
exp = run.Experiment("modelopt", log_level="INFO")
job.experiment = exp
with exp:
for task_id, task in enumerate(job.tasks): # noqa: PLR1704
task_name = job_name + "_" + str(task_id)
task_args = [] if task.args is None else task.args
task_env = {}
if task.environment is not None:
if isinstance(task.environment, list):
for item in task.environment:
task_env.update(item.items())
else:
task_env = task.environment
for k, v in task_env.items():
task_env[k] = "" if v is None else str(v)
if hf_local is not None:
executor = get_docker_executor(
hf_local, task.slurm_config, exp._id, job_dir, task_name
)
task_env.update(DEFAULT_LOCAL_ENV)
else:
executor = get_slurm_executor(
user, identity, task.slurm_config, exp._id, job_dir, task_name
)
task_env.update(DEFAULT_SLURM_ENV)
task_instance = run.Script(task.script, args=task_args, env=task_env)
print(
"job {} task {} slurm_config: {}".format(job_name, task_id, task.slurm_config)
)
if dependency is None:
dependency = exp.add(
task_instance, tail_logs=True, name=task_name, executor=executor
)
else:
dependency = exp.add(
task_instance,
tail_logs=True,
name=task_name,
executor=executor,
dependencies=[dependency],
)
exp.run(detach=ctx.detach)
# Write metadata for downstream tools
metadata = {
"experiment_id": exp._id,
"job_name": job_name,
"allow_to_fail": job.allow_to_fail,
"note": job.note,
}
metadata_path = os.path.join("experiments", "modelopt", exp._id, "metadata.json")
os.makedirs(os.path.dirname(metadata_path), exist_ok=True)
with open(metadata_path, "w") as f:
json.dump(metadata, f)
if __name__ == "__main__":
run.cli.main(launch)
=== Check if launcher references Megatron-LM ===
./launcher/launch.py:# code/modules/Megatron-LM/megatron/...
./launcher/launch.py: "modules/Megatron-LM/megatron/*",
./launcher/launch.py: "modules/Megatron-LM/examples/*",
./launcher/launch.py: "modules/Megatron-LM/*.py",
=== Check launcher __init__.py ===
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ModelOpt Launcher — submit quantization, training, and evaluation jobs to Slurm clusters."""
=== Check if Megatron-LM submodule is actually imported/used ===
No references to submodule path found
=== Check recent commit on the fork branch ===
1e064f361256f34bf179c0cb808fd6287538f85a refs/heads/aanoosheh/modelopt-example-import-order
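The `<<global_vars.X>>` interpolation that `__post_init__` performs in the quoted launch.py can be exercised standalone; the sample values below are hypothetical placeholders, not taken from any shipped config:

```python
import re

# Hypothetical globals table, mirroring GlobalVariables in launch.py above.
global_vars_dict = {"hf_model": "Qwen/Qwen3-8B", "hf_data": "cnn_dailymail"}

def resolve(s):
    """Replace <<global_vars.X>> markers; unknown names are left untouched."""
    if not isinstance(s, str):
        return s
    return re.sub(
        r"<<global_vars\.(\w+)>>",
        lambda m: global_vars_dict.get(m.group(1), m.group(0)),
        s,
    )

print(resolve("--model <<global_vars.hf_model>>"))   # --model Qwen/Qwen3-8B
print(resolve("<<global_vars.missing>> stays put"))  # <<global_vars.missing>> stays put
print(resolve(42))                                   # 42 (non-strings pass through)
```

Note that an unresolved marker survives verbatim (the lambda falls back to `m.group(0)`), which matches the pipeline code's behavior of leaving unknown names in place rather than failing.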
Clarify the rationale for using a personal fork of Megatron-LM in the launcher submodule.
The submodule points to https://github.com/AAnoosheh/Megatron-LM.git (a personal fork) rather than the official NVIDIA repository. The launcher's packager includes this submodule to sync code to clusters for job execution. While the fork branch (aanoosheh/modelopt-example-import-order) suggests intentional, scoped changes, the rationale is undocumented.
Please provide:
- What specific changes in the fork are necessary for the launcher, and why they cannot be in the official
NVIDIA/Megatron-LM? - Is there a plan to upstream these changes or transition to the official repository?
- How is this fork maintained and kept in sync with upstream security patches?
Without this documentation, this dependency represents a maintainability and supply chain risk. If these changes are general improvements, they should be contributed back to the official repository.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@launcher/modules/Megatron-LM` at line 1, Document why the launcher submodule
uses the personal fork https://github.com/AAnoosheh/Megatron-LM.git (branch
aanuosh…/modelopt-example-import-order) by adding a short rationale in the repo
(e.g., README or launcher packager docs): list the exact diffs/changes required
from the fork that the launcher depends on (identify files/functions/patches),
explain why those changes cannot live in NVIDIA/Megatron-LM as-is, state whether
and when the changes will be upstreamed or removed (upstream PR IDs or
timeline), and describe the maintenance plan for this fork (how it will be
rebased/merged with upstream security fixes and who owns it). Ensure references
to the submodule URL, branch name, and the launcher packager are included so
reviewers can locate the dependency and the specific modifications.
Extract shared logic (dataclasses, executor builders, run loop, version reporting) into core.py. Both launch.py and nmm-sandbox's slurm.py import from core.py to avoid divergence. Add slurm_config.py with generic env-var-driven factory, service scripts, Qwen3-8B PTQ example, and README with usage, flags, and bug reporting instructions. Verified: same YAML produces identical MMLU 0.736 on OCI-HSG and 0.719 locally via both slurm.py and launch.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
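The YAML-driven `_factory_` dispatch that `create_task_from_yaml` implements (see launch.py quoted above) can be sketched in isolation. The dict below stands in for what `yaml.safe_load` would return; `make_slurm_config` and the registry dict are hypothetical stand-ins for the real `slurm_factory` and the `globals()[...]` lookup:

```python
# Sketch of the _factory_ dispatch: the slurm_config block names a factory
# function, which is popped out and looked up before the remaining keys are
# forwarded as keyword arguments. make_slurm_config is a stand-in factory.
def make_slurm_config(**kwargs):
    return {"partition": "batch", **kwargs}

FACTORIES = {"make_slurm_config": make_slurm_config}

# What yaml.safe_load would produce for a minimal task YAML.
config_from_yaml = {
    "script": "services/megatron-lm/quantize/quantize.sh",
    "slurm_config": {"_factory_": "make_slurm_config", "nodes": 1, "gpus_per_node": 8},
}

factory_name = config_from_yaml["slurm_config"].pop("_factory_")
slurm_config = FACTORIES[factory_name](**config_from_yaml["slurm_config"])
print(slurm_config)  # {'partition': 'batch', 'nodes': 1, 'gpus_per_node': 8}
```

An explicit registry like `FACTORIES` is slightly safer than the `globals()` lookup in launch.py, since only whitelisted factory names can be invoked from a config file.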
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (7)
launcher/services/megatron-lm/quantize/quantize.sh (2)
46-47: `exit_handler` is called with an argument it doesn't accept.
`exit_handler` in `service_utils.sh` takes no parameters, but it's called with `$0`. This is harmless but inconsistent with its definition.
🔧 Suggested fix
 # This function handles the exit status (fails the CI).
-exit_handler $0
+exit_handler
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 46 - 47, The call to exit_handler in quantize.sh passes an unused argument ($0) even though exit_handler (defined in service_utils.sh) accepts no parameters; update the call in quantize.sh to invoke exit_handler with no arguments (replace "exit_handler $0" with "exit_handler") so it matches the function signature and removes the inconsistency while keeping behavior unchanged.
33-36: Remove or document unused `CONVERT_EXE` and `EXPORT_EXE`.
These variables are defined but never used. If they're placeholders for future workflow steps, add a comment; otherwise remove them to reduce confusion.
🔧 Suggested fix
 QUANTIZE_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/quantize.sh"
 MMLU_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/mmlu.sh"
-CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
-EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
+# TODO: Add convert and export steps
+# CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
+# EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 33 - 36, CONVERT_EXE and EXPORT_EXE are defined but never used; either remove their definitions or document them as placeholders. Edit the file to either delete the lines defining CONVERT_EXE and EXPORT_EXE, or add a brief comment above them stating they are intentional placeholders for future steps (e.g., "placeholder for future convert/export scripts") and keep the variables if needed; leave QUANTIZE_EXE and MMLU_EXE unchanged. Ensure the chosen change keeps the script consistent (no unused vars unless explicitly documented).
launcher/services/service_utils.sh (2)
24-25: Remove or use the `FAIL` variable.
`FAIL` is set in `error_handler` (line 34) but never read—only `FAIL_EXIT` is used. Either remove `FAIL` or export it if external scripts rely on it.
🔧 Suggested fix
-FAIL=0
 FAIL_EXIT=0
Or if needed externally:
-FAIL=0
+export FAIL=0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/service_utils.sh` around lines 24 - 25, The variable FAIL is assigned but never read; update the code path around the error_handler and variable declarations to either remove the unused FAIL variable or export/consume it where intended: if other scripts rely on it, export FAIL (e.g., export FAIL) and ensure error_handler sets it; otherwise delete FAIL and only keep FAIL_EXIT. Locate the declarations FAIL and FAIL_EXIT and the error_handler function to make the change.
50-54: Pin the `diskcache` version for reproducibility.
Installing `diskcache` without a version specifier can lead to non-deterministic builds and potential compatibility issues.
🔧 Suggested fix
 function util_install_extra_dep {
     if [[ "$mpi_local_rank" -eq 0 ]]; then
-        pip install diskcache
+        pip install 'diskcache>=5.6,<6.0'
     fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/services/service_utils.sh` around lines 50 - 54, The pip install call in function util_install_extra_dep currently installs diskcache without a version; update util_install_extra_dep to pin diskcache to a specific, tested version (e.g., replace the pip install diskcache invocation with a pinned install like pip install diskcache==<VERSION>) or read the version from a single constant/variable (e.g., DISKCACHE_VERSION) so builds are deterministic; keep the mpi_local_rank guard unchanged.
launcher/pyproject.toml (1)
6-9: Consider pinning the `pyyaml` version for reproducibility.
`nemo-run` is pinned to a specific commit hash, but `pyyaml` is unpinned. For consistent, reproducible builds, consider pinning to a specific version (e.g., `pyyaml>=6.0,<7.0` or an exact version).
Verify each finding against the current code and only fix it if needed. In `@launcher/pyproject.toml` around lines 6 - 9, The dependency list in pyproject.toml pins "nemo-run" to a commit but leaves "pyyaml" unpinned, risking non-reproducible builds; update the dependencies array to pin "pyyaml" to a specific range or exact version (for example change the "pyyaml" entry in the dependencies list to something like "pyyaml>=6.0,<7.0" or an exact "pyyaml==6.0.1") so that the dependencies block in pyproject.toml deterministically resolves.
launcher/slurm_config.py (2)
58-61: Mutable default arguments in function signature.
`container_mounts` and `srun_args` use list literals as default values. In Python, mutable defaults are evaluated once at function definition time and shared across all calls, which can lead to unexpected behavior if the lists are modified.
🔧 Suggested fix
 def slurm_factory(
     host: str = os.environ.get("SLURM_HOST", ""),
     account: str = os.environ.get("SLURM_ACCOUNT", ""),
     partition: str = "batch",
     nodes: int = 1,
     ntasks_per_node: int = 1,
     gpus_per_node: int = 1,
     container: str = "nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5",
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt",
-    container_mounts: list[str] = [
-        "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
-    ],
-    srun_args: list[str] = ["--no-container-mount-home"],
+    container_mounts: list[str] | None = None,
+    srun_args: list[str] | None = None,
     array: str = None,  # noqa: RUF013
 ) -> SlurmConfig:
     """Generic Slurm factory — configure via environment variables or CLI overrides."""
+    if container_mounts is None:
+        container_mounts = [
+            "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
+        ]
+    if srun_args is None:
+        srun_args = ["--no-container-mount-home"]
     return SlurmConfig(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/slurm_config.py` around lines 58 - 61, The defaults for container_mounts and srun_args are currently mutable list literals; change their default values to None in the function or constructor signature (e.g., container_mounts: Optional[list[str]] = None, srun_args: Optional[list[str]] = None) and inside the function initialize them to fresh lists if None (create the container_mounts list using "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")) and set srun_args to ["--no-container-mount-home"]) so each call gets its own list and avoids shared mutable state; update any references to these parameters accordingly (look for container_mounts and srun_args in the function/class where they are defined).
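The pitfall the comment above describes can be demonstrated in a few lines; `add_mount` and the mount strings are hypothetical, chosen to echo the `container_mounts` parameter:

```python
# A list literal used as a default is created once at definition time and
# shared across every call, so appends from one call leak into the next.
def add_mount(mount, mounts=[]):  # BAD: shared default list
    mounts.append(mount)
    return mounts

first = add_mount("/hf-local:/hf-local")
second = add_mount("/data:/data")  # unexpectedly contains both entries
print(second)  # ['/hf-local:/hf-local', '/data:/data']

def add_mount_fixed(mount, mounts=None):  # GOOD: fresh list per call
    if mounts is None:
        mounts = []
    mounts.append(mount)
    return mounts

print(add_mount_fixed("/data:/data"))  # ['/data:/data']
```

The `None`-sentinel pattern in `add_mount_fixed` is exactly what the suggested fix applies to `slurm_factory`.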
32-44: Type hints don't match default values.
Several fields have type hints (e.g., `str`, `list[str]`) but default to `None`. Consider using union types for clarity:
🔧 Suggested fix
 @dataclass
 class SlurmConfig:
-    host: str = None
+    host: str | None = None
     port: int = 22
-    account: str = None
+    account: str | None = None
     partition: str = "batch"
-    container: str = None
+    container: str | None = None
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt"
-    container_mounts: list[str] = None
-    srun_args: list[str] = None
-    array: str = None
+    container_mounts: list[str] | None = None
+    srun_args: list[str] | None = None
+    array: str | None = None
     nodes: int = 1
     ntasks_per_node: int = 1
     gpus_per_node: int = 1
     local: bool = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/slurm_config.py` around lines 32 - 44, The annotated types for several dataclass/config fields (host, account, container, container_mounts, srun_args, array) conflict with their None defaults; update their type hints to allow None (e.g., use Optional[str] for host/account/container/array and Optional[list[str]] for container_mounts/srun_args) and add the necessary import from typing (Optional) or use union syntax (str | None, list[str] | None) so the annotation matches the default values; keep the existing defaults (None) and leave other fields (port, nodes, ntasks_per_node, gpus_per_node, modelopt_install_path, local, partition) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@launcher/core.py`:
- Around line 313-329: Remove the inline "# nosec" comments: delete the "# nosec
B404" on the "import subprocess" line and the "# nosec B603 B607" comments on
the subprocess.run calls that set the commit and branch variables (the lines
invoking git rev-parse for commit and branch). Keep the subprocess calls as-is
(list args, no shell=True, capture_output/text/timeout) and instead add to your
PR description or a nearby code comment a request for formal review by
`@NVIDIA/modelopt-setup-codeowners` with documented justification for why these
subprocess usages are safe.
In `@launcher/services/megatron-lm/quantize/quantize.sh`:
- Line 38: The assignment to MLM_EXTRA_ARGS incorrectly expands "${@}" into a
scalar; change it to either capture all args as a single string using "$*" (if
you want one string) or make MLM_EXTRA_ARGS an array by declaring
MLM_EXTRA_ARGS=("$@") (if you need separate elements), and update subsequent
uses of MLM_EXTRA_ARGS accordingly so they treat it as a string or an array as
intended.
In `@launcher/services/service_utils.sh`:
- Around line 59-62: Replace the runtime append of __version__ from the script:
stop modifying modules/Model-Optimizer/modelopt/__init__.py at runtime and
instead write/read a dedicated version file (e.g.,
modules/Model-Optimizer/VERSION) or inject the version at build-time; remove the
block that checks mpi_local_rank and echoes "__version__ = '1.0.0'"; implement a
safe creation/update step that checks for an existing VERSION file (or existing
__version__ entry) before writing and ensure the write happens in a
build/pre-deploy step or by a single global coordinator (not per-node
mpi_local_rank==0) to avoid race conditions and git dirty state.
---
Nitpick comments:
In `@launcher/pyproject.toml`:
- Around line 6-9: The dependency list in pyproject.toml pins "nemo-run" to a
commit but leaves "pyyaml" unpinned, risking non-reproducible builds; update the
dependencies array to pin "pyyaml" to a specific range or exact version (for
example change the "pyyaml" entry in the dependencies list to something like
"pyyaml>=6.0,<7.0" or an exact "pyyaml==6.0.1") so that the dependencies block
in pyproject.toml deterministically resolves.
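A pinned dependency block might look like this (a sketch only; the version range is illustrative, and the nemo-run commit hash is whatever the project already pins, left here as a placeholder):

```toml
[project]
dependencies = [
    "nemo-run @ git+https://github.com/NVIDIA/NeMo-Run.git@<pinned-commit>",
    "pyyaml>=6.0,<7.0",
]
```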
In `@launcher/services/megatron-lm/quantize/quantize.sh`:
- Around line 46-47: The call to exit_handler in quantize.sh passes an unused
argument ($0) even though exit_handler (defined in service_utils.sh) accepts no
parameters; update the call in quantize.sh to invoke exit_handler with no
arguments (replace "exit_handler $0" with "exit_handler") so it matches the
function signature and removes the inconsistency while keeping behavior
unchanged.
- Around line 33-36: CONVERT_EXE and EXPORT_EXE are defined but never used;
either remove their definitions or document them as placeholders. Edit the file
to either delete the lines defining CONVERT_EXE and EXPORT_EXE, or add a brief
comment above them stating they are intentional placeholders for future steps
(e.g., "placeholder for future convert/export scripts") and keep the variables
if needed; leave QUANTIZE_EXE and MMLU_EXE unchanged. Ensure the chosen change
keeps the script consistent (no unused vars unless explicitly documented).
In `@launcher/services/service_utils.sh`:
- Around line 24-25: The variable FAIL is assigned but never read; update the
code path around the error_handler and variable declarations to either remove
the unused FAIL variable or export/consume it where intended: if other scripts
rely on it, export FAIL (e.g., export FAIL) and ensure error_handler sets it;
otherwise delete FAIL and only keep FAIL_EXIT. Locate the declarations FAIL and
FAIL_EXIT and the error_handler function to make the change.
- Around line 50-54: The pip install call in function util_install_extra_dep
currently installs diskcache without a version; update util_install_extra_dep to
pin diskcache to a specific, tested version (e.g., replace the pip install
diskcache invocation with a pinned install like pip install
diskcache==<VERSION>) or read the version from a single constant/variable (e.g.,
DISKCACHE_VERSION) so builds are deterministic; keep the mpi_local_rank guard
unchanged.
In `@launcher/slurm_config.py`:
- Around line 58-61: The defaults for container_mounts and srun_args are
currently mutable list literals; change their default values to None in the
function or constructor signature (e.g., container_mounts: Optional[list[str]] =
None, srun_args: Optional[list[str]] = None) and inside the function initialize
them to fresh lists if None (create the container_mounts list using
"{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")) and set
srun_args to ["--no-container-mount-home"]) so each call gets its own list and
avoids shared mutable state; update any references to these parameters
accordingly (look for container_mounts and srun_args in the function/class where
they are defined).
- Around line 32-44: The annotated types for several dataclass/config fields
(host, account, container, container_mounts, srun_args, array) conflict with
their None defaults; update their type hints to allow None (e.g., use
Optional[str] for host/account/container/array and Optional[list[str]] for
container_mounts/srun_args) and add the necessary import from typing (Optional)
or use union syntax (str | None, list[str] | None) so the annotation matches the
default values; keep the existing defaults (None) and leave other fields (port,
nodes, ntasks_per_node, gpus_per_node, modelopt_install_path, local, partition)
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 56d56135-8c67-4a05-b578-e10120a5789f
📒 Files selected for processing (8)
launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
launcher/README.md
launcher/core.py
launcher/launch.py
launcher/pyproject.toml
launcher/services/megatron-lm/quantize/quantize.sh
launcher/services/service_utils.sh
launcher/slurm_config.py
✅ Files skipped from review due to trivial changes (1)
- launcher/README.md
```python
import subprocess  # nosec B404

try:
    commit = subprocess.run(  # nosec B603 B607
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=path,
        capture_output=True,
        text=True,
        timeout=5,
    ).stdout.strip()
    branch = subprocess.run(  # nosec B603 B607
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        cwd=path,
        capture_output=True,
        text=True,
        timeout=5,
    ).stdout.strip()
```
Remove # nosec comments per coding guidelines.
The coding guidelines state: "# nosec comments are not allowed as bypasses for security checks; contributors must request formal review from @NVIDIA/modelopt-setup-codeowners with documented justification instead."
The subprocess calls here are safe (list arguments, no shell=True, hardcoded commands), but the bypass comments should be removed and the pattern formally reviewed.
🔧 Suggested fix
```diff
 def _git_info(path):
     """Get git commit hash and branch for a directory."""
-    import subprocess  # nosec B404
+    import subprocess
     try:
-        commit = subprocess.run(  # nosec B603 B607
+        commit = subprocess.run(
             ["git", "rev-parse", "--short", "HEAD"],
             cwd=path,
             capture_output=True,
             text=True,
             timeout=5,
         ).stdout.strip()
-        branch = subprocess.run(  # nosec B603 B607
+        branch = subprocess.run(
             ["git", "rev-parse", "--abbrev-ref", "HEAD"],
             cwd=path,
             capture_output=True,
             text=True,
             timeout=5,
         ).stdout.strip()
```

As per coding guidelines: "Disallow use of '# nosec' comments to bypass Bandit security checks."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import subprocess

try:
    commit = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=path,
        capture_output=True,
        text=True,
        timeout=5,
    ).stdout.strip()
    branch = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        cwd=path,
        capture_output=True,
        text=True,
        timeout=5,
    ).stdout.strip()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@launcher/core.py` around lines 313 - 329, Remove the inline "# nosec"
comments: delete the "# nosec B404" on the "import subprocess" line and the "#
nosec B603 B607" comments on the subprocess.run calls that set the commit and
branch variables (the lines invoking git rev-parse for commit and branch). Keep
the subprocess calls as-is (list args, no shell=True,
capture_output/text/timeout) and instead add to your PR description or a nearby
code comment a request for formal review by `@NVIDIA/modelopt-setup-codeowners`
with documented justification for why these subprocess usages are safe.
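A sketch of what the helper could look like once the suppressions are dropped, using only the call pattern already shown above plus `check=True` and a `shutil.which` guard (function name, fallback values, and error handling are illustrative, not the PR's final code):

```python
import shutil
import subprocess


def git_info(path):
    """Return (commit, branch) for a directory, or ("unknown", "unknown").

    List-form argv, no shell=True, an explicit timeout, and check=True with
    a caught exception replace the Bandit suppression comments.
    """
    if shutil.which("git") is None:
        return "unknown", "unknown"
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=path, capture_output=True, text=True, timeout=5, check=True,
        ).stdout.strip()
        branch = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=path, capture_output=True, text=True, timeout=5, check=True,
        ).stdout.strip()
    except (subprocess.SubprocessError, OSError):
        return "unknown", "unknown"
    return commit, branch
```

`subprocess.CalledProcessError` and `TimeoutExpired` are both subclasses of `SubprocessError`, so one handler covers failed and hung `git` invocations.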
```bash
CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"

export MLM_EXTRA_ARGS=${@}
```
Fix array-to-string assignment.
${@} expands to separate words but is assigned to a scalar. Use "$*" if a single string is intended, or declare MLM_EXTRA_ARGS as an array.
🔧 Suggested fix (if string is intended)
```diff
-export MLM_EXTRA_ARGS=${@}
+export MLM_EXTRA_ARGS="$*"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
export MLM_EXTRA_ARGS="$*"
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 38-38: Assigning an array to a string! Assign as array, or use * instead of @ to concatenate.
(SC2124)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@launcher/services/megatron-lm/quantize/quantize.sh` at line 38, The
assignment to MLM_EXTRA_ARGS incorrectly expands "${@}" into a scalar; change it
to either capture all args as a single string using "$*" (if you want one
string) or make MLM_EXTRA_ARGS an array by declaring MLM_EXTRA_ARGS=("$@") (if
you need separate elements), and update subsequent uses of MLM_EXTRA_ARGS
accordingly so they treat it as a string or an array as intended.
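The difference between the two options can be seen in a few lines of bash (argument values are made up for the demo; the real script receives them from the service invocation):

```shell
# Stand in for the script's positional parameters.
set -- --foo "a b" --bar

# Option 1: a single flat string. "$*" joins the args with spaces, but the
# original word boundaries are lost ("a b" was one argument).
MLM_EXTRA_ARGS_STR="$*"

# Option 2: an array. "$@" preserves each argument intact. Note that bash
# cannot `export` arrays, so this only works for in-process consumers.
MLM_EXTRA_ARGS=("$@")

echo "string: ${MLM_EXTRA_ARGS_STR}"
echo "array length: ${#MLM_EXTRA_ARGS[@]}"
```

Since quantize.sh exports the variable for child processes, the flat-string form is likely the intended one; the array form is only needed if downstream code must recover the exact argument list.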
```bash
# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
    echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
fi
```
Runtime modification of __init__.py is fragile and may cause race conditions.
Appending __version__ to the modelopt __init__.py at runtime:
- Modifies source files within a submodule, which can cause git dirty state
- May race with other MPI ranks despite the `mpi_local_rank == 0` guard (multiple nodes each have a local rank 0)
- Doesn't check if `__version__` already exists, causing duplicate entries on repeated runs
Consider moving version management to build-time or using a separate version file that's explicitly managed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@launcher/services/service_utils.sh` around lines 59 - 62, Replace the runtime
append of __version__ from the script: stop modifying
modules/Model-Optimizer/modelopt/__init__.py at runtime and instead write/read a
dedicated version file (e.g., modules/Model-Optimizer/VERSION) or inject the
version at build-time; remove the block that checks mpi_local_rank and echoes
"__version__ = '1.0.0'"; implement a safe creation/update step that checks for
an existing VERSION file (or existing __version__ entry) before writing and
ensure the write happens in a build/pre-deploy step or by a single global
coordinator (not per-node mpi_local_rank==0) to avoid race conditions and git
dirty state.
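A minimal idempotent sketch of the suggested VERSION-file approach (paths and the version string are illustrative; a temp directory stands in for the real `modules/Model-Optimizer/VERSION` location, and cross-node coordination is assumed to happen at build/packaging time):

```shell
WORKDIR="$(mktemp -d)"
VERSION_FILE="$WORKDIR/VERSION"

write_version() {
    # Only write if no version has been recorded yet, so repeated runs
    # (or several racing writers) never append duplicate entries.
    if [[ ! -f "$VERSION_FILE" ]]; then
        echo "1.0.0" > "$VERSION_FILE"
    fi
}

write_version
write_version   # second call is a no-op
cat "$VERSION_FILE"
```

Because the file lives outside the submodule's Python sources, the checkout stays clean and `git status` no longer reports a dirty `__init__.py`.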
Thoughts on making the top-level folder name generic, like `tools`, inside which we have `launcher`?
```yaml
script: services/megatron-lm/quantize/quantize.sh
args:
  - --calib-dataset-path-or-name /hf-local/abisee/cnn_dailymail
```
Can we mention in the README what `hf-local` is? Or, for simplicity, we can just make it the HF path `abisee/cnn_dailymail`.
```bash
# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
    echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
```
Why do we need to modify the ModelOpt version number?
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
♻️ Duplicate comments (4)
launcher/core.py (4)
432-455: ⚠️ Potential issue | 🟠 Major
Preserve task-level env overrides when applying defaults.
On Line 443 and Line 455, defaults overwrite task-specific values. Reverse the merge precedence so YAML task env wins.
Suggested change
```diff
-    task_env.update(default_local_env)
+    task_env = {**default_local_env, **task_env}
@@
-    task_env.update(default_slurm_env)
+    task_env = {**default_slurm_env, **task_env}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 432 - 455, The current merges call task_env.update(default_local_env) / task_env.update(default_slurm_env) which lets defaults overwrite task-specific YAML values; instead, apply defaults first then layer task-level env on top so task YAML wins—e.g., obtain the appropriate default dict for the branch (for docker: default_local_env; for slurm: default_slurm_env), copy/merge defaults first and then update with task_env (or update a copy of defaults with task_env) before passing to build_docker_executor / build_slurm_executor; target the hf_local branch and the calls around build_docker_executor, build_slurm_executor and the task_env.update(...) lines.
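The precedence rule is just dict-merge ordering in Python: later entries win, so putting the defaults first keeps the task YAML values. A toy illustration with made-up keys:

```python
# Illustrative values only; the real dicts come from get_default_env()
# and the task's YAML.
default_slurm_env = {"HF_HOME": "/hf-local", "LOG_LEVEL": "INFO"}
task_env = {"LOG_LEVEL": "DEBUG"}  # task-specific override from YAML

# Defaults form the base; task entries layered on top win on collisions.
merged = {**default_slurm_env, **task_env}
```

With `task_env.update(default_slurm_env)` the merge runs the other way and the default `LOG_LEVEL` would clobber the task's `DEBUG`.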
482-485: ⚠️ Potential issue | 🟠 Major
Write metadata under the configured NeMo Run home, not CWD-relative.
`metadata.json` is currently written to a relative path on Line 482, which can diverge from the configured `NEMORUN_HOME` set by the launcher.
Suggested change

```diff
 def run_jobs(
     job_table,
@@
     modelopt_src_path=None,
     base_dir=None,
+    nemorun_home=None,
 ):
@@
-    metadata_path = os.path.join("experiments", experiment_title, exp._id, "metadata.json")
+    root = nemorun_home or os.environ.get("NEMORUN_HOME", os.getcwd())
+    metadata_path = os.path.join(root, "experiments", experiment_title, exp._id, "metadata.json")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 482 - 485, The code writes metadata.json to a CWD-relative path using metadata_path built from "experiments", experiment_title and exp._id; change it to write under the configured NeMo Run home instead—build metadata_path by joining the launcher’s configured NeMo Run home variable (e.g., NEMORUN_HOME or the instance/home attribute used by the launcher) with "experiments", experiment_title and exp._id, ensure os.makedirs uses os.path.dirname(metadata_path) and then json.dump metadata to that path so files are created under the configured NeMo Run home rather than the current working directory.
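A small sketch of the path resolution described above (helper name and fallback order are assumptions; the real code inlines this in `run_jobs`):

```python
import os


def metadata_path_for(experiment_title, exp_id, nemorun_home=None):
    """Resolve metadata.json under the configured NeMo Run home.

    Precedence: explicit argument, then $NEMORUN_HOME, then the CWD,
    so the file always lands next to the experiment's other artifacts.
    """
    root = nemorun_home or os.environ.get("NEMORUN_HOME", os.getcwd())
    return os.path.join(root, "experiments", experiment_title, exp_id, "metadata.json")
```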
436-450: ⚠️ Potential issue | 🟠 Major
Avoid relying on the `run.Experiment._id` private API.
Using `_id` couples this code to an internal nemo_run attribute and risks breakage on dependency updates.
In the currently used nemo_run version, what is the supported public API (if any) for retrieving an Experiment identifier after creation, and is `_id` considered internal-only?
Also applies to: 477-482
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 436 - 450, Replace uses of the private attribute run.Experiment._id with the official public Experiment identifier accessor from the nemo_run API (for example exp.id or exp.get_id(), whichever the library exposes) everywhere it's passed (e.g., to build_slurm_executor and the local executor call sites referenced around build_slurm_executor and the earlier local executor block). Locate all occurrences of exp._id (including the other block noted) and switch them to the public accessor; if the library requires calling a method at creation time, capture and reuse that returned id instead of reading a private field. Ensure the code compiles and tests that rely on the experiment id are updated to use the new accessor.
321-332: ⚠️ Potential issue | 🟠 Major
Remove Bandit bypass comments from subprocess calls.
`# nosec` on Line 321, Line 324, and Line 331 violates repo policy even when the subprocess usage is otherwise safe.
Suggested change

```diff
-    import subprocess  # nosec B404
+    import subprocess
@@
-    commit = subprocess.run(  # nosec B603 B607
+    commit = subprocess.run(
@@
-    branch = subprocess.run(  # nosec B603 B607
+    branch = subprocess.run(
```

As per coding guidelines, "Disallow use of '# nosec' comments to bypass Bandit security checks."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 321 - 332, Remove the "# nosec" comments on the subprocess import and the subprocess.run calls (the lines that set commit and branch), and instead make the calls explicitly safe: keep the argument lists (["git", "rev-parse", ...]) to avoid shell=True, add check=True to both subprocess.run calls and handle subprocess.CalledProcessError (or subprocess.SubprocessError) to log/raise a clear error on failure, and optionally validate git exists with shutil.which("git") before invoking; update the code around variables commit and branch to use this checked/run-and-handle pattern so Bandit suppression comments are no longer needed.
🧹 Nitpick comments (1)
launcher/core.py (1)
133-135: Fail fast with a clear error for unknown/missing YAML factory names.
Raw `KeyError` from `factory_lookup[function_name]` is hard to diagnose for config authors.
Suggested change

```diff
-    function_name = config_from_yaml["slurm_config"].pop("_factory_")
-    slurm_config = factory_lookup[function_name](**config_from_yaml["slurm_config"])
+    function_name = config_from_yaml["slurm_config"].pop("_factory_", None)
+    if not function_name:
+        raise ValueError(f"Missing '_factory_' in slurm_config for YAML: {yaml_file}")
+    try:
+        factory = factory_lookup[function_name]
+    except KeyError as exc:
+        raise ValueError(f"Unsupported slurm_config factory '{function_name}' in {yaml_file}") from exc
+    slurm_config = factory(**config_from_yaml["slurm_config"])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/core.py` around lines 133 - 135, The code currently assumes config_from_yaml["slurm_config"]["_factory_"] exists and that factory_lookup[function_name] will succeed, which yields an opaque KeyError for config authors; update the block around function_name, factory_lookup, and slurm_config to explicitly check for the "_factory_" key on config_from_yaml["slurm_config"] and that function_name is present in factory_lookup, and if not raise a clear ValueError (or custom exception) that states the missing or unknown factory name and lists available keys from factory_lookup so the user can correct their YAML; keep subsequent behavior of calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and preserving args = config_from_yaml.get("args", None).
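The fail-fast lookup can also be sketched as a standalone helper (function name and error wording are illustrative, not the PR's code):

```python
def resolve_factory(slurm_config, factory_lookup, yaml_file):
    """Return the factory named by '_factory_', with clear errors.

    Raises ValueError instead of a bare KeyError when the key is
    missing or names an unregistered factory.
    """
    function_name = slurm_config.pop("_factory_", None)
    if not function_name:
        raise ValueError(f"Missing '_factory_' in slurm_config for YAML: {yaml_file}")
    try:
        return factory_lookup[function_name]
    except KeyError as exc:
        known = ", ".join(sorted(factory_lookup))
        raise ValueError(
            f"Unsupported slurm_config factory '{function_name}' in {yaml_file}; "
            f"known factories: {known}"
        ) from exc
```

Listing the registered factory names in the message lets a config author fix a typo without reading the launcher source.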
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@launcher/core.py`:
- Around line 432-455: The current merges call
task_env.update(default_local_env) / task_env.update(default_slurm_env) which
lets defaults overwrite task-specific YAML values; instead, apply defaults first
then layer task-level env on top so task YAML wins—e.g., obtain the appropriate
default dict for the branch (for docker: default_local_env; for slurm:
default_slurm_env), copy/merge defaults first and then update with task_env (or
update a copy of defaults with task_env) before passing to build_docker_executor
/ build_slurm_executor; target the hf_local branch and the calls around
build_docker_executor, build_slurm_executor and the task_env.update(...) lines.
- Around line 482-485: The code writes metadata.json to a CWD-relative path
using metadata_path built from "experiments", experiment_title and exp._id;
change it to write under the configured NeMo Run home instead—build
metadata_path by joining the launcher’s configured NeMo Run home variable (e.g.,
NEMORUN_HOME or the instance/home attribute used by the launcher) with
"experiments", experiment_title and exp._id, ensure os.makedirs uses
os.path.dirname(metadata_path) and then json.dump metadata to that path so files
are created under the configured NeMo Run home rather than the current working
directory.
- Around line 436-450: Replace uses of the private attribute run.Experiment._id
with the official public Experiment identifier accessor from the nemo_run API
(for example exp.id or exp.get_id(), whichever the library exposes) everywhere
it's passed (e.g., to build_slurm_executor and the local executor call sites
referenced around build_slurm_executor and the earlier local executor block).
Locate all occurrences of exp._id (including the other block noted) and switch
them to the public accessor; if the library requires calling a method at
creation time, capture and reuse that returned id instead of reading a private
field. Ensure the code compiles and tests that rely on the experiment id are
updated to use the new accessor.
- Around line 321-332: Remove the "# nosec" comments on the subprocess import
and the subprocess.run calls (the lines that set commit and branch), and instead
make the calls explicitly safe: keep the argument lists (["git", "rev-parse",
...]) to avoid shell=True, add check=True to both subprocess.run calls and
handle subprocess.CalledProcessError (or subprocess.SubprocessError) to
log/raise a clear error on failure, and optionally validate git exists with
shutil.which("git") before invoking; update the code around variables commit and
branch to use this checked/run-and-handle pattern so Bandit suppression comments
are no longer needed.
---
Nitpick comments:
In `@launcher/core.py`:
- Around line 133-135: The code currently assumes
config_from_yaml["slurm_config"]["_factory_"] exists and that
factory_lookup[function_name] will succeed, which yields an opaque KeyError for
config authors; update the block around function_name, factory_lookup, and
slurm_config to explicitly check for the "_factory_" key on
config_from_yaml["slurm_config"] and that function_name is present in
factory_lookup, and if not raise a clear ValueError (or custom exception) that
states the missing or unknown factory name and lists available keys from
factory_lookup so the user can correct their YAML; keep subsequent behavior of
calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and
preserving args = config_from_yaml.get("args", None).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ce5b1f59-c0be-4a83-9853-3cff952f9271
📒 Files selected for processing (2)
launcher/core.py
launcher/launch.py
launch.py now only accepts pipeline=@ or --yaml. Update README with --yaml vs pipeline=@ docs, useful flags, and bug reporting. Update Qwen3-8B config to new --yaml format with job_name + pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
♻️ Duplicate comments (2)
launcher/launch.py (2)
50-50: ⚠️ Potential issue | 🟠 Major
Default `HF_HOME` still appears inconsistent with `/hf-local` mounts.
`get_default_env(EXPERIMENT_TITLE)` yields `HF_HOME=/<title>/hf-cache`, while the Slurm factory mount target is `/hf-local`. That means cache reuse can be missed unless every task overrides `HF_HOME`.
🔧 Proposed fix

```diff
 DEFAULT_SLURM_ENV, DEFAULT_LOCAL_ENV = get_default_env(EXPERIMENT_TITLE)
+DEFAULT_SLURM_ENV["HF_HOME"] = "/hf-local"
+DEFAULT_LOCAL_ENV["HF_HOME"] = "/hf-local"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` at line 50, The default HF_HOME produced by get_default_env(EXPERIMENT_TITLE) is set to /<title>/hf-cache which does not match the Slurm factory mount target /hf-local, causing cache reuse to be missed; update get_default_env (or the DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV initialization) to set HF_HOME to the mounted path /hf-local (or derive HF_HOME from the Slurm mount constant) so that DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV use the same HF_HOME as the Slurm mounts and tasks inherit the consistent cache location.
52-62: ⚠️ Potential issue | 🟠 Major
`services/*` packaging may resolve from the wrong base directory.
`include_pattern` includes `services/*`, but `relative_path` is fully anchored to `LAUNCHER_DIR`. If `services/` is at repo root, this silently excludes it from packaging.
🔧 Proposed fix

```diff
 packager = run.PatternPackager(
@@
-    relative_path=[LAUNCHER_DIR] * 6,
+    relative_path=[LAUNCHER_DIR] * 5 + [MODELOPT_ROOT],
 )
```

```shell
#!/bin/bash
set -euo pipefail
echo "== services directories in repository =="
fd -HI -td '^services$' .
echo
echo "== Does launcher/services exist? =="
if [ -d launcher/services ]; then
  echo "launcher/services exists"
else
  echo "launcher/services missing"
fi
```
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` around lines 52 - 62, The packaging include_pattern lists "services/*" while relative_path is set to [LAUNCHER_DIR] * 6, which will exclude repo-root services if they live outside LAUNCHER_DIR; update the run.PatternPackager invocation so the relative_path entry that pairs with the "services/*" pattern points to the correct base (e.g. repo root) instead of LAUNCHER_DIR — locate the run.PatternPackager call and change the corresponding element in relative_path (or compute relative_path dynamically) so the "services/*" pattern resolves from the actual services directory.
🧹 Nitpick comments (2)
launcher/launch.py (2)
74-76: Avoid unconditionally overriding caller-provided `job_dir` in local mode.
Line 87-Line 89 always rewrites `job_dir` when `hf_local` is set, which drops explicit user input.
🔧 Proposed refactor

```diff
 def launch(
     job_name: str = "01_job",
-    job_dir: str = os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")),
+    job_dir: str | None = None,
@@
 ) -> None:
@@
-    if hf_local is not None:
-        job_dir = os.path.join(os.getcwd(), "local_experiments")
+    if hf_local is not None:
+        job_dir = job_dir or os.path.join(os.getcwd(), "local_experiments")
+    else:
+        job_dir = job_dir or os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments"))
```

Also applies to: 87-89
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` around lines 74 - 76, The current behavior unconditionally replaces the caller-provided job_dir when hf_local is true; change this so job_dir is only auto-set when the caller didn't provide one: make job_dir default to None in the function signature (job_dir: Optional[str] = None) and initialize it from os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")) only if job_dir is None, and likewise in the hf_local branch update job_dir only when job_dir is None or empty; update references to the job_dir variable accordingly (look for the job_dir parameter and the hf_local handling block).
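The "explicit value wins" rule from the refactor above can be sketched as a small helper (hypothetical name; the real `launch` inlines this logic):

```python
import os


def resolve_job_dir(job_dir=None, hf_local=None):
    """Pick a job directory: an explicit job_dir always wins.

    Otherwise fall back to a mode-appropriate default: a CWD-local
    directory in local/Docker mode, $SLURM_JOB_DIR (or ~/experiments)
    in Slurm mode.
    """
    if job_dir:
        return job_dir
    if hf_local is not None:
        return os.path.join(os.getcwd(), "local_experiments")
    return os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments"))
```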
36-37: Consider using absolute imports for better package reusability.
Lines 36-37 use relative imports (`core`, `slurm_config`), which work for the documented script usage (`uv run launch.py`), but are fragile if the code is ever imported as a module or if entry points are added to `pyproject.toml`. Since `launcher` is a proper Python package with package infrastructure, absolute imports with a fallback improve maintainability.
The proposed try-except approach is reasonable:
🔧 Suggested approach
```diff
-from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
-from slurm_config import SlurmConfig, slurm_factory
+try:
+    from launcher.core import (
+        SandboxPipeline,
+        get_default_env,
+        register_factory,
+        run_jobs,
+        set_slurm_config_type,
+    )
+    from launcher.slurm_config import SlurmConfig, slurm_factory
+except ImportError:
+    # Script-mode fallback (e.g., running from launcher/ directly)
+    from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
+    from slurm_config import SlurmConfig, slurm_factory
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@launcher/launch.py` around lines 36 - 37, Replace the fragile relative imports in launch.py with absolute-package imports and a fallback to the relative names: attempt to import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type from the top-level package (e.g., launcher.core or package.core) and SlurmConfig, slurm_factory from the package slurm_config, and if that fails in an ImportError, fall back to the current relative imports; locate the import line that currently references core and slurm_config and update it to a try/except ImportError that assigns the same symbols (SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the file keep working regardless of whether the module is run as a script or imported as a package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@launcher/launch.py`:
- Line 50: The default HF_HOME produced by get_default_env(EXPERIMENT_TITLE) is
set to /<title>/hf-cache which does not match the Slurm factory mount target
/hf-local, causing cache reuse to be missed; update get_default_env (or the
DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV initialization) to set HF_HOME to the
mounted path /hf-local (or derive HF_HOME from the Slurm mount constant) so that
DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV use the same HF_HOME as the Slurm mounts
and tasks inherit the consistent cache location.
- Around line 52-62: The packaging include_pattern lists "services/*" while
relative_path is set to [LAUNCHER_DIR] * 6, which will exclude repo-root
services if they live outside LAUNCHER_DIR; update the run.PatternPackager
invocation so the relative_path entry that pairs with the "services/*" pattern
points to the correct base (e.g. repo root) instead of LAUNCHER_DIR — locate the
run.PatternPackager call and change the corresponding element in relative_path
(or compute relative_path dynamically) so the "services/*" pattern resolves from
the actual services directory.
---
Nitpick comments:
In `@launcher/launch.py`:
- Around line 74-76: The current behavior unconditionally replaces the
caller-provided job_dir when hf_local is true; change this so job_dir is only
auto-set when the caller didn't provide one: make job_dir default to None in the
function signature (job_dir: Optional[str] = None) and initialize it from
os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")) only if
job_dir is None, and likewise in the hf_local branch update job_dir only when
job_dir is None or empty; update references to the job_dir variable accordingly
(look for the job_dir parameter and the hf_local handling block).
- Around line 36-37: Replace the fragile relative imports in launch.py with
absolute-package imports and a fallback to the relative names: attempt to import
SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type from the top-level package (e.g., launcher.core or
package.core) and SlurmConfig, slurm_factory from the package slurm_config, and
if that fails in an ImportError, fall back to the current relative imports;
locate the import line that currently references core and slurm_config and
update it to a try/except ImportError that assigns the same symbols
(SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the
file keep working regardless of whether the module is run as a script or
imported as a package.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 07f18aab-eab7-4f3b-812c-8122649ca802
📒 Files selected for processing (3)
launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
launcher/README.md
launcher/launch.py
🚧 Files skipped from review as they are similar to previous changes (2)
- launcher/README.md
- launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
Add launcher/ module with launch.py that submits quantization, training, and evaluation jobs to Slurm clusters via nemo-run. Produces identical code/ layout as nmm-sandbox's slurm.py so the same YAML configs work in both. Includes Megatron-LM and Model-Optimizer as submodules.
What does this PR do?
Type of change: ?
Usage
```
# Add a code snippet demonstrating how to use this
```
Testing
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit
New Features
Documentation
Chores