
Conversation


@adil-a adil-a commented Nov 4, 2025

What does this PR do ?

Adds GPT-OSS SFT using AutoModel custom models + DeepEP.

To run, launch the nightly container and execute:

NRL_FORCE_REBUILD_VENVS=true uv run examples/run_sft.py --config examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml cluster.gpus_per_node=8 logger.wandb_enabled=false

adil-a and others added 17 commits November 1, 2025 17:23
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
adil-a changed the title from "Hemil/automodel moe" to "feat: DTensorPolicyV2 GPT-OSS support" on Nov 5, 2025
Signed-off-by: adil-a <[email protected]>
@adil-a adil-a marked this pull request as ready for review November 5, 2025 06:50
@adil-a adil-a requested review from a team as code owners November 5, 2025 06:50

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: e936ebf (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: e936ebf (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 7df0cc5 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7df0cc5 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: b4139f1 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: a55a2f1 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: a55a2f1 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


coderabbitai bot commented Nov 5, 2025

📝 Walkthrough

The pull request updates the Automodel submodule pointer, integrates Automodel Checkpointer into DTensorPolicyWorkerV2 for improved checkpoint handling, adds a comprehensive SFT training configuration example, removes the legacy automodel_checkpoint utility module, updates package dependencies (grouped_gemm, transformer-engine, deep_ep), and enhances venv setup with CUDA architecture configuration.
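
As a rough illustration of the integration described above, the sketch below shows how checkpoint-format detection and a lazily created Checkpointer could fit together. The helper names (detect_checkpoint_format, _ensure_checkpointer) are taken from the changes listed below; their bodies, the file-layout heuristics, and the _StubCheckpointer stand-in for nemo_automodel's Checkpointer are assumptions for illustration only, not the PR's actual code.

# Illustrative sketch only; see the caveats above.
from pathlib import Path
from typing import Optional


def detect_checkpoint_format(path: str) -> str:
    """Guess whether a checkpoint directory holds HF safetensors or a DCP-style shard layout."""
    root = Path(path)
    if any(root.glob("*.safetensors")):
        return "safetensors"
    if any(root.glob("__*_*.distcp")) or (root / ".metadata").exists():
        return "dcp"
    return "unknown"


class _StubCheckpointer:
    """Stand-in for nemo_automodel's Checkpointer; the real constructor arguments may differ."""

    def __init__(self, checkpoint_dir: str, fmt: str):
        self.checkpoint_dir = checkpoint_dir
        self.fmt = fmt

    def load(self, model):
        # The real Checkpointer would restore (possibly sharded) weights into `model` here.
        print(f"loading {self.fmt} checkpoint from {self.checkpoint_dir}")


class PolicyWorkerSketch:
    """Mirrors only the lazy-Checkpointer idea, not DTensorPolicyWorkerV2 itself."""

    def __init__(self) -> None:
        self._checkpointer: Optional[_StubCheckpointer] = None

    def _ensure_checkpointer(self, path: str) -> _StubCheckpointer:
        # Create the Checkpointer once and reuse it for subsequent save/load calls.
        if self._checkpointer is None:
            self._checkpointer = _StubCheckpointer(path, detect_checkpoint_format(path))
        return self._checkpointer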

Changes

Cohort / File(s) / Summary

  • Automodel Submodule & Configuration (3rdparty/Automodel-workspace/Automodel, examples/configs/sft_automodel.yaml):
    Submodule pointer updated to commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7. New comprehensive YAML configuration for SFT training with detailed sections for model scale (gpt-oss-20b), mixed-precision settings, distributed training, optimizer/scheduler, and logging backends (wandb, tensorboard, mlflow).
  • Checkpoint Management Integration (nemo_rl/models/policy/dtensor_policy_worker_v2.py):
    Replaced direct state-dict loading with the Automodel Checkpointer workflow. Added checkpoint format detection (detect_checkpoint_format, _infer_checkpoint_root), persistent Checkpointer management (_ensure_checkpointer), and updated save/load paths. New imports from nemo_automodel (Checkpointer, FSDP2Manager, moe_parallelize_model). Integrated MoE/adapter model parallelization and distributed gradient utilities.
  • Checkpoint Utility Migration (nemo_rl/utils/automodel_checkpoint.py):
    Entire file removed; functionality delegated to the integrated Checkpointer in DTensorPolicyWorkerV2 and Automodel APIs.
  • Import Path Updates (nemo_rl/models/policy/utils.py):
    Import path for NeMo Automodel classes updated from nemo_automodel.components._transformers.auto_model to nemo_automodel._transformers.auto_model.
  • Infrastructure & Dependencies (nemo_rl/utils/venvs.py, pyproject.toml):
    venvs.py: added explicit shutil.rmtree cleanup and TORCH_CUDA_ARCH_LIST environment setup ("9.0 10.0 12.0"). pyproject.toml: added grouped_gemm and transformer-engine[pytorch] (2.8.0), updated the deep_ep commit hash, extended override-dependencies (opencv-python-headless>=4.11.0), and modified build-isolation mappings (causal-conv1d, grouped-gemm).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DTensorPolicyWorkerV2
    participant Checkpointer
    participant FSDP2Manager
    participant HFModel
    participant Cache

    User->>DTensorPolicyWorkerV2: load_checkpoint(path, ...)
    DTensorPolicyWorkerV2->>DTensorPolicyWorkerV2: detect_checkpoint_format(path)
    DTensorPolicyWorkerV2->>DTensorPolicyWorkerV2: _infer_checkpoint_root(path)
    DTensorPolicyWorkerV2->>HFModel: initialize empty model
    DTensorPolicyWorkerV2->>FSDP2Manager: parallelize model (FSDP2)
    DTensorPolicyWorkerV2->>DTensorPolicyWorkerV2: _ensure_checkpointer(config)
    DTensorPolicyWorkerV2->>Checkpointer: load via Automodel API
    Checkpointer->>Cache: fetch or download weights
    Checkpointer->>HFModel: apply loaded state dict
    HFModel-->>DTensorPolicyWorkerV2: model ready
    DTensorPolicyWorkerV2-->>User: checkpoint loaded

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py: Substantial refactoring with new Checkpointer integration, format detection logic, FSDP2Manager usage, and MOE parallelization paths; requires careful validation of state-dict handling and rank-distributed checkpoint flow.
  • File deletion (nemo_rl/utils/automodel_checkpoint.py): Verify that all call sites have been migrated to the new Checkpointer-based approach in DTensorPolicyWorkerV2.
  • Import path changes: Confirm that the updated import path in utils.py and new imports in dtensor_policy_worker_v2.py align with Automodel's actual structure.
  • pyproject.toml dependency updates: Validate that new build-isolation mappings (grouped-gemm, causal-conv1d) and pinned versions are compatible with existing toolchain.

Possibly related PRs

Suggested labels

CI:L1, CI

Suggested reviewers

  • terrykong
  • yaoyu-33

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning
    Explanation: PR contains major changes (checkpoint system refactoring, utility module removal, new format detection), but the PR description lacks test results documentation, all testing checklist items remain unchecked, and review comments indicate critical untested issues.
    Resolution: Document test results (unit/integration tests, performance comparisons, convergence validation), fix the critical review issues, and check all testing checklist items with evidence of local test execution.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title "feat: DTensorPolicyV2 GPT-OSS support" directly corresponds to the primary change: adding GPT-OSS model support to DTensorPolicyWorkerV2 through Automodel integration.
  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8762f57 and 7df0cc5.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • 3rdparty/Automodel-workspace/Automodel (1 hunks)
  • examples/configs/sft_automodel.yaml (1 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py (20 hunks)
  • nemo_rl/models/policy/utils.py (1 hunks)
  • nemo_rl/utils/automodel_checkpoint.py (0 hunks)
  • nemo_rl/utils/venvs.py (2 hunks)
  • pyproject.toml (4 hunks)
💤 Files with no reviewable changes (1)
  • nemo_rl/utils/automodel_checkpoint.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/utils.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/utils/venvs.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/utils.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/utils/venvs.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/sft_automodel.yaml
🧠 Learnings (4)
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults

Applied to files:

  • nemo_rl/models/policy/utils.py
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.

Applied to files:

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to **/*.sh : Use `uv run` to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling `python` directly

Applied to files:

  • nemo_rl/utils/venvs.py
🧬 Code graph analysis (1)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (3)
nemo_rl/utils/checkpoint.py (1)
  • CheckpointingConfig (36-67)
nemo_rl/models/policy/dtensor_policy_worker.py (3)
  • create_context_parallel_ctx (456-480)
  • get_cpu_state_dict (104-134)
  • load_checkpoint (1920-1930)
nemo_rl/models/dtensor/parallelize.py (1)
  • to_local_if_dtensor (709-715)
🪛 GitHub Actions: Automodel Integration and Submodule Checks
3rdparty/Automodel-workspace/Automodel

[error] 1-1: One or more submodules are not fast-forwarded. Automodel: Commits have DIVERGED from a common ancestor. Please ensure submodule commits are fast-forwards of the main branch.

🪛 Ruff (0.14.3)
nemo_rl/models/policy/dtensor_policy_worker_v2.py

2105-2105: Loop control variable root not used within loop body

Rename unused root to _root

(B007)


2105-2105: Loop control variable dirs not used within loop body

Rename unused dirs to _dirs

(B007)

🔇 Additional comments (1)
nemo_rl/models/policy/utils.py (1)

32-32: Verify the new import path is valid in the updated nemo_automodel submodule.

The import change appears scoped and intentional—only _transformers is being reorganized out of .components, while other submodules remain unchanged. The try-except block provides appropriate fallback handling, and the imported classes are actively used in AUTOMODEL_FACTORY and resolve_model_class().

However, I could not locate test files confirming the new import path works. Please verify:

  1. The new path (nemo_automodel._transformers.auto_model) exists and is importable in the updated nemo_automodel submodule version
  2. Tests pass with this change, or if untested, manually confirm the import succeeds at runtime
  3. NEMO_AUTOMODEL_AVAILABLE resolves to True in your environment after this change
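
For reference, the comment above describes an import-fallback of roughly this shape (a sketch only: the class name NeMoAutoModelForCausalLM is a placeholder, and the exact structure in nemo_rl/models/policy/utils.py may differ):

try:
    # New layout: _transformers promoted out of .components
    from nemo_automodel._transformers.auto_model import NeMoAutoModelForCausalLM
    NEMO_AUTOMODEL_AVAILABLE = True
except ImportError:
    try:
        # Older layout, kept as a fallback for previous submodule revisions
        from nemo_automodel.components._transformers.auto_model import NeMoAutoModelForCausalLM
        NEMO_AUTOMODEL_AVAILABLE = True
    except ImportError:
        # nemo_automodel is unavailable; callers are expected to check this flag before use
        NeMoAutoModelForCausalLM = None
        NEMO_AUTOMODEL_AVAILABLE = False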

@@ -1 +1 @@
-Subproject commit a2db048383cd54b3fafc928df4c30bf7bbf7c430
+Subproject commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical: Submodule commit divergence blocks integration.

The pipeline reports that the Automodel submodule commits have diverged from a common ancestor and are not fast-forwarded. This is a blocker that must be resolved before merging.

Submodule divergence can:

  • Break reproducibility and introduce non-deterministic behavior
  • Cause merge conflicts and complicate future submodule updates
  • Obscure commit history and make debugging harder

The AI summary indicates significant integration work (Automodel Checkpointer integration in DTensorPolicyWorkerV2, removal of legacy automodel_checkpoint utility). This context makes it critical to understand what changes in the Automodel submodule underpin these refactors and ensure the submodule update is intentional and correct.

Please clarify and resolve the divergence:

  1. Understand the divergence: Run the verification script below to inspect the commit history and identify which commits diverged and why.
  2. Align the branch: Consider whether the divergence should be resolved by:
    • Rebasing the Automodel branch onto main to make it a proper fast-forward
    • Explicitly merging main into the Automodel branch if concurrent work requires it
    • Selecting specific commits via cherry-pick if only certain commits are needed (with clear justification in the PR description)
  3. Document the strategy: Update the PR description to explain why this submodule update was chosen and how it relates to the DTensorPolicyWorkerV2 and Checkpointer integration work.

🏁 Script executed:

#!/bin/bash
# Description: Inspect the Automodel submodule divergence to understand the commit history.

echo "=== Current submodule pointer ==="
cat 3rdparty/Automodel-workspace/Automodel

echo -e "\n=== Submodule remote info ==="
cd 3rdparty/Automodel-workspace/Automodel
git log --oneline -20
git log --oneline --all --graph | head -30
cd - > /dev/null

echo -e "\n=== Check if current commit is reachable from main ==="
cd 3rdparty/Automodel-workspace/Automodel
if git merge-base --is-ancestor 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 origin/main; then
  echo "Current commit IS reachable from origin/main (fast-forward possible)"
else
  echo "Current commit is NOT reachable from origin/main (divergence exists)"
  echo "Commits in current branch not in main:"
  git log origin/main..5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 --oneline
  echo -e "\nCommits in main not in current branch:"
  git log 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7..origin/main --oneline | head -20
fi

Length of output: 2733


I need to investigate further. The script revealed a critical issue: the target commit doesn't exist in the remote repository. Let me verify the actual state of the submodule.


🏁 Script executed:

#!/bin/bash
# Verify the actual submodule state and check if the target commit exists locally

cd 3rdparty/Automodel-workspace/Automodel

echo "=== Current HEAD ==="
git rev-parse HEAD
git log -1 --oneline

echo -e "\n=== Check if target commit exists in any local ref ==="
if git cat-file -t 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 2>/dev/null; then
  echo "Target commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 EXISTS locally"
  git log -1 --oneline 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7
else
  echo "Target commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 DOES NOT exist locally"
fi

echo -e "\n=== Check if old commit exists ==="
if git cat-file -t a2db048383cd54b3fafc928df4c30bf7bbf7c430 2>/dev/null; then
  echo "Old commit a2db048383cd54b3fafc928df4c30bf7bbf7c430 EXISTS locally"
  git log -1 --oneline a2db048383cd54b3fafc928df4c30bf7bbf7c430
else
  echo "Old commit a2db048383cd54b3fafc928df4c30bf7bbf7c430 DOES NOT exist locally"
fi

echo -e "\n=== List all local refs for both commits ==="
git show-ref | grep -E "5e995e9535e63cbe3358dc2bd81a8ed3a696cee7|a2db048383cd54b3fafc928df4c30bf7bbf7c430" || echo "No matching refs found"

echo -e "\n=== Check .git/config for remote URL ==="
git config --get remote.origin.url

Length of output: 471


Critical: Submodule pointer references non-existent commit—PR is un-mergeable.

The target commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 does not exist in the Automodel repository. This is not a divergence issue—the commit is simply missing and cannot be resolved.

Current state:

  • Submodule HEAD: 7df0cc5 ("uv lock")
  • PR target: 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 (does not exist)
  • Remote: https://github.com/NVIDIA-NeMo/RL.git (correct)

This commit pointer must be corrected or regenerated. Options:

  1. Update the pointer to an existing commit in the Automodel repository (e.g., 7df0cc5 if that's the intended state, or another valid commit)
  2. Push the intended commit to the Automodel remote and then reference it
  3. Force-push to update the submodule to the correct branch/commit

Without a valid commit pointer, checkout and CI will fail.

🧰 Tools
🪛 GitHub Actions: Automodel Integration and Submodule Checks

[error] 1-1: One or more submodules are not fast-forwarded. Automodel: Commits have DIVERGED from a common ancestor. Please ensure submodule commits are fast-forwards of the main branch.

🤖 Prompt for AI Agents
In 3rdparty/Automodel-workspace/Automodel around lines 1 to 1, the submodule
HEAD points to a non-existent commit (5e995e9535e63cbe3358dc2bd81a8ed3a696cee7)
which makes the PR unmergeable; fix it by updating the submodule ref to a valid
commit (for example change to the existing 7df0cc5 or another confirmed commit
SHA), or alternatively push the missing commit to the Automodel remote and
update the submodule to reference that new commit, then commit the updated
.gitmodules/.git link and ensure CI/checkout succeeds.

Comment on lines 423 to 428

         self.optimizer = optimizer_cls(
-            self.model.parameters(), **self.cfg["optimizer"]["kwargs"]
+            self.model.parameters(),
+            **self.cfg["optimizer"]["kwargs"],
+            exp_avg_dtype=torch.bfloat16,
+            exp_avg_sq_dtype=torch.bfloat16,
         )

⚠️ Potential issue | 🔴 Critical

Don’t inject Transformer-Engine-only kwargs into arbitrary optimizers.

Lines 423-428 always pass exp_avg_dtype/exp_avg_sq_dtype. That works for TE’s FusedAdam, but any existing config that still points to torch.optim.AdamW (or similar) will now throw TypeError: __init__() got an unexpected keyword argument 'exp_avg_dtype'. Please gate those kwargs behind a TE-specific check so the generic optimizer path keeps working.

@@
-        self.optimizer = optimizer_cls(
-            self.model.parameters(),
-            **self.cfg["optimizer"]["kwargs"],
-            exp_avg_dtype=torch.bfloat16,
-            exp_avg_sq_dtype=torch.bfloat16,
-        )
+        optimizer_kwargs = dict(self.cfg["optimizer"]["kwargs"])
+        if optimizer_cls.__module__.startswith(
+            "transformer_engine.pytorch.optimizers"
+        ):
+            optimizer_kwargs.setdefault("exp_avg_dtype", torch.bfloat16)
+            optimizer_kwargs.setdefault("exp_avg_sq_dtype", torch.bfloat16)
+
+        self.optimizer = optimizer_cls(
+            self.model.parameters(),
+            **optimizer_kwargs,
+        )
🤖 Prompt for AI Agents
In nemo_rl/models/policy/dtensor_policy_worker_v2.py around lines 423-428, avoid
passing Transformer-Engine-specific keyword args to arbitrary optimizers; detect
whether optimizer_cls is the TE fused optimizer (e.g., by checking the
class/module name or a TE-specific attribute) and only include exp_avg_dtype and
exp_avg_sq_dtype when that check is true; for non-TE optimizers call
optimizer_cls(self.model.parameters(), **self.cfg["optimizer"]["kwargs"]) so
existing torch.optim.* classes don’t receive unexpected kwargs.

Comment on lines +92 to +94
if "TORCH_CUDA_ARCH_LIST" not in env:
env["TORCH_CUDA_ARCH_LIST"] = "9.0 10.0 12.0"


⚠️ Potential issue | 🟠 Major

Avoid Hopper-only default for TORCH_CUDA_ARCH_LIST.

Line 92 forces TORCH_CUDA_ARCH_LIST to "9.0 10.0 12.0" whenever the variable is unset. That drops support for Ampere (sm80/sm86), which is still a common deployment target; any native extensions built in this venv (grouped_gemm, DeepEP, etc.) will then miss sm80 kernels and crash at runtime with “no kernel image is available for execution on the device.” Please include Ampere in the default (or derive the list from an override/env hint) so older clusters don’t regress.

@@
-    if "TORCH_CUDA_ARCH_LIST" not in env:
-        env["TORCH_CUDA_ARCH_LIST"] = "9.0 10.0 12.0"
+    if "TORCH_CUDA_ARCH_LIST" not in env:
+        env[
+            "TORCH_CUDA_ARCH_LIST"
+        ] = os.environ.get(
+            "NEMO_RL_TORCH_CUDA_ARCH_LIST",
+            "8.0 8.6 9.0 10.0",
+        )
🤖 Prompt for AI Agents
In nemo_rl/utils/venvs.py around lines 92 to 94, the default
TORCH_CUDA_ARCH_LIST currently sets only Hopper archs ("9.0 10.0 12.0"), which
omits Ampere (sm80/sm86) and causes missing-kernel runtime crashes; update the
logic so the default includes Ampere (e.g., include sm80 and/or sm86) or derive
the list from an env override/hint (check CUDA_VISIBLE_DEVICES or a provided
override var) and set TORCH_CUDA_ARCH_LIST to a broader set (or the override)
instead of the Hopper-only value.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 39bd74c (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 39bd74c (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 4e151cb (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 4e151cb (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


         with get_train_context(False, False, context_parallel_ctx)():
-            with torch.autocast(device_type="cuda", dtype=self.dtype):
+            with nullcontext():
adil-a (Author) commented:

Hey @terrykong @hemildesai is this going to be a problem? Do we revert back to autocast?

Contributor reply:

We should make it configurable and keep the default to be autocast for backward compatibility.
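
One way that could look, sketched purely for discussion (the flag name use_autocast and the cfg access shown in the usage comment are hypothetical, not existing NeMo-RL options):

from contextlib import nullcontext

import torch


def train_autocast_ctx(use_autocast: bool, dtype: torch.dtype):
    # Default to autocast for backward compatibility; opt out via config to run without it.
    if use_autocast:
        return torch.autocast(device_type="cuda", dtype=dtype)
    return nullcontext()


# Possible usage at the call site shown in the diff above (illustrative only):
# with get_train_context(False, False, context_parallel_ctx)():
#     with train_autocast_ctx(self.cfg["use_autocast"], self.dtype):
#         ...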

name = "deep_ep"
# This version has to match the version in the commit/rev/tag used
version = "v1.1.0+e3908bf"
version = "v1.2.1+bfded34"
adil-a (Author) commented:

Hey @terrykong @yaoyu-33 can you comment on whether this deepep version bump is ok?

Contributor reply:

Can we try with the old version with the TORCH_CUDA_ARCH_LIST change?

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 4b6ce6d (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 4b6ce6d (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: ef2f92c (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: ef2f92c (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@adil-a adil-a added the CI:L0 Run doctests and unit tests label Nov 5, 2025
Signed-off-by: adil-a <[email protected]>
@adil-a adil-a requested a review from a team as a code owner November 5, 2025 19:34

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 1eef903 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 1eef903 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


github-actions bot commented Nov 6, 2025

⚠️ File Consistency Check

Check based on commit: 24214e9 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 6, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 24214e9 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

terrykong (Contributor) commented:

@adil-a what's the current status of this PR?
