
Conversation


@adil-a adil-a commented Nov 4, 2025

What does this PR do ?

Adds GPT-OSS SFT using AutoModel custom models + DeepEP.

To run, launch the nightly container and execute:

NRL_FORCE_REBUILD_VENVS=true uv run examples/run_sft.py --config examples/configs/recipes/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.yaml cluster.gpus_per_node=8 logger.wandb_enabled=false

adil-a and others added 17 commits November 1, 2025 17:23
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
Signed-off-by: adil-a <[email protected]>
adil-a changed the title from "Hemil/automodel moe" to "feat: DTensorPolicyV2 GPT-OSS support" on Nov 5, 2025
Signed-off-by: adil-a <[email protected]>
@adil-a adil-a marked this pull request as ready for review November 5, 2025 06:50
@adil-a adil-a requested review from a team as code owners November 5, 2025 06:50

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: e936ebf (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: e936ebf (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 7df0cc5 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7df0cc5 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: b4139f1 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: a55a2f1 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: a55a2f1 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


coderabbitai bot commented Nov 5, 2025

📝 Walkthrough

The pull request updates the Automodel submodule pointer, integrates Automodel Checkpointer into DTensorPolicyWorkerV2 for improved checkpoint handling, adds a comprehensive SFT training configuration example, removes the legacy automodel_checkpoint utility module, updates package dependencies (grouped_gemm, transformer-engine, deep_ep), and enhances venv setup with CUDA architecture configuration.
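
As a rough illustration of the integration described above, the sketch below shows how checkpoint-format detection and a lazily created Checkpointer could fit together. The helper names (detect_checkpoint_format, _ensure_checkpointer) are taken from the changes listed below; their bodies, the file-layout heuristics, and the _StubCheckpointer stand-in for nemo_automodel's Checkpointer are assumptions for illustration only, not the PR's actual code.

# Illustrative sketch only; see the caveats above.
from pathlib import Path
from typing import Optional


def detect_checkpoint_format(path: str) -> str:
    """Guess whether a checkpoint directory holds HF safetensors or a DCP-style shard layout."""
    root = Path(path)
    if any(root.glob("*.safetensors")):
        return "safetensors"
    if any(root.glob("__*_*.distcp")) or (root / ".metadata").exists():
        return "dcp"
    return "unknown"


class _StubCheckpointer:
    """Stand-in for nemo_automodel's Checkpointer; the real constructor arguments may differ."""

    def __init__(self, checkpoint_dir: str, fmt: str):
        self.checkpoint_dir = checkpoint_dir
        self.fmt = fmt

    def load(self, model):
        # The real Checkpointer would restore (possibly sharded) weights into `model` here.
        print(f"loading {self.fmt} checkpoint from {self.checkpoint_dir}")


class PolicyWorkerSketch:
    """Mirrors only the lazy-Checkpointer idea, not DTensorPolicyWorkerV2 itself."""

    def __init__(self) -> None:
        self._checkpointer: Optional[_StubCheckpointer] = None

    def _ensure_checkpointer(self, path: str) -> _StubCheckpointer:
        # Create the Checkpointer once and reuse it for subsequent save/load calls.
        if self._checkpointer is None:
            self._checkpointer = _StubCheckpointer(path, detect_checkpoint_format(path))
        return self._checkpointer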

Changes

Cohort / File(s) / Summary

  • Automodel Submodule & Configuration (3rdparty/Automodel-workspace/Automodel, examples/configs/sft_automodel.yaml):
    Submodule pointer updated to commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7. New comprehensive YAML configuration for SFT training with detailed sections for model scale (gpt-oss-20b), mixed-precision settings, distributed training, optimizer/scheduler, and logging backends (wandb, tensorboard, mlflow).
  • Checkpoint Management Integration (nemo_rl/models/policy/dtensor_policy_worker_v2.py):
    Replaced direct state-dict loading with the Automodel Checkpointer workflow. Added checkpoint format detection (detect_checkpoint_format, _infer_checkpoint_root), persistent Checkpointer management (_ensure_checkpointer), and updated save/load paths. New imports from nemo_automodel (Checkpointer, FSDP2Manager, moe_parallelize_model). Integrated MoE/adapter model parallelization and distributed gradient utilities.
  • Checkpoint Utility Migration (nemo_rl/utils/automodel_checkpoint.py):
    Entire file removed; functionality delegated to the integrated Checkpointer in DTensorPolicyWorkerV2 and Automodel APIs.
  • Import Path Updates (nemo_rl/models/policy/utils.py):
    Import path for NeMo Automodel classes updated from nemo_automodel.components._transformers.auto_model to nemo_automodel._transformers.auto_model.
  • Infrastructure & Dependencies (nemo_rl/utils/venvs.py, pyproject.toml):
    venvs.py: added explicit shutil.rmtree cleanup and TORCH_CUDA_ARCH_LIST environment setup ("9.0 10.0 12.0"). pyproject.toml: added grouped_gemm and transformer-engine[pytorch] (2.8.0), updated the deep_ep commit hash, extended override-dependencies (opencv-python-headless>=4.11.0), and modified build-isolation mappings (causal-conv1d, grouped-gemm).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DTensorPolicyWorkerV2
    participant Checkpointer
    participant FSDP2Manager
    participant HFModel
    participant Cache

    User->>DTensorPolicyWorkerV2: load_checkpoint(path, ...)
    DTensorPolicyWorkerV2->>DTensorPolicyWorkerV2: detect_checkpoint_format(path)
    DTensorPolicyWorkerV2->>DTensorPolicyWorkerV2: _infer_checkpoint_root(path)
    DTensorPolicyWorkerV2->>HFModel: initialize empty model
    DTensorPolicyWorkerV2->>FSDP2Manager: parallelize model (FSDP2)
    DTensorPolicyWorkerV2->>DTensorPolicyWorkerV2: _ensure_checkpointer(config)
    DTensorPolicyWorkerV2->>Checkpointer: load via Automodel API
    Checkpointer->>Cache: fetch or download weights
    Checkpointer->>HFModel: apply loaded state dict
    HFModel-->>DTensorPolicyWorkerV2: model ready
    DTensorPolicyWorkerV2-->>User: checkpoint loaded

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py: Substantial refactoring with new Checkpointer integration, format detection logic, FSDP2Manager usage, and MOE parallelization paths; requires careful validation of state-dict handling and rank-distributed checkpoint flow.
  • File deletion (nemo_rl/utils/automodel_checkpoint.py): Verify that all call sites have been migrated to the new Checkpointer-based approach in DTensorPolicyWorkerV2.
  • Import path changes: Confirm that the updated import path in utils.py and new imports in dtensor_policy_worker_v2.py align with Automodel's actual structure.
  • pyproject.toml dependency updates: Validate that new build-isolation mappings (grouped-gemm, causal-conv1d) and pinned versions are compatible with existing toolchain.

Possibly related PRs

Suggested labels

CI:L1, CI

Suggested reviewers

  • terrykong
  • yaoyu-33

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning
    Explanation: PR contains major changes (checkpoint system refactoring, utility module removal, new format detection), but the PR description lacks test results documentation, all testing checklist items remain unchecked, and review comments indicate critical untested issues.
    Resolution: Document test results (unit/integration tests, performance comparisons, convergence validation), fix the critical review issues, and check all testing checklist items with evidence of local test execution.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title "feat: DTensorPolicyV2 GPT-OSS support" directly corresponds to the primary change: adding GPT-OSS model support to DTensorPolicyWorkerV2 through Automodel integration.
  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8762f57 and 7df0cc5.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • 3rdparty/Automodel-workspace/Automodel (1 hunks)
  • examples/configs/sft_automodel.yaml (1 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py (20 hunks)
  • nemo_rl/models/policy/utils.py (1 hunks)
  • nemo_rl/utils/automodel_checkpoint.py (0 hunks)
  • nemo_rl/utils/venvs.py (2 hunks)
  • pyproject.toml (4 hunks)
💤 Files with no reviewable changes (1)
  • nemo_rl/utils/automodel_checkpoint.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/utils.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/utils/venvs.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/utils.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/utils/venvs.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/sft_automodel.yaml
🧠 Learnings (4)
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults

Applied to files:

  • nemo_rl/models/policy/utils.py
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.

Applied to files:

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to **/*.sh : Use `uv run` to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling `python` directly

Applied to files:

  • nemo_rl/utils/venvs.py
🧬 Code graph analysis (1)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (3)
nemo_rl/utils/checkpoint.py (1)
  • CheckpointingConfig (36-67)
nemo_rl/models/policy/dtensor_policy_worker.py (3)
  • create_context_parallel_ctx (456-480)
  • get_cpu_state_dict (104-134)
  • load_checkpoint (1920-1930)
nemo_rl/models/dtensor/parallelize.py (1)
  • to_local_if_dtensor (709-715)
🪛 GitHub Actions: Automodel Integration and Submodule Checks
3rdparty/Automodel-workspace/Automodel

[error] 1-1: One or more submodules are not fast-forwarded. Automodel: Commits have DIVERGED from a common ancestor. Please ensure submodule commits are fast-forwards of the main branch.

🪛 Ruff (0.14.3)
nemo_rl/models/policy/dtensor_policy_worker_v2.py

2105-2105: Loop control variable root not used within loop body

Rename unused root to _root

(B007)


2105-2105: Loop control variable dirs not used within loop body

Rename unused dirs to _dirs

(B007)

🔇 Additional comments (1)
nemo_rl/models/policy/utils.py (1)

32-32: Verify the new import path is valid in the updated nemo_automodel submodule.

The import change appears scoped and intentional—only _transformers is being reorganized out of .components, while other submodules remain unchanged. The try-except block provides appropriate fallback handling, and the imported classes are actively used in AUTOMODEL_FACTORY and resolve_model_class().

However, I could not locate test files confirming the new import path works. Please verify:

  1. The new path (nemo_automodel._transformers.auto_model) exists and is importable in the updated nemo_automodel submodule version
  2. Tests pass with this change, or if untested, manually confirm the import succeeds at runtime
  3. NEMO_AUTOMODEL_AVAILABLE resolves to True in your environment after this change
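
For reference, the comment above describes an import-fallback of roughly this shape (a sketch only: the class name NeMoAutoModelForCausalLM is a placeholder, and the exact structure in nemo_rl/models/policy/utils.py may differ):

try:
    # New layout: _transformers promoted out of .components
    from nemo_automodel._transformers.auto_model import NeMoAutoModelForCausalLM
    NEMO_AUTOMODEL_AVAILABLE = True
except ImportError:
    try:
        # Older layout, kept as a fallback for previous submodule revisions
        from nemo_automodel.components._transformers.auto_model import NeMoAutoModelForCausalLM
        NEMO_AUTOMODEL_AVAILABLE = True
    except ImportError:
        # nemo_automodel is unavailable; callers are expected to check this flag before use
        NeMoAutoModelForCausalLM = None
        NEMO_AUTOMODEL_AVAILABLE = False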

@@ -1 +1 @@
-Subproject commit a2db048383cd54b3fafc928df4c30bf7bbf7c430
+Subproject commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical: Submodule commit divergence blocks integration.

The pipeline reports that the Automodel submodule commits have diverged from a common ancestor and are not fast-forwarded. This is a blocker that must be resolved before merging.

Submodule divergence can:

  • Break reproducibility and introduce non-deterministic behavior
  • Cause merge conflicts and complicate future submodule updates
  • Obscure commit history and make debugging harder

The AI summary indicates significant integration work (Automodel Checkpointer integration in DTensorPolicyWorkerV2, removal of legacy automodel_checkpoint utility). This context makes it critical to understand what changes in the Automodel submodule underpin these refactors and ensure the submodule update is intentional and correct.

Please clarify and resolve the divergence:

  1. Understand the divergence: Run the verification script below to inspect the commit history and identify which commits diverged and why.
  2. Align the branch: Consider whether the divergence should be resolved by:
    • Rebasing the Automodel branch onto main to make it a proper fast-forward
    • Explicitly merging main into the Automodel branch if concurrent work requires it
    • Selecting specific commits via cherry-pick if only certain commits are needed (with clear justification in the PR description)
  3. Document the strategy: Update the PR description to explain why this submodule update was chosen and how it relates to the DTensorPolicyWorkerV2 and Checkpointer integration work.

🏁 Script executed:

#!/bin/bash
# Description: Inspect the Automodel submodule divergence to understand the commit history.

echo "=== Current submodule pointer ==="
cat 3rdparty/Automodel-workspace/Automodel

echo -e "\n=== Submodule remote info ==="
cd 3rdparty/Automodel-workspace/Automodel
git log --oneline -20
git log --oneline --all --graph | head -30
cd - > /dev/null

echo -e "\n=== Check if current commit is reachable from main ==="
cd 3rdparty/Automodel-workspace/Automodel
if git merge-base --is-ancestor 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 origin/main; then
  echo "Current commit IS reachable from origin/main (fast-forward possible)"
else
  echo "Current commit is NOT reachable from origin/main (divergence exists)"
  echo "Commits in current branch not in main:"
  git log origin/main..5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 --oneline
  echo -e "\nCommits in main not in current branch:"
  git log 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7..origin/main --oneline | head -20
fi

Length of output: 2733


I need to investigate further. The script revealed a critical issue: the target commit doesn't exist in the remote repository. Let me verify the actual state of the submodule.


🏁 Script executed:

#!/bin/bash
# Verify the actual submodule state and check if the target commit exists locally

cd 3rdparty/Automodel-workspace/Automodel

echo "=== Current HEAD ==="
git rev-parse HEAD
git log -1 --oneline

echo -e "\n=== Check if target commit exists in any local ref ==="
if git cat-file -t 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 2>/dev/null; then
  echo "Target commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 EXISTS locally"
  git log -1 --oneline 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7
else
  echo "Target commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 DOES NOT exist locally"
fi

echo -e "\n=== Check if old commit exists ==="
if git cat-file -t a2db048383cd54b3fafc928df4c30bf7bbf7c430 2>/dev/null; then
  echo "Old commit a2db048383cd54b3fafc928df4c30bf7bbf7c430 EXISTS locally"
  git log -1 --oneline a2db048383cd54b3fafc928df4c30bf7bbf7c430
else
  echo "Old commit a2db048383cd54b3fafc928df4c30bf7bbf7c430 DOES NOT exist locally"
fi

echo -e "\n=== List all local refs for both commits ==="
git show-ref | grep -E "5e995e9535e63cbe3358dc2bd81a8ed3a696cee7|a2db048383cd54b3fafc928df4c30bf7bbf7c430" || echo "No matching refs found"

echo -e "\n=== Check .git/config for remote URL ==="
git config --get remote.origin.url

Length of output: 471


Critical: Submodule pointer references non-existent commit—PR is un-mergeable.

The target commit 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 does not exist in the Automodel repository. This is not a divergence issue—the commit is simply missing and cannot be resolved.

Current state:

  • Submodule HEAD: 7df0cc5 ("uv lock")
  • PR target: 5e995e9535e63cbe3358dc2bd81a8ed3a696cee7 (does not exist)
  • Remote: https://github.com/NVIDIA-NeMo/RL.git (correct)

This commit pointer must be corrected or regenerated. Options:

  1. Update the pointer to an existing commit in the Automodel repository (e.g., 7df0cc5 if that's the intended state, or another valid commit)
  2. Push the intended commit to the Automodel remote and then reference it
  3. Force-push to update the submodule to the correct branch/commit

Without a valid commit pointer, checkout and CI will fail.

🧰 Tools
🪛 GitHub Actions: Automodel Integration and Submodule Checks

[error] 1-1: One or more submodules are not fast-forwarded. Automodel: Commits have DIVERGED from a common ancestor. Please ensure submodule commits are fast-forwards of the main branch.

🤖 Prompt for AI Agents
In 3rdparty/Automodel-workspace/Automodel around lines 1 to 1, the submodule
HEAD points to a non-existent commit (5e995e9535e63cbe3358dc2bd81a8ed3a696cee7)
which makes the PR unmergeable; fix it by updating the submodule ref to a valid
commit (for example change to the existing 7df0cc5 or another confirmed commit
SHA), or alternatively push the missing commit to the Automodel remote and
update the submodule to reference that new commit, then commit the updated
.gitmodules/.git link and ensure CI/checkout succeeds.

Comment on lines 423 to 428

         self.optimizer = optimizer_cls(
-            self.model.parameters(), **self.cfg["optimizer"]["kwargs"]
+            self.model.parameters(),
+            **self.cfg["optimizer"]["kwargs"],
+            exp_avg_dtype=torch.bfloat16,
+            exp_avg_sq_dtype=torch.bfloat16,
         )

⚠️ Potential issue | 🔴 Critical

Don’t inject Transformer-Engine-only kwargs into arbitrary optimizers.

Lines 423-428 always pass exp_avg_dtype/exp_avg_sq_dtype. That works for TE’s FusedAdam, but any existing config that still points to torch.optim.AdamW (or similar) will now throw TypeError: __init__() got an unexpected keyword argument 'exp_avg_dtype'. Please gate those kwargs behind a TE-specific check so the generic optimizer path keeps working.

@@
-        self.optimizer = optimizer_cls(
-            self.model.parameters(),
-            **self.cfg["optimizer"]["kwargs"],
-            exp_avg_dtype=torch.bfloat16,
-            exp_avg_sq_dtype=torch.bfloat16,
-        )
+        optimizer_kwargs = dict(self.cfg["optimizer"]["kwargs"])
+        if optimizer_cls.__module__.startswith(
+            "transformer_engine.pytorch.optimizers"
+        ):
+            optimizer_kwargs.setdefault("exp_avg_dtype", torch.bfloat16)
+            optimizer_kwargs.setdefault("exp_avg_sq_dtype", torch.bfloat16)
+
+        self.optimizer = optimizer_cls(
+            self.model.parameters(),
+            **optimizer_kwargs,
+        )
🤖 Prompt for AI Agents
In nemo_rl/models/policy/dtensor_policy_worker_v2.py around lines 423-428, avoid
passing Transformer-Engine-specific keyword args to arbitrary optimizers; detect
whether optimizer_cls is the TE fused optimizer (e.g., by checking the
class/module name or a TE-specific attribute) and only include exp_avg_dtype and
exp_avg_sq_dtype when that check is true; for non-TE optimizers call
optimizer_cls(self.model.parameters(), **self.cfg["optimizer"]["kwargs"]) so
existing torch.optim.* classes don’t receive unexpected kwargs.

Comment on lines +92 to +94
if "TORCH_CUDA_ARCH_LIST" not in env:
env["TORCH_CUDA_ARCH_LIST"] = "9.0 10.0 12.0"


⚠️ Potential issue | 🟠 Major

Avoid Hopper-only default for TORCH_CUDA_ARCH_LIST.

Line 92 forces TORCH_CUDA_ARCH_LIST to "9.0 10.0 12.0" whenever the variable is unset. That drops support for Ampere (sm80/sm86), which is still a common deployment target; any native extensions built in this venv (grouped_gemm, DeepEP, etc.) will then miss sm80 kernels and crash at runtime with “no kernel image is available for execution on the device.” Please include Ampere in the default (or derive the list from an override/env hint) so older clusters don’t regress.

@@
-    if "TORCH_CUDA_ARCH_LIST" not in env:
-        env["TORCH_CUDA_ARCH_LIST"] = "9.0 10.0 12.0"
+    if "TORCH_CUDA_ARCH_LIST" not in env:
+        env[
+            "TORCH_CUDA_ARCH_LIST"
+        ] = os.environ.get(
+            "NEMO_RL_TORCH_CUDA_ARCH_LIST",
+            "8.0 8.6 9.0 10.0",
+        )
🤖 Prompt for AI Agents
In nemo_rl/utils/venvs.py around lines 92 to 94, the default
TORCH_CUDA_ARCH_LIST currently sets only Hopper archs ("9.0 10.0 12.0"), which
omits Ampere (sm80/sm86) and causes missing-kernel runtime crashes; update the
logic so the default includes Ampere (e.g., include sm80 and/or sm86) or derive
the list from an env override/hint (check CUDA_VISIBLE_DEVICES or a provided
override var) and set TORCH_CUDA_ARCH_LIST to a broader set (or the override)
instead of the Hopper-only value.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 39bd74c (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 39bd74c (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 4e151cb (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 4e151cb (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


         with get_train_context(False, False, context_parallel_ctx)():
-            with torch.autocast(device_type="cuda", dtype=self.dtype):
+            with nullcontext():
adil-a (Author) commented:

Hey @terrykong @hemildesai is this going to be a problem? Do we revert back to autocast?

Contributor reply:

We should make it configurable and keep the default to be autocast for backward compatibility.
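
One way that could look, sketched purely for discussion (the flag name use_autocast and the cfg access shown in the usage comment are hypothetical, not existing NeMo-RL options):

from contextlib import nullcontext

import torch


def train_autocast_ctx(use_autocast: bool, dtype: torch.dtype):
    # Default to autocast for backward compatibility; opt out via config to run without it.
    if use_autocast:
        return torch.autocast(device_type="cuda", dtype=dtype)
    return nullcontext()


# Possible usage at the call site shown in the diff above (illustrative only):
# with get_train_context(False, False, context_parallel_ctx)():
#     with train_autocast_ctx(self.cfg["use_autocast"], self.dtype):
#         ...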

name = "deep_ep"
# This version has to match the version in the commit/rev/tag used
version = "v1.1.0+e3908bf"
version = "v1.2.1+bfded34"
adil-a (Author) commented:

Hey @terrykong @yaoyu-33 can you comment on whether this deepep version bump is ok?

Contributor reply:

Can we try with the old version with the TORCH_CUDA_ARCH_LIST change?

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 4b6ce6d (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 4b6ce6d (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: adil-a <[email protected]>

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: ef2f92c (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: ef2f92c (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@adil-a adil-a added the CI:L0 Run doctests and unit tests label Nov 5, 2025
Signed-off-by: adil-a <[email protected]>
@adil-a adil-a requested a review from a team as a code owner November 5, 2025 19:34

github-actions bot commented Nov 5, 2025

⚠️ File Consistency Check

Check based on commit: 1eef903 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 5, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 1eef903 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


github-actions bot commented Nov 6, 2025

⚠️ File Consistency Check

Check based on commit: 24214e9 (PR #1470 from hemil/automodel-moe)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


github-actions bot commented Nov 6, 2025

❌ Submodule Fast-Forward Check Failed

Check based on commit: 24214e9 (PR #1470 from hemil/automodel-moe)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1470 from hemil/automodel-moe): https://github.com/NVIDIA-NeMo/Automodel/commits/5e995e9535e63cbe3358dc2bd81a8ed3a696cee7/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

terrykong (Contributor) commented:

@adil-a what's the current status of this PR?
