-
Notifications
You must be signed in to change notification settings - Fork 169
feat: Support qwen3-next, mcore path #1530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Yu Yao <[email protected]>
…1341) Signed-off-by: Guyue Huang <[email protected]> Signed-off-by: Guyue Huang <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…#1334) Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]> fix more stuff Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Ahmad Kiswani <[email protected]>
Signed-off-by: Ahmad Kiswani <[email protected]>
fixed dpo conflict ? Signed-off-by: Ahmad Kiswani <[email protected]>
Signed-off-by: Ahmad Kiswani <[email protected]>
❌ Submodule Fast-Forward Check FailedCheck based on commit: ae5d7ac (PR #1530 from ✅ Submodules that are properly updated:Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward) ❌ Submodules that need attention:Automodel: ❌ Commits have DIVERGED from a common ancestor Please ensure all submodule commits are fast-forwards of the main branch before merging. |
📝 WalkthroughWalkthroughThis PR updates NeMo-RL to support Qwen3-Next-80B model integration with Megatron-LM, adding new GRPO and SFT test recipes and configurations, refactoring policy configuration type signatures, updating submodule pointers and dependencies, and removing some validation logic from MegatronPolicyWorker. Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test Runner
participant Experiment as Training Experiment
participant Logging as Log Handler
participant Metrics as Metrics Validator
Test->>Test: Initialize config (NUM_NODES, MAX_STEPS)
Test->>Test: Check if max steps reached
alt Max steps not reached
Test->>Experiment: Run training (uv run)
Experiment-->>Experiment: Training loop
Experiment-->>Logging: Write tensorboard logs
Test->>Logging: Convert TB logs to JSON
Test->>Metrics: Parse JSON metrics
Metrics->>Metrics: Validate conditions<br/>(loss, reward/val_loss, timing)
alt All metrics pass
Metrics-->>Test: Success
else Metrics fail
Metrics-->>Test: Failure
end
else Max steps reached
Test->>Test: Exit early
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
304-310: Move per-file-ignores back to [tool.ruff.lint.per-file-ignores] to align with current ruff configuration standards.The recommended placement for per-file-ignores in pyproject.toml is under [tool.ruff.lint.per-file-ignores], but the code uses the top-level
[tool.ruff.per-file-ignores]placement, which appears in older examples. While ruff still recognizes both formats, using the nested structure maintains consistency with current ruff documentation and best practices.
🧹 Nitpick comments (4)
nemo_rl/models/policy/__init__.py (3)
15-15: Minor: consider unifying union styleMixing
Union[...](formax_grad_norm) with|unions elsewhere is legal but a bit inconsistent. You may want to standardize on one style for readability, preferably|now that you’re on Python 3.10+.
33-39: SequencePackingConfig now requiresalgorithmMaking
algorithm: strrequired means every config that turns sequence packing on must supply an algorithm name. That’s fine, but it does tighten the config surface; please double‑check all existingsequence_packinguses (e.g., seqpack SFT/GRPO recipes) specifyalgorithm, or considerNotRequired[str]if you still want a default.
71-79: Scheduler typings align with milestones pattern but could be clearer
lr_decay_iters: intinMegatronSchedulerConfigandmilestones: NotRequired[list[int]]inSinglePytorchSchedulerConfigmatch the “top‑levelmilestonesalongside scheduler config” pattern used in GRPO/DPO/distillation configs.- Allowing
PolicyConfig.scheduler: list[SinglePytorchSchedulerConfig] | SchedulerMilestonescaptures both the standard list form and the milestones‑only dict form, but the union is a bit opaque for readers and tools.If this proves confusing, a follow‑up refactor to introduce a dedicated TypedDict for the “SequentialLR + milestones” shape (instead of raw
dict[str, list[int]]) would make the config schema more self‑documenting. Based on learnings.Also applies to: 133-137, 139-140, 178-179
examples/configs/recipes/llm/sft-qwen3-next-80ba3b-8n8g-megatron.yaml (1)
1-7: Clarifypolicy.optimizer: nullsemanticsYou’re inheriting from
sft.yamland explicitly settingpolicy.optimizer: nullwhile relying onpolicy.megatron_cfg.optimizerfor Megatron. That’s reasonable, but only if downstream code treatsoptimizeras optional / nullable rather than using'optimizer' in policy_cfgor assuming it’s always a mapping.If the intent is “no PyTorch optimizer, only Megatron optimizer”, please confirm the loader logic handles
Nonecorrectly; otherwise you might prefer removing the key entirely at the base config or gating onmegatron_cfg.enabledinstead.Also applies to: 8-19
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (21)
.gitignore(1 hunks).gitmodules(1 hunks)3rdparty/Automodel-workspace/Automodel(1 hunks)3rdparty/Megatron-Bridge-workspace/Megatron-Bridge(1 hunks)3rdparty/Megatron-Bridge-workspace/setup.py(1 hunks)3rdparty/Megatron-LM-workspace/Megatron-LM(1 hunks)examples/configs/grpo_math_1B.yaml(1 hunks)examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml(1 hunks)examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml(1 hunks)examples/configs/recipes/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.yaml(1 hunks)examples/configs/recipes/llm/sft-qwen3-next-80ba3b-8n8g-megatron.yaml(1 hunks)examples/configs/sft.yaml(3 hunks)examples/configs/vlm_grpo_3B.yaml(3 hunks)examples/configs/vlm_grpo_3B_megatron.yaml(4 hunks)nemo_rl/models/policy/__init__.py(6 hunks)nemo_rl/models/policy/megatron_policy_worker.py(3 hunks)nemo_rl/models/policy/utils.py(1 hunks)pyproject.toml(4 hunks)tests/test_suites/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.sh(1 hunks)tests/test_suites/llm/sft-qwen3-next-80ba3b-8n8g-megatron.sh(1 hunks)tests/test_suites/nightly.txt(2 hunks)
🧰 Additional context used
🧠 Learnings (8)
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
Applied to files:
examples/configs/grpo_math_1B.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.shtests/test_suites/llm/sft-qwen3-next-80ba3b-8n8g-megatron.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/test_suites/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.sh
📚 Learning: 2025-09-18T13:26:43.307Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:19-26
Timestamp: 2025-09-18T13:26:43.307Z
Learning: In on-policy distillation workflows, validation can use downstream task performance (like math problem solving) as RL-like reward metrics rather than traditional distillation metrics like KL divergence. In this case, "val_reward" with "higher_is_better: true" is the correct checkpoint monitoring configuration.
Applied to files:
examples/configs/vlm_grpo_3B.yamlexamples/configs/vlm_grpo_3B_megatron.yamlexamples/configs/sft.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
Applied to files:
examples/configs/recipes/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.yaml
📚 Learning: 2025-09-17T01:52:21.399Z
Learnt from: ffrujeri
Repo: NVIDIA-NeMo/RL PR: 1023
File: nemo_rl/utils/checkpoint.py:58-65
Timestamp: 2025-09-17T01:52:21.399Z
Learning: model_state_dict_keys is not intended to be part of the nemo-rl CheckpointingConfig TypedDict - it's handled at the automodel implementation layer, not as a general checkpointing configuration parameter.
Applied to files:
nemo_rl/models/policy/megatron_policy_worker.pynemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-19T02:44:38.451Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:73-84
Timestamp: 2025-09-19T02:44:38.451Z
Learning: The scheduler configuration format with a separate "milestones: [20]" entry (not wrapped under name/kwargs) is a valid and established pattern used across GRPO, DPO, and distillation configs in the NeMo RL codebase. This format specifies transition points between different schedulers (e.g., LinearLR for warmup steps, then ConstantLR).
Applied to files:
nemo_rl/models/policy/__init__.py
📚 Learning: 2025-09-24T18:36:06.287Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml:1-1
Timestamp: 2025-09-24T18:36:06.287Z
Learning: In the NVIDIA NeMo RL repository, when working with Hydra config defaults, the scalar string format (defaults: ../../dpo.yaml) is acceptable and preferred over the list format, even though Hydra typically expects defaults to be a list.
Applied to files:
examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml
🧬 Code graph analysis (1)
nemo_rl/models/policy/__init__.py (1)
nemo_rl/models/generation/interfaces.py (1)
GenerationConfig(118-131)
🪛 GitHub Actions: Automodel Integration and Submodule Checks
3rdparty/Automodel-workspace/Automodel
[error] 1-1: Submodule Automodel has diverged from the main branch. TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/ CURRENT (PR #1530 from ahmadki/qwen3-next): https://github.com/NVIDIA-NeMo/Automodel/commits/45466962771e9aac6f6f26eb05a0b287983a9b33/ This indicates parallel development; manual merge may be required.
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
tests/test_suites/llm/sft-qwen3-next-80ba3b-8n8g-megatron.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: semantic-pull-request / semantic-pull-request
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (19)
nemo_rl/models/policy/utils.py (1)
295-295: No changes required; the commented assertion is an intentional workaround.The commented-out assertion on line 295 is part of a documented temporary workaround for HuggingFace issue #38002 (referenced in the TODO on line 290). This is not dead code but rather a controlled relaxation of validation to support models like qwen3-next that may lack the
sliding_windowattribute while havinguse_sliding_window=False.Backward compatibility is maintained: the function sets
"sliding_window": Noneregardless of whether the attribute exists, and existing tests intests/unit/models/policy/test_dtensor_worker.pyandtests/unit/models/policy/test_dtensor_worker_v2.pyextensively exercise this code path with qwen2 and qwen3 models, confirming that models both with and without the attribute continue to work.pyproject.toml (3)
54-54: No critical issues found, but verify build-time dependency duplication.Verification results:
- ✓ Torch 2.8.0 compatibility: transformers ≥4.45.0 (required by rwkv-fla) is compatible with torch 2.8.0
- ✓ Version string intentional:
0.7.202508221413is a specific date-based build identifier (2025-08-22 14:13)- Duplication context:
rwkv-fla>=0.7.202508221413appears in both main dependencies (line 54) and mcore optional-dependencies (line 104) as separate dependency groups. This pattern may be intentional for mcore-specific builds or could be redundant; the codebase doesn't provide clear indication either way.Recommend confirming with the team whether both main and mcore sections need rwkv-fla, or if mcore inherits from main dependencies.
28-28: Pin is correct and necessary — Qwen3-Next support justifies the specific version.Qwen3-Next was added to Transformers in v4.57.0 (September 2025), making v4.57.1 the stable patch release required for Qwen3-Next support. Earlier versions are incompatible, so the specific pinning prevents accidental downgrades to unsupported versions. v4.57.1 is a small patch release with bug fixes and no breaking API changes, confirming compatibility with the rest of the dependency stack.
50-50: Remove the mlflow constraint concern—the original premise is invalid.MLflow 3.6.0 was released Nov 7, 2025, and MLflow 4.0 has not been released. The constraint relaxation from
>=3.5.0,<3.6.0to baremlflowsimply allows newer 3.x releases without re-pinning, which is reasonable during development. There is no exposure to 4.x breaking changes..gitignore (1)
50-52: Good practice to exclude local environment variable files.Adding
.envand.envrcto the gitignore prevents accidental commits of local development configurations and secrets. This aligns with standard development practices.examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml (1)
1-3: Minor formatting adjustment to defaults list.The reordering of defaults entries has no semantic impact on configuration inheritance. Both entries remain functionally identical in their order.
examples/configs/grpo_math_1B.yaml (1)
154-160: Parameter verified as valid—approve changes.The
average_in_collectiveparameter is confirmed as a valid boolean in Megatron-LM's DistributedDataParallelConfig, where it controls gradient averaging inside the collective operation. The configuration change is appropriate and consistent with Megatron-LM's API.examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml (2)
1-3: YAML list syntax correction is correct.The fix properly restores valid YAML list syntax by removing extraneous dashes from each list item. This enables Hydra to correctly parse the defaults configuration. With multiple defaults items, a proper YAML list structure (single dash per item) is required.
2-3: Configuration references verified and are compatible.Both config files exist, contain valid YAML syntax, and are compatible:
examples/configs/grpo_math_1B.yaml✓examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml✓The second file correctly uses a
defaultsinheritance pattern with selective parameter overrides (e.g., adjusted training batch sizes, learning rate, and model specifics), which is a standard and compatible configuration composition approach.3rdparty/Megatron-LM-workspace/Megatron-LM (1)
1-1: Submodule update is targeted for Qwen3-Next support; verify through test execution.The Megatron-LM update is tied to git commit f9df826 ("qwen3-next mcore path recipes") on a custom fork branch specifically for RL+Qwen3 integration. Verification confirms:
- Dependency constraints are in place (
megatron-core>=0.15.0a0,<0.17.0) with consistency validation in setup.py- Integration is config-driven via Megatron-Bridge abstraction (YAML recipes), not direct code imports, reducing API coupling
- Functional tests exist (grpo_megatron.sh, sft_megatron.sh, dpo_megatron.sh, distillation_megatron.sh)
Recommended: Run the provided Megatron functional tests locally to validate integration and ensure no regressions with the Qwen3-Next recipes.
.gitmodules (1)
3-4: Confirm the purpose and timeline for the personal fork and feature branch.The verification confirms these are intentional changes to use your personal forks (
ahmadki/Megatron-LMandahmadki/Automodel) on matching feature branches (ahmadki/rl-qwen3-from-nemo-rl-submodule). This is part of active qwen3-next development, and the previous version also used a personal fork.However, no documented merge plan or timeline was found in the commit history. To clear this concern:
- Confirm whether these changes are intended to remain temporary until merged upstream to the official NVIDIA-NeMo repositories, or if they are intended as a longer-term pattern.
- If temporary, document the expected merge timeline or add a note in the code or PR description for clarity.
3rdparty/Megatron-Bridge-workspace/setup.py (1)
40-40: I'll search for the official release notes and breaking changes for megatron-core 0.16.0.The first search had limited details. Let me search for more specific information about megatron-core version history and the specific 0.16.0 release.
Let me search for more information about planned Megatron-Core releases and any announced breaking changes.
Verify that the version range actually targets intended megatron-core versions.
The latest released version on PyPI is megatron-core 0.14.0 (October 8, 2025). The dependency specification allows
>=0.15.0a0,<0.17.0, which permits versions 0.15.x and 0.16.x that have not yet been released. Since these versions don't exist publicly, it's not possible to verify breaking changes against them at this time.Clarify whether:
- You intend to support released versions (currently up to 0.14.0) or wait for future releases
- The lower bound
0.15.0a0is correct, or if0.14.0is the intended minimum- You have specific knowledge of breaking changes planned for 0.15.x or 0.16.x from NVIDIA's roadmap
nemo_rl/models/policy/__init__.py (3)
20-31: DTensor, dynamic batching, and sequence packing gating look consistentThe pattern of required
dtensor_cfg/dynamic_batchingblocks with anenabled: boolflag (and other fieldsNotRequired[...]) looks coherent and matches how the new recipes usedtensor_cfg.enabled=falseplusmegatron_cfg.enabled=true. This should keep the config surface explicit while allowing minimal stubs like{enabled: false}.Also applies to: 142-153, 169-173
81-88: Megatron config and DDP additions look correct
- Adding
average_in_collective: booltoMegatronDDPConfigmatches how new configs setdistributed_data_parallel_config.average_in_collective=true.- Moving several Megatron fields to
NotRequired[...]and wideningmax_grad_normtofloat | intmatches real‑world usage and should reduce spurious type errors, while keepingenabled: boolrequired ensures explicit selection of the Megatron path.No obvious inconsistencies with the new Megatron Qwen3‑next recipes.
Also applies to: 90-118, 176-176
121-126: Tokenizerchat_template_kwargsandlogprob_chunk_sizeextensionsTyping
chat_template_kwargsasNotRequired[dict[str, Any]]and adding optionallogprob_chunk_size: NotRequired[int]onPolicyConfiggives enough flexibility for different tokenizers and logprob sharding strategies without over‑constraining the schema. This matches the looser generation config typing pattern shown inGenerationConfig.Also applies to: 160-166
tests/test_suites/nightly.txt (1)
51-53: Nightly wiring for Qwen3-next GRPO/SFT looks goodThe new Qwen3-next GRPO and SFT Megatron scripts are registered under the expected GRPO/SFT sections with consistent naming; this should integrate cleanly into the nightly suite.
Also applies to: 76-78
examples/configs/recipes/llm/sft-qwen3-next-80ba3b-8n8g-megatron.yaml (1)
8-18: Overall SFT Qwen3-next Megatron recipe looks coherent
- dtensor is explicitly disabled while Megatron is enabled, aligning with the updated
PolicyConfigtypings.make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}is consistent with existing recipes that tie divisibility to TP size.- Data, logging, and cluster settings are in line with other large-model SFT recipes.
No additional structural issues spotted.
Also applies to: 31-47
tests/test_suites/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.sh (1)
5-13: GRPO Qwen3-next test harness matches existing patternsThe NUM_* config block,
exit_if_max_steps_reached,uv runinvocation, TB→JSON conversion, and jq/metric checks all mirror existing GRPO Megatron tests, including use ofcommon.envvariables and unquoted$@. No additional changes needed.Also applies to: 16-33
tests/test_suites/llm/sft-qwen3-next-80ba3b-8n8g-megatron.sh (1)
5-13: SFT Qwen3-next Megatron test script looks solidThe script follows the standard tests/common.env pattern (NUM_* vars,
exit_if_max_steps_reached,uv run, TB→JSON, gated metric checks). Metrics keyed at step 200 align withMAX_STEPS=200, and the bash syntax is correct.Also applies to: 16-40
| @@ -1 +1 @@ | |||
| Subproject commit a2db048383cd54b3fafc928df4c30bf7bbf7c430 | |||
| Subproject commit 45466962771e9aac6f6f26eb05a0b287983a9b33 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Resolve submodule divergence flagged by pipeline.
The GitHub Actions pipeline has detected that the Automodel submodule has diverged from the main branch. The current PR points to commit 45466962771e9aac6f6f26eb05a0b287983a9b33, while main is at a2db048383cd54b3fafc928df4c30bf7bbf7c430. The pipeline indicates "manual merge may be required," which could block PR integration.
Please clarify:
- Is this submodule divergence intentional for qwen3-next support?
- What commits were added in the new branch and why weren't they in main?
- Has this been coordinated with the Automodel maintainers, or does main need to be updated first?
Run the following script to investigate the divergence:
🏁 Script executed:
#!/bin/bash
# Description: Investigate Automodel submodule divergence between branches
# Show commit details for both the current PR commit and main branch commit
echo "=== PR Commit (45466962771e9aac6f6f26eb05a0b287983a9b33) ==="
curl -s https://api.github.com/repos/NVIDIA-NeMo/Automodel/commits/45466962771e9aac6f6f26eb05a0b287983a9b33 | jq '{sha: .sha, message: .commit.message, author: .commit.author.name, date: .commit.author.date}'
echo ""
echo "=== Main Branch Commit (a2db048383cd54b3fafc928df4c30bf7bbf7c430) ==="
curl -s https://api.github.com/repos/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430 | jq '{sha: .sha, message: .commit.message, author: .commit.author.name, date: .commit.author.date}'
echo ""
echo "=== Checking commit history to find common ancestor ==="
curl -s "https://api.github.com/repos/NVIDIA-NeMo/Automodel/compare/a2db048383cd54b3fafc928df4c30bf7bbf7c430...45466962771e9aac6f6f26eb05a0b287983a9b33" | jq '{ahead_by: .ahead_by, behind_by: .behind_by, diverged: (.ahead_by > 0 and .behind_by > 0)}'Length of output: 1152
Resolve submodule divergence—PR targets outdated Automodel commit.
The Automodel submodule has confirmed divergence: the PR points to commit 45466962771e9aac6f6f26eb05a0b287983a9b33 (Sept 10, 2025), while main is at a2db048383cd54b3fafc928df4c30bf7bbf7c430 (Oct 10, 2025). Both branches have evolved independently (109 commits ahead, 3 commits behind).
More importantly, main contains critical updates: transformer-engine 2.5 → 2.8 and CUDA 12.8 → 12.9 (applied Oct 10). The PR's commit predates these by ~1 month.
Action required:
- Confirm whether qwen3-next support intentionally requires the Sept 10 commit despite missing the latest CUDA 12.9 and TE 2.8 updates.
- If the latest main is compatible with qwen3-next, rebase/update the submodule pointer to
a2db048383cd54b3fafc928df4c30bf7bbf7c430. - If the Sept 10 commit is necessary, document why the newer CUDA/TE updates cannot be included.
🧰 Tools
🪛 GitHub Actions: Automodel Integration and Submodule Checks
[error] 1-1: Submodule Automodel has diverged from the main branch. TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/ CURRENT (PR #1530 from ahmadki/qwen3-next): https://github.com/NVIDIA-NeMo/Automodel/commits/45466962771e9aac6f6f26eb05a0b287983a9b33/ This indicates parallel development; manual merge may be required.
🤖 Prompt for AI Agents
In 3rdparty/Automodel-workspace/Automodel around lines 1-1 the PR points the
Automodel submodule at an outdated commit (4546696...) that is diverged from
main (a2db048...); you must either confirm the Sept 10 commit is intentionally
required or update the submodule to main — if qwen3-next does not need the older
commit, fetch the Automodel repo, checkout or fetch main at commit
a2db048383cd54b3fafc928df4c30bf7bbf7c430, update the superproject submodule
pointer to that commit (git submodule update --init; cd
3rdparty/Automodel-workspace/Automodel; git fetch && git checkout a2db048...; cd
../..; git add 3rdparty/Automodel-workspace/Automodel; git commit -m "bump
Automodel submodule to a2db048... (Oct 10)"), run CI/build and any
CUDA/transformer-engine compatibility tests, and push the submodule-pointer
commit; if the older Sept 10 commit is required, add a brief rationale in the PR
describing why the October updates (CUDA 12.9 / TE 2.8) cannot be used and
document any blockers and test results.
| @@ -1 +1 @@ | |||
| Subproject commit f003cd8ca3e4876853b6097e816f0a94ea8fefc1 | |||
| Subproject commit b6728896fd2215da024981169f30baaee1e70e56 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify submodule commit and dependency compatibility with version bumps.
The submodule pointer is being updated alongside dependency constraint changes (megatron-core <0.16.0 → <0.17.0, new transformers>=4.57.1). Ensure the new commit in Megatron-Bridge is compatible with these changes and that no breaking changes exist in the upgraded versions.
Run the following checks to verify the changes:
Additionally, verify that the setup.py dependency constraint updates are intentional and documented:
- Confirm megatron-core version bump from <0.16.0 to <0.17.0 aligns with Qwen3-Next support
- Confirm transformers>=4.57.1 requirement is necessary for model tokenization or features
🏁 Script executed:
#!/bin/bash
# Description: Verify Megatron-Bridge submodule commit exists and validate dependency metadata
# Check if the commit hash is valid and retrieve commit info
git ls-remote --heads --tags https://github.com/NVIDIA/Megatron-Bridge.git b6728896fd2215da024981169f30baaee1e70e56 || \
echo "Commit not found; checking if it exists in the repository..."
# Alternative: Clone and check directly
git clone --depth 1 https://github.com/NVIDIA/Megatron-Bridge.git /tmp/megatron-bridge-check 2>/dev/null && \
cd /tmp/megatron-bridge-check && \
git log --oneline -1 b6728896fd2215da024981169f30baaee1e70e56 2>/dev/null || \
echo "Commit verification: unable to confirm commit exists"
# Verify PyPI availability of dependencies with version constraints
echo "=== Checking PyPI for megatron-core versions ==="
pip index versions megatron-core 2>/dev/null | head -20
echo "=== Checking PyPI for transformers versions ==="
pip index versions transformers 2>/dev/null | head -20Length of output: 2025
Let me verify the actual setup.py file changes to understand the dependency constraints better:
🏁 Script executed:
#!/bin/bash
# Find and display setup.py to verify dependency version constraints
find . -name "setup.py" -type f | head -5 | while read setup_file; do
echo "=== File: $setup_file ==="
grep -n "megatron-core\|transformers" "$setup_file" || echo "No megatron-core or transformers found"
echo ""
done
# Also check setup.cfg and pyproject.toml if they exist
find . -name "setup.cfg" -o -name "pyproject.toml" | head -5 | while read config_file; do
echo "=== File: $config_file ==="
grep -n "megatron-core\|transformers" "$config_file" || echo "No megatron-core or transformers found"
echo ""
doneLength of output: 769
Let me check the git history to see what the previous dependency constraint was:
🏁 Script executed:
#!/bin/bash
# Check git history for setup.py changes to Megatron-Bridge-workspace
cd ./git && git log --oneline -20 -- 3rdparty/Megatron-Bridge-workspace/setup.py 2>/dev/null || echo "No git history available"
echo "=== Current setup.py dependency constraints ==="
grep -A2 -B2 "megatron-core\|transformers" 3rdparty/Megatron-Bridge-workspace/setup.py | head -20
echo ""
echo "=== Checking if .gitmodules exists for submodule info ==="
if [ -f .gitmodules ]; then
echo "=== .gitmodules content ==="
cat .gitmodules
else
echo ".gitmodules not found"
fiLength of output: 1068
Let me verify the commit with the correct repository URL and check the git diff for setup.py changes:
🏁 Script executed:
#!/bin/bash
# Verify commit with CORRECT Megatron-Bridge URL from .gitmodules
echo "=== Verifying commit b6728896fd2215da024981169f30baaee1e70e56 in correct repo ==="
git ls-remote --heads --tags https://github.com/NVIDIA-NeMo/Megatron-Bridge.git b6728896fd2215da024981169f30baaee1e70e56 2>/dev/null || echo "Commit not found as branch/tag"
echo ""
echo "=== Checking if commit exists in repository history ==="
git ls-remote https://github.com/NVIDIA-NeMo/Megatron-Bridge.git | head -5
echo ""
echo "=== Get git diff for setup.py to see what changed ==="
git diff HEAD~1 HEAD -- 3rdparty/Megatron-Bridge-workspace/setup.py 2>/dev/null || echo "Git diff not available or no prior commit"
echo ""
echo "=== Current HEAD info ==="
git log -1 --oneline 2>/dev/null || echo "Git log not available"Length of output: 557
Let me check the current state of .gitmodules and the submodule pointer to understand what's being changed:
🏁 Script executed:
#!/bin/bash
# Check current submodule state and what pointer is being set
echo "=== Current submodule commit pointers ==="
git config --file .gitmodules --get-regexp path | while read path_line; do
path=$(echo "$path_line" | awk '{print $2}')
echo "Path: $path"
if [ -d "$path/.git" ]; then
cd "$path" 2>/dev/null && git rev-parse HEAD 2>/dev/null && cd - > /dev/null
else
echo " (not initialized)"
fi
done
echo ""
echo "=== Check git ls-files for submodule information ==="
git ls-files --stage | grep 160000
echo ""
echo "=== Search for commit hash in codebase (may indicate intended change) ==="
grep -r "b6728896fd2215da024981169f30baaee1e70e56" . --exclude-dir=.git 2>/dev/null || echo "Commit hash not found in codebase"Length of output: 698
Critical: Megatron-Bridge submodule commit does not exist in remote repository.
The submodule pointer b6728896fd2215da024981169f30baaee1e70e56 cannot be resolved in https://github.com/NVIDIA-NeMo/Megatron-Bridge.git. The current HEAD in that repository is 4ab7b62b40eb0f43f51bb5a323082e6af18173de. Users attempting to clone and initialize submodules will encounter a "commit not found" error, causing build failures.
Verify that:
- The commit hash is correct
- The commit has been pushed to the Megatron-Bridge repository main branch
- Or provide the correct commit hash that exists in the repository
The dependency versions (megatron-core pre-release and transformers 4.57.1) are valid and available.
🤖 Prompt for AI Agents
In 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge around lines 1-1, the
submodule commit pointer b6728896fd2215da024981169f30baaee1e70e56 does not exist
in the remote https://github.com/NVIDIA-NeMo/Megatron-Bridge.git; update the
submodule reference to a valid pushed commit (e.g., the current remote HEAD
4ab7b62b40eb0f43f51bb5a323082e6af18173de) or replace it with the correct commit
hash that exists in the remote, then commit the updated .gitmodules/.git link
(or re-add the submodule) so users can clone and init submodules without a
"commit not found" error.
| "transformer-engine[pytorch]>=2.9.0a0,<2.10.0", | ||
| "mamba-ssm", | ||
| "nvidia-resiliency-ext", | ||
| "transformers>=4.57.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify transformers version 4.57.1 availability and qwen3-next requirements.
The dependency on transformers>=4.57.1 has been added. Please confirm:
- This version exists and is available on PyPI
- This specific version is required for qwen3-next model support
- There are no known compatibility issues with other dependencies
🌐 Web query:
What is the latest version of transformers library on PyPI?
💡 Result:
The latest version on PyPI is transformers 4.57.1 — uploaded Oct 14, 2025. [1]
Sources
[1] PyPI — transformers release files (transformers‑4.57.1).
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for qwen3-next references and version rationale in codebase
rg -i 'qwen3-next' -A 3 -B 3
Length of output: 5094
🏁 Script executed:
#!/bin/bash
# Check full setup.py to see all dependencies and constraints
cat 3rdparty/Megatron-Bridge-workspace/setup.pyLength of output: 3891
---
🏁 Script executed:
```shell
#!/bin/bash
# Check if Megatron-Bridge submodule has pyproject.toml and its transformers spec
cat 3rdparty/Megatron-Bridge/pyproject.toml 2>/dev/null | grep -A 50 dependencies
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Search for any comments or documentation about transformers version in codebase
rg 'transformers' -i --type yaml --type py -B 2 -A 2 | grep -i 'version\|4\.57\|qwen' | head -30Length of output: 2294
---
🏁 Script executed:
```shell
#!/bin/bash
# View full test_transformer_memory_regression.py to understand memory regression issue
cat tests/unit/test_transformer_memory_regression.py
Length of output: 3092
🏁 Script executed:
#!/bin/bash
# Search for skip decorators or conditionals around this test
rg 'test_transformers_version_memory_regression' -B 10 -A 15Length of output: 2446
🏁 Script executed:
#!/bin/bash
# Check for any recent changes or comments about transformers version bump
git log --oneline --all -S 'transformers>=4.57.1' 2>/dev/null | head -5 || echo "Git log not available"Length of output: 58
Remove test_transformer_memory_regression.py and reinstate disabled nightly test per the documented cleanup instructions.
The transformers>=4.57.1 constraint in setup.py (added for qwen3-next support) will immediately fail tests/unit/test_transformer_memory_regression.py, which enforces version [4.54, 4.56). This is actually good news—the memory regression is fixed in 4.57.1—but the cleanup steps documented in the test must be completed:
- Delete
tests/unit/test_transformer_memory_regression.py - Reinstate the disabled nightly test:
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh - Update GitHub issue: Dtensor memory regression with TP and long sequence length present in
transformers>=4.54,<4.56#1343
🤖 Prompt for AI Agents
In 3rdparty/Megatron-Bridge-workspace/setup.py around line 45, the bump to
"transformers>=4.57.1" makes the existing unit test
test_transformer_memory_regression.py invalid; follow the documented cleanup:
delete tests/unit/test_transformer_memory_regression.py, re-enable the
previously disabled nightly test
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh
(restore it to the test matrix), and update the referenced GitHub issue
https://github.com/NVIDIA-NeMo/RL/issues/1343 to reflect that the regression is
fixed and cleanup completed.
| megatron_cfg: | ||
| enabled: true | ||
| converter_type: Qwen3NextForCausalLM | ||
| pipeline_model_parallel_size: 2 | ||
| expert_model_parallel_size: 8 | ||
| optimizer: | ||
| lr: 2.0e-05 | ||
| min_lr: 1.99999e-05 | ||
| weight_decay: 0.01 | ||
| bf16: true | ||
| scheduler: | ||
| lr_warmup_init: 1.9999e-65 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double-check lr_warmup_init magnitude
lr: 2.0e-05, min_lr: 1.99999e-05, but scheduler.lr_warmup_init: 1.9999e-65 is extremely close to zero and looks like it might be a -05 vs -65 typo. That won’t break anything but will effectively start from ~0 lr during warmup.
If this wasn’t intentional (e.g., you meant 1.9999e-05 or similar), it’s worth correcting now to avoid confusing future readers.
🤖 Prompt for AI Agents
In examples/configs/recipes/llm/sft-qwen3-next-80ba3b-8n8g-megatron.yaml around
lines 19 to 30, the scheduler.lr_warmup_init value is set to 1.9999e-65 which is
extremely close to zero and likely a typo (should probably be 1.9999e-05);
update lr_warmup_init to the intended magnitude (e.g., 1.9999e-05 or another
value consistent with lr/min_lr) so warmup starts at the correct learning rate
and avoid confusing future readers.
| average_in_collective=self.cfg["megatron_cfg"][ | ||
| "distributed_data_parallel_config" | ||
| ]["average_in_collective"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restore fallback for average_in_collective.
Directly indexing ["average_in_collective"] will raise a KeyError for any existing config (or user override) that hasn’t been updated with the new flag. That regresses previously working setups. Please retain a default (e.g., False) when the key is absent.
- average_in_collective=self.cfg["megatron_cfg"][
- "distributed_data_parallel_config"
- ]["average_in_collective"],
+ average_in_collective=self.cfg["megatron_cfg"][
+ "distributed_data_parallel_config"
+ ].get("average_in_collective", False),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| average_in_collective=self.cfg["megatron_cfg"][ | |
| "distributed_data_parallel_config" | |
| ]["average_in_collective"], | |
| average_in_collective=self.cfg["megatron_cfg"][ | |
| "distributed_data_parallel_config" | |
| ].get("average_in_collective", False), |
🤖 Prompt for AI Agents
In nemo_rl/models/policy/megatron_policy_worker.py around lines 694 to 696, the
code directly indexes ["average_in_collective"] which will raise KeyError if the
key is missing; update the access to use a safe lookup with a default (e.g., use
.get on the parent dict and then .get("average_in_collective", False) or
equivalent) so that when the key is absent the value falls back to False instead
of raising an exception.
| "grad_norm": torch.tensor( | ||
| mb_metrics["grad_norm"][-1] | ||
| ).cpu(), # TODO @sahilj: return an average or something later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard against empty grad-norm history.
When num_global_batches evaluates to zero (e.g., final partial batch smaller than gbs), mb_metrics["grad_norm"] stays empty and the new [-1] access throws. Please fall back to 0.0 (or skip emitting the metric) when the list is empty.
- "grad_norm": torch.tensor(
- mb_metrics["grad_norm"][-1]
- ).cpu(), # TODO @sahilj: return an average or something later
+ "grad_norm": torch.tensor(
+ (mb_metrics.get("grad_norm") or [0.0])[-1]
+ ).cpu(), # TODO @sahilj: return an average or something later📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "grad_norm": torch.tensor( | |
| mb_metrics["grad_norm"][-1] | |
| ).cpu(), # TODO @sahilj: return an average or something later | |
| "grad_norm": torch.tensor( | |
| (mb_metrics.get("grad_norm") or [0.0])[-1] | |
| ).cpu(), # TODO @sahilj: return an average or something later |
🤖 Prompt for AI Agents
In nemo_rl/models/policy/megatron_policy_worker.py around lines 1142 to 1144,
guard against an empty mb_metrics["grad_norm"] list before indexing [-1]; if the
list is non-empty use the last value, otherwise use 0.0 as the fallback (convert
to a tensor and .cpu() as before) so the code does not throw when
num_global_batches == 0.
| Only run metrics if the target step is reached | ||
| if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then | ||
| uv run tests/check_metrics.py $JSON_METRICS \ | ||
| 'data["train/token_mult_prob_error"]["30"] < 1.1' \ | ||
| 'data["train/reward"]["30"] > 0.43' \ | ||
| 'mean(data["timing/train/total_step_time"], -6, -1) < 305' | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix missing comment marker before metrics gate
Line 34 (Only run metrics if the target step is reached) is missing # and will be executed as a command, causing the script to fail.
Recommend:
-Only run metrics if the target step is reached
+# Only run metrics if the target step is reached📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Only run metrics if the target step is reached | |
| if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then | |
| uv run tests/check_metrics.py $JSON_METRICS \ | |
| 'data["train/token_mult_prob_error"]["30"] < 1.1' \ | |
| 'data["train/reward"]["30"] > 0.43' \ | |
| 'mean(data["timing/train/total_step_time"], -6, -1) < 305' | |
| fi | |
| # Only run metrics if the target step is reached | |
| if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then | |
| uv run tests/check_metrics.py $JSON_METRICS \ | |
| 'data["train/token_mult_prob_error"]["30"] < 1.1' \ | |
| 'data["train/reward"]["30"] > 0.43' \ | |
| 'mean(data["timing/train/total_step_time"], -6, -1) < 305' | |
| fi |
🤖 Prompt for AI Agents
In tests/test_suites/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.sh around lines 34
to 40 the explanatory line "Only run metrics if the target step is reached" is
missing a leading '#' so the shell will try to execute it; edit the file to
prefix that line with '#' to make it a comment (or remove the line entirely) so
the subsequent if-statement runs correctly and the script no longer attempts to
execute the text as a command.
Signed-off-by: Ahmad Kiswani <[email protected]>
ae5d7ac to
8a16bb5
Compare
❌ Submodule Fast-Forward Check FailedCheck based on commit: 8a16bb5 (PR #1530 from ✅ Submodules that are properly updated:Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward) ❌ Submodules that need attention:Automodel: ❌ Commits have DIVERGED from a common ancestor Please ensure all submodule commits are fast-forwards of the main branch before merging. |
Signed-off-by: Ahmad Kiswani <[email protected]>
|
With terrykong/Megatron-LM@0d40160 and after #1541 is merged, I'll rebase this PR which should reduce it to just configs and tests. |
What does this PR do ?
Add qwen3-next support, recipes and tests to mcore path.
Issues
#1431
Usage
SFT:
uv run --extra mcore --extra vllm --extra automodel python examples/run_sft.py \ --config examples/configs/recipes/llm/sft-qwen3-next-80ba3b-8n8g-megatron.yaml \GRPO:
uv run --extra mcore --extra vllm python examples/run_grpo_math.py \ --config examples/configs/recipes/llm/grpo-qwen3-next-80ba3b-8n8g-megatron.yamlBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Configuration Enhancements
Tests