
Conversation


@nv-mmanohara nv-mmanohara commented Nov 4, 2025

What does this PR do ?

Adds GRPO training support for HelpSteer3 on Llama-Nemotron Super 49B.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GRPO training recipes for HelpSteer3 with Llama-3.2-1B and Nemotron-Super-49B models
    • Added SFT training recipes for Nemotron-Super-49B, including Tulu3 dataset support
    • Added HelpSteer3 environment for distributed model response verification
    • Added example training scripts for GRPO workflows with HelpSteer3
    • Added Tulu3 preference and SFT mixture dataset support
  • Bug Fixes

    • Fixed logging configuration to gracefully handle missing optional settings

@nv-mmanohara nv-mmanohara requested review from a team as code owners November 4, 2025 21:37

coderabbitai bot commented Nov 4, 2025

📝 Walkthrough

Walkthrough

This PR introduces comprehensive support for GRPO training and SFT fine-tuning workflows. It adds configuration files for Llama-3.2 and Llama-3.3-Nemotron models with HelpSteer3 reward verification, implements new dataset classes for Tulu3 preference and SFT mixture tasks, and provides a distributed HelpSteer3 environment with verification workers. A new example script orchestrates GRPO training with data setup and configuration loading.

Changes

Cohort / File(s) Summary
GRPO Training Configurations
examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml, examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
New YAML configs defining GRPO hyperparameters, loss functions, checkpointing, policy settings (DTensor, Megatron), optimizers, generation backends (vLLM), data handling, logging (WandB, TensorBoard, MLFlow), and cluster resource allocation for HelpSteer3-based training.
SFT Training Configurations
examples/configs/sft_nemotron_super_49b.yaml, examples/configs/sft_nemotron_super_49b_tulu_v3.yaml
New YAML configs for SFT training on Nemotron-49B and Tulu3-SFT-Mixture datasets with training loop controls, checkpointing, policy settings, distributed parallelism options, and logging backends.
GRPO Training Orchestration
examples/run_grpo_helpsteer3.py
New example script implementing CLI argument parsing, HelpSteer3 data processor (converts preference data to DatumSpec), data setup function (creates datasets and environments), and main orchestration calling grpo_train.
Script Updates
examples/run_sft.py, examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
SFT runner adds "max" OmegaConf resolver; custom plan script refactors static dictionary into computed get_custom_parallel_plan() function.
Tulu3 Dataset Implementations
nemo_rl/data/datasets/response_datasets/tulu3.py
New dataset classes: Tulu3PreferenceDataset and Tulu3SftMixtureDataset with data formatting utilities (to_preference_data_format, format_tulu3_sft_mixture) including train/validation splits and input validation.
Response Dataset Registry
nemo_rl/data/datasets/response_datasets/__init__.py
Adds import and load branch for Tulu3SftMixtureDataset with configurable test_size, prompt_file, and max_samples parameters.
HelpSteer3 Environment
nemo_rl/environments/helpsteer3_environment.py
New Ray remote environment with distributed HelpSteer3VerifyWorker pool for parallel response verification, score calculation combining exact-match/Jaccard similarity/length penalty, and post-processing metrics.
Infrastructure Updates
nemo_rl/distributed/ray_actor_environment_registry.py, nemo_rl/data/datasets/preference_datasets/helpsteer3.py, nemo_rl/utils/logger.py
Registers HelpSteer3Environment in actor registry; adds task_name field to HelpSteer3 preference data format; makes swanlab_enabled logger flag defensive against missing keys.
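The parallel verification described for the HelpSteer3 environment relies on splitting a batch of responses across a fixed worker pool. The helper name `chunk_list_to_workers` matches the utility referenced in the code-graph analysis below, but its exact signature is not shown in this review, so the following is an illustrative sketch of the chunking behavior, not the repository's implementation:

```python
def chunk_list_to_workers(items: list, num_workers: int) -> list[list]:
    """Split items into at most num_workers contiguous chunks of near-equal size."""
    chunks = []
    base, extra = divmod(len(items), num_workers)
    start = 0
    for i in range(num_workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # More workers than items: skip idle workers.
        chunks.append(items[start:start + size])
        start += size
    return chunks

# 7 responses across 3 workers -> chunk sizes 3, 2, 2
print(chunk_list_to_workers(list(range(7)), 3))  # → [[0, 1, 2], [3, 4], [5, 6]]
```

Each chunk would then be dispatched to one Ray remote worker, and the per-chunk scores concatenated back in order.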

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Script
    participant GRPOOrch as GRPO Orchestrator
    participant DataSetup as Data Setup
    participant Tokenizer as Tokenizer
    participant HelpSteer3Env as HelpSteer3 Environment
    participant Workers as Verify Workers
    
    User->>GRPOOrch: invoke main()
    GRPOOrch->>GRPOOrch: load config & CLI overrides
    GRPOOrch->>DataSetup: setup_data(tokenizer, config)
    
    DataSetup->>Tokenizer: initialize tokenizer
    DataSetup->>DataSetup: load HelpSteer3 preference dataset
    
    rect rgb(200, 220, 240)
        Note over DataSetup,HelpSteer3Env: Data Processing
        DataSetup->>DataSetup: helpsteer3_data_processor per datum<br/>(build chat, tokenize, compute ground_truth)
        DataSetup->>DataSetup: create AllTaskProcessedDataset<br/>(train + validation)
    end
    
    DataSetup->>HelpSteer3Env: initialize with config<br/>(num_workers, stop_strings)
    HelpSteer3Env->>Workers: spawn Ray remote workers<br/>(via SYSTEM Python)
    
    DataSetup-->>GRPOOrch: return datasets, environments, tokenizer
    
    rect rgb(240, 220, 200)
        Note over GRPOOrch,Workers: Training Loop
        GRPOOrch->>HelpSteer3Env: step(message_logs, metadata)
        HelpSteer3Env->>Workers: distribute response verification
        Workers->>Workers: verify & calculate scores<br/>(exact-match + Jaccard + length)
        Workers-->>HelpSteer3Env: scores & extracted answers
        HelpSteer3Env->>HelpSteer3Env: aggregate results, compute metrics
        HelpSteer3Env-->>GRPOOrch: EnvironmentReturn (rewards, observations)
    end
    
    GRPOOrch->>GRPOOrch: invoke grpo_train(policy, dataloaders, environments)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • HelpSteer3Environment implementation: Distributed Ray-based verification with custom scoring logic combining multiple heuristics; requires careful review of correctness of similarity calculations and worker pool management.
  • GRPO orchestration script: Complex data setup flow with multiple preprocessing steps, task spec construction, and environment initialization; verify data processor handles all edge cases and token truncation correctly.
  • Tulu3 dataset classes: Multiple validation checks and assertions across preference/SFT formatting; ensure invariants (last message from assistant, context consistency) are enforced correctly.
  • Heterogeneous file scope: Changes span configuration files, orchestration logic, dataset implementations, environment classes, and utility updates, requiring context-switching across domains.
  • New public API surface: Multiple exported classes and functions (HelpSteer3Environment, Tulu3SftMixtureDataset, helpsteer3_data_processor) that become dependencies for downstream GRPO workflows.
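The scoring logic flagged above combines exact-match, Jaccard similarity, and a length penalty. The actual weighting in `helpsteer3_environment.py` is not reproduced in this review, so the following is only a plausible sketch of how such a combination might look; the function names and the `length_penalty` coefficient are assumptions for illustration:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-tokenized word sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # Two empty responses are trivially identical.
    return len(sa & sb) / len(sa | sb)

def score_response(response: str, ground_truth: str, length_penalty: float = 0.001) -> float:
    """Illustrative combined score: exact match short-circuits to 1.0;
    otherwise Jaccard similarity minus a penalty proportional to the
    word-count gap, clamped to be non-negative."""
    if response.strip() == ground_truth.strip():
        return 1.0
    sim = jaccard_similarity(response, ground_truth)
    penalty = length_penalty * abs(len(response.split()) - len(ground_truth.split()))
    return max(0.0, sim - penalty)
```

Reviewing the real implementation against edge cases like empty responses and zero-length ground truths is exactly the "correctness of similarity calculations" concern raised above.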

Possibly related PRs

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
  • ashors1

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Test Results For Major Changes: ⚠️ Warning. PR introduces major GRPO/HelpSteer3 features but lacks test results documentation. Multiple bugs identified in review comments indicate inadequate testing before submission. Resolution: add comprehensive test results for new components, fix identified bugs in data handling and configuration, and document validation before merging.
  • Title check: ❓ Inconclusive. The title is vague and uses abbreviated technical jargon without clearly conveying the main change; 'grpo helpsteer cp tp' lacks descriptive context about what is being added or merged. Resolution: replace with a clear, descriptive title such as 'Add GRPO training support for HelpSteer3 with Llama-Nemotron-49B'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/utils/logger.py (1)

818-822: Configuration default violates coding guidelines.

Using cfg.get("swanlab_enabled", False) introduces a hidden default in code, violating the project's configuration guidelines. The TypedDict LoggerConfig at line 78 declares swanlab_enabled: bool without NotRequired, meaning it should be accessed directly as cfg["swanlab_enabled"].

As per coding guidelines:

  • "Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults"
  • "Express configuration optionality via TypedDict using typing.NotRequired"

Apply this fix:

-        if cfg.get("swanlab_enabled", False):
+        if cfg["swanlab_enabled"]:

And update the TypedDict at line 78:

-    swanlab_enabled: bool
+    swanlab_enabled: NotRequired[bool]

Then document the recommended default (e.g., false) in exemplar YAML configs.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8762f57 and 25ef396.

📒 Files selected for processing (13)
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml (1 hunks)
  • examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py (1 hunks)
  • examples/configs/sft_nemotron_super_49b.yaml (1 hunks)
  • examples/configs/sft_nemotron_super_49b_tulu_v3.yaml (1 hunks)
  • examples/run_grpo_helpsteer3.py (1 hunks)
  • examples/run_sft.py (1 hunks)
  • nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1 hunks)
  • nemo_rl/data/datasets/response_datasets/__init__.py (3 hunks)
  • nemo_rl/data/datasets/response_datasets/tulu3.py (1 hunks)
  • nemo_rl/distributed/ray_actor_environment_registry.py (1 hunks)
  • nemo_rl/environments/helpsteer3_environment.py (1 hunks)
  • nemo_rl/utils/logger.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults.
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml.

Files:

  • examples/configs/sft_nemotron_super_49b_tulu_v3.yaml
  • examples/configs/sft_nemotron_super_49b.yaml
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes/<domain>/

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
  • nemo_rl/utils/logger.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • examples/run_sft.py
  • nemo_rl/data/datasets/preference_datasets/helpsteer3.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/tulu3.py
  • nemo_rl/environments/helpsteer3_environment.py
  • examples/run_grpo_helpsteer3.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/utils/logger.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • nemo_rl/data/datasets/preference_datasets/helpsteer3.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/tulu3.py
  • nemo_rl/environments/helpsteer3_environment.py
🧠 Learnings (4)
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Applied to files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to **/*.py : Target Python 3.12+ for all Python code in NeMo-RL

Applied to files:

  • nemo_rl/distributed/ray_actor_environment_registry.py
🧬 Code graph analysis (6)
nemo_rl/distributed/ray_actor_environment_registry.py (1)
nemo_rl/distributed/virtual_cluster.py (1)
  • PY_EXECUTABLES (43-59)
examples/run_sft.py (1)
tests/check_metrics.py (1)
  • max (30-32)
nemo_rl/data/datasets/response_datasets/__init__.py (1)
nemo_rl/data/datasets/response_datasets/tulu3.py (1)
  • Tulu3SftMixtureDataset (95-148)
nemo_rl/data/datasets/response_datasets/tulu3.py (2)
nemo_rl/data/interfaces.py (1)
  • TaskDataSpec (53-86)
nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1)
  • to_preference_data_format (22-55)
nemo_rl/environments/helpsteer3_environment.py (4)
nemo_rl/distributed/batched_data_dict.py (2)
  • BatchedDataDict (75-860)
  • chunk (199-235)
nemo_rl/distributed/virtual_cluster.py (1)
  • PY_EXECUTABLES (43-59)
nemo_rl/environments/interfaces.py (2)
  • EnvironmentInterface (52-88)
  • EnvironmentReturn (26-49)
nemo_rl/environments/utils.py (1)
  • chunk_list_to_workers (17-61)
examples/run_grpo_helpsteer3.py (10)
nemo_rl/algorithms/utils.py (1)
  • get_tokenizer (157-288)
nemo_rl/data/__init__.py (1)
  • DataConfig (18-40)
nemo_rl/data/datasets/processed_dataset.py (1)
  • AllTaskProcessedDataset (31-126)
nemo_rl/data/datasets/preference_datasets/__init__.py (1)
  • load_preference_dataset (25-78)
nemo_rl/data/interfaces.py (3)
  • DatumSpec (32-40)
  • TaskDataProcessFnCallable (89-100)
  • TaskDataSpec (53-86)
nemo_rl/distributed/ray_actor_environment_registry.py (1)
  • get_actor_python_env (50-65)
nemo_rl/distributed/virtual_cluster.py (1)
  • init_ray (85-171)
nemo_rl/environments/helpsteer3_environment.py (1)
  • HelpSteer3Environment (141-259)
nemo_rl/models/generation/__init__.py (1)
  • configure_generation_config (24-47)
nemo_rl/utils/logger.py (1)
  • get_next_experiment_dir (1311-1345)
🪛 Ruff (0.14.3)
nemo_rl/data/datasets/response_datasets/tulu3.py

87-87: Avoid specifying long messages outside the exception class

(TRY003)

nemo_rl/environments/helpsteer3_environment.py

70-70: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


81-81: Do not catch blind exception: Exception

(BLE001)


201-201: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

examples/run_grpo_helpsteer3.py

164-164: Unused function argument: seed

(ARG001)


208-208: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)


278-278: Unpacked variable cluster is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (6)
examples/run_sft.py (1)

35-35: LGTM!

The max resolver is correctly registered and used by the new YAML configs (e.g., sft_nemotron_super_49b_tulu_v3.yaml at line 57) for computing make_sequence_length_divisible_by.

examples/configs/sft_nemotron_super_49b_tulu_v3.yaml (1)

1-115: LGTM!

The configuration follows established patterns and correctly uses the max OmegaConf resolver for computing sequence length constraints. The structure aligns well with related Nemotron/GRPO configurations introduced in this PR.

examples/configs/sft_nemotron_super_49b.yaml (1)

1-134: LGTM!

The configuration is well-structured with helpful commented examples for alternative models and datasets. The use of the max resolver and the distributed training configurations align with the patterns established in related configs.

nemo_rl/data/datasets/response_datasets/__init__.py (1)

97-104: LGTM!

The integration of Tulu3SftMixtureDataset follows the established pattern for dataset loading, with appropriate parameter forwarding and console logging.

nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1)

54-54: LGTM!

Adding the task_name field provides useful task identification for GRPO workflows without altering the existing preference data selection logic.

nemo_rl/distributed/ray_actor_environment_registry.py (1)

39-39: LGTM!

The registry mapping for HelpSteer3Environment correctly uses PY_EXECUTABLES.SYSTEM, consistent with other environment actors that don't require special dependencies.

Comment on lines 189 to 191
cluster:
gpus_per_node: 8
num_nodes: 16

⚠️ Potential issue | 🟠 Major

Fix cluster node count mismatch.

The recipe advertises 4n8g, but the cluster.num_nodes value is hard-coded to 16. This will over-request hardware and breaks the naming contract for runnable snapshots. Please drop num_nodes back to 4 (or rename the file/config) so the declared topology matches the actual request.

 cluster:
   gpus_per_node: 8
-  num_nodes: 16
+  num_nodes: 4
🤖 Prompt for AI Agents
In
examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
around lines 189–191, the cluster topology is inconsistent with the
filename/advertised topology: gpus_per_node: 8 and num_nodes: 16 (file
advertises 4n8g). Change cluster.num_nodes to 4 to match the 4-node,
8-GPU-per-node topology (or alternatively rename the file/config to reflect 16
nodes), ensuring the declared topology matches the actual resource request and
naming contract.
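A lightweight way to catch this class of mismatch mechanically is to parse the `<nodes>n<gpus>g` token out of the recipe filename and compare it against the cluster block. This is a hypothetical lint check, not code from the PR:

```python
import re

def topology_matches(filename: str, cluster: dict) -> bool:
    """Return True if the '<N>n<G>g' token in a recipe filename matches the cluster config."""
    m = re.search(r"-(\d+)n(\d+)g-", filename)
    if m is None:
        return True  # No topology token in the name: nothing to validate.
    nodes, gpus = int(m.group(1)), int(m.group(2))
    return cluster["num_nodes"] == nodes and cluster["gpus_per_node"] == gpus

name = "grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml"
print(topology_matches(name, {"num_nodes": 16, "gpus_per_node": 8}))  # → False (the bug above)
print(topology_matches(name, {"num_nodes": 4, "gpus_per_node": 8}))   # → True
```

Run over every file in examples/configs/recipes/llm/, a check like this would have flagged the 4n8g-vs-16-node discrepancy before review.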

Comment on lines 54 to 56
if False:
# Enable sequence parallelism only if TP size > 1
base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))

⚠️ Potential issue | 🟡 Minor

Dead code requires clarification or removal.

The if False: block will never execute. Per coding guidelines: commented-out or dead code must include a nearby comment explaining usage and why it's disabled, otherwise it should be removed before merging.

If sequence parallelism support is planned for future work, add a clear TODO comment explaining the intent and conditions for enabling it. Otherwise, remove this block.

If keeping for future work:

+    # TODO: Enable sequence parallelism when TP size > 1
+    # Requires configuration plumbing to pass TP size to this function
     if False:
-        # Enable sequence parallelism only if TP size > 1
         base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))

Or remove if not planned:

-    if False:
-        # Enable sequence parallelism only if TP size > 1
-        base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))
-
🤖 Prompt for AI Agents
In examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py around
lines 54 to 56, there is a dead `if False:` block that never runs; either remove
the block entirely or replace it with a clear TODO comment explaining that
sequence-parallelism support is planned for future work, under what conditions
it should be enabled (e.g., when tensor-parallel size > 1 and sequence
parallelism implemented), and include a short example or link to the tracking
ticket; if you remove it, delete the three lines and any unused imports or
casts; if you keep it, change `if False:` to a commented TODO and keep the
explanatory note and required enabling conditions.
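If the block is kept, one way to plumb the missing condition is to make the plan a function of the tensor-parallel size, replacing `if False:` with a real predicate. This is a hypothetical refactor with placeholder plan entries, not the PR's actual `get_custom_parallel_plan`:

```python
def get_custom_parallel_plan(tp_size: int) -> dict[str, str]:
    """Build a TP plan, layering sequence-parallel entries only when TP size > 1."""
    # Placeholder entries; the real plan maps module patterns to ParallelStyle objects.
    base_model_tp_plan = {"layers.*.mlp.up_proj": "colwise"}
    base_model_sp_plan = {"layers.*.input_layernorm": "sequence_parallel"}
    if tp_size > 1:
        # Sequence parallelism only makes sense when the model is tensor-sharded.
        base_model_tp_plan.update(base_model_sp_plan)
    return base_model_tp_plan

print(get_custom_parallel_plan(1))  # TP entries only
print(get_custom_parallel_plan(8))  # TP entries plus SP entries
```

The caller would pass the configured TP size through, which is the "configuration plumbing" the review comment refers to.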

Comment on lines +131 to +147
for i in range(1, len(message_log)):
message_log[i]["token_ids"] = tokenizer("", return_tensors="pt")["input_ids"][0] # Empty tensor

length = sum(len(m["token_ids"]) for m in message_log)

# Create ground truth from the preferred completion for environment evaluation
ground_truth = " ".join([msg["content"] for msg in preferred_completion])
extra_env_info = {"ground_truth": ground_truth}

loss_multiplier = 1.0
if length > max_seq_length:
# Truncate if too long
for chat_message in message_log:
chat_message["token_ids"] = chat_message["token_ids"][
: min(max_seq_length // len(message_log), len(chat_message["token_ids"]))
]
loss_multiplier = 0.1 # Reduce loss for truncated sequences

⚠️ Potential issue | 🔴 Critical

Fix message token clearing and truncation math.

Two problems here corrupt the training data:

  1. Clearing per-message tokens via tokenizer("") still yields a BOS token, so every non-head message contributes stray tokens to length.
  2. When length > max_seq_length, slicing each message to max_seq_length // len(message_log) throws away most of the conversation and leaves length unrecomputed, so downstream batching still treats the sample as over-length.

Please zero the non-head tensors with new_empty(0) (or similar) and, on truncation, only clip the head tensor to max_seq_length, then recompute length to reflect the actual token count.

-    for i in range(1, len(message_log)):
-        message_log[i]["token_ids"] = tokenizer("", return_tensors="pt")["input_ids"][0]  # Empty tensor
-
-    length = sum(len(m["token_ids"]) for m in message_log)
-
-    if length > max_seq_length:
-        # Truncate if too long
-        for chat_message in message_log:
-            chat_message["token_ids"] = chat_message["token_ids"][
-                : min(max_seq_length // len(message_log), len(chat_message["token_ids"]))
-            ]
-        loss_multiplier = 0.1  # Reduce loss for truncated sequences
+    empty_tokens = message_log[0]["token_ids"].new_empty((0,))
+    for i in range(1, len(message_log)):
+        message_log[i]["token_ids"] = empty_tokens
+
+    length = sum(len(m["token_ids"]) for m in message_log)
+
+    if max_seq_length is not None and length > max_seq_length:
+        message_log[0]["token_ids"] = message_log[0]["token_ids"][:max_seq_length]
+        loss_multiplier = 0.1  # Reduce loss for truncated sequences
+        length = sum(len(m["token_ids"]) for m in message_log)
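The truncation problem is easiest to see with plain lists of token counts standing in for tensors. In this sketch (illustrative helpers, not the PR's code), slicing every message to `max_seq_length // num_messages` discards far more than the budget requires, while clipping only the head message, as the diff above suggests, uses the full window:

```python
def truncate_per_message(message_lens: list[int], max_seq_length: int) -> list[int]:
    """Buggy behavior: every message is clipped to an equal share of the budget."""
    share = max_seq_length // len(message_lens)
    return [min(share, n) for n in message_lens]

def truncate_head_only(message_lens: list[int], max_seq_length: int) -> list[int]:
    """Suggested behavior: non-head messages are already empty; clip only the head."""
    head, *rest = message_lens
    return [min(head, max_seq_length), *rest]

# Head message has 100 tokens; the two cleared messages have 0; budget is 80.
lens = [100, 0, 0]
print(sum(truncate_per_message(lens, 80)))  # → 26: budget // 3 wastes most of the window
print(sum(truncate_head_only(lens, 80)))    # → 80: the full budget is used
```

Recomputing `length` after truncation matters for the same reason: downstream batching decisions read the stale value otherwise.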

Comment on lines 52 to 64
return {
"context": context,
"completions": [
{
"rank": 0,
"completion": [{"role": "assistant", "content": chosen_response}],
},
{
"rank": 1,
"completion": [{"role": "assistant", "content": rejected_response}],
},
],
}

⚠️ Potential issue | 🔴 Critical

Include task_name in preference records.

Every entry returned by to_preference_data_format needs a task_name so AllTaskProcessedDataset can resolve the correct processor. Right now the map in Tulu3PreferenceDataset will hit the assertion (task processor not provided for ...) because the key is missing entirely. Add the "task_name": "Tulu3Preference" field to each record.

     return {
         "context": context,
         "completions": [
             {
                 "rank": 0,
                 "completion": [{"role": "assistant", "content": chosen_response}],
             },
             {
                 "rank": 1,
                 "completion": [{"role": "assistant", "content": rejected_response}],
             },
         ],
+        "task_name": "Tulu3Preference",
     }
🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/tulu3.py around lines 52 to 64, the
preference record dicts returned by to_preference_data_format are missing the
required "task_name" key; add the field "task_name": "Tulu3Preference" to every
top-level preference record (the dict that currently contains "context" and
"completions") so AllTaskProcessedDataset can resolve the processor; ensure the
key/value is included alongside "context" and "completions" in both returned
records.

Comment on lines +848 to 851
if cfg.get("swanlab_enabled", False) and self.swanlab_logger:
self.swanlab_logger.define_metric(
f"{metric_prefix}/*", step_metric=step_metric
)
Contributor

⚠️ Potential issue | 🟠 Major

Same configuration default issue.

Same violation as line 818: cfg.get("swanlab_enabled", False) should be cfg["swanlab_enabled"] after marking the field as NotRequired[bool] in the TypedDict and documenting the default in YAML configs.

Apply this fix:

-            if cfg.get("swanlab_enabled", False) and self.swanlab_logger:
+            if cfg["swanlab_enabled"] and self.swanlab_logger:

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In nemo_rl/utils/logger.py around lines 848 to 851, change the
cfg.get("swanlab_enabled", False) usage to direct indexing
cfg["swanlab_enabled"] (remove the False default) because the TypedDict field
has been marked NotRequired[bool] and the default is documented in YAML; update
the line to use cfg["swanlab_enabled"] and keep the existing self.swanlab_logger
check unchanged so the code relies on the documented default rather than a local
fallback.
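For context on the runtime tradeoff behind this suggestion, a minimal standalone sketch (plain dicts standing in for the real LoggerConfig TypedDict; keys and values are illustrative): `.get` with a local default silently masks a missing key, while direct indexing assumes the key is always populated, e.g. by a documented YAML default, and fails loudly otherwise:

```python
# Plain dicts standing in for the real LoggerConfig; hypothetical keys/values.
cfg_complete = {"wandb_enabled": True, "swanlab_enabled": False}  # default filled in from YAML
cfg_missing = {"wandb_enabled": True}  # "swanlab_enabled" absent at runtime

# A local fallback hides the fact that the key was never set:
assert cfg_missing.get("swanlab_enabled", False) is False

# Direct indexing works when the YAML default guarantees the key...
assert cfg_complete["swanlab_enabled"] is False

# ...and surfaces config drift immediately when it does not:
try:
    cfg_missing["swanlab_enabled"]
except KeyError:
    print("direct indexing raises KeyError on a missing key")
```

This is why the review pairs the indexing change with documenting the default in the YAML configs: the loud failure is only acceptable once the key is guaranteed present.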

Contributor

@yuki-97 yuki-97 left a comment

Thanks for the contribution @nv-mmanohara! I left some comments; please ping me if you need any help.

@@ -12,38 +12,73 @@
# See the License for the specific language governing permissions and
Contributor

I just realized this file and examples/custom_parallel.py would be better placed somewhere else, e.g., examples/custom_parallel_plan/.

Can you help move them, or do you mind if I do this directly on your branch?

@@ -0,0 +1,134 @@
# SFT Algorithm Configuration
Contributor

I think this file and the other SFT yaml would be better moved under examples/configs/recipes/llm and renamed to the same format, e.g. grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml. Can you help move them?

@@ -0,0 +1,305 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor

How is this file different from examples/run_grpo_math.py? If the changes are minor, can we just update that file instead of creating a new one?

Author

I agree that the code is redundant, but many configurations are hardcoded and each dataset needs a specific preprocessing strategy, so we'll probably end up duplicating files for new datasets anyway.

}


class Tulu3PreferenceDataset:
Contributor

Are Tulu3PreferenceDataset and to_preference_data_format in this file actually used? Can we remove them if not?

self.task_spec = TaskDataSpec(
task_name="Tulu3SftMixture",
prompt_file=prompt_file,
)
\ No newline at end of file
Contributor

nit: typo

def __init__(self) -> None:
pass

def verify(
Contributor

Just curious, how is this environment different from the existing environment? Could we adapt the existing environment instead of adding a new one?

Author

Same as the case for the run_grpo_helpsteer3.py file.

self.loggers.append(self.wandb_logger)

-        if cfg["swanlab_enabled"]:
+        if cfg.get("swanlab_enabled", False):
Contributor

This change is no longer needed after inheriting the yaml; the same applies to the one below.
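For illustration, a minimal sketch of what yaml inheritance buys here. The keys and the simplified deep_merge are hypothetical stand-ins for the actual config loader: the recipe only overrides what differs, so defaults like swanlab_enabled always arrive from the base config and the code needs no `.get()` fallback:

```python
# Hypothetical base/recipe configs; a simplified stand-in for yaml inheritance.
base = {
    "logger": {"wandb_enabled": False, "swanlab_enabled": False},
}
recipe = {
    "logger": {"wandb_enabled": True},  # only the delta from the base
}


def deep_merge(base_cfg: dict, overrides: dict) -> dict:
    # Recursively overlay overrides on top of the base config.
    merged = dict(base_cfg)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


cfg = deep_merge(base, recipe)
print(cfg["logger"])  # {'wandb_enabled': True, 'swanlab_enabled': False}
```

With the merged config, every documented key is guaranteed present at runtime, which is what makes the direct-indexing style above safe.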


2 participants