
Conversation


@nv-mmanohara nv-mmanohara commented Nov 4, 2025

What does this PR do ?

Adds GRPO training support for HelpSteer3 on Llama-Nemotron Super 49B.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GRPO training recipes for HelpSteer3 with Llama-3.2-1B and Nemotron-Super-49B models
    • Added SFT training recipes for Nemotron-Super-49B, including Tulu3 dataset support
    • Added HelpSteer3 environment for distributed model response verification
    • Added example training scripts for GRPO workflows with HelpSteer3
    • Added Tulu3 preference and SFT mixture dataset support
  • Bug Fixes

    • Fixed logging configuration to gracefully handle missing optional settings

@nv-mmanohara nv-mmanohara requested review from a team as code owners November 4, 2025 21:37

coderabbitai bot commented Nov 4, 2025

📝 Walkthrough

Walkthrough

This PR introduces comprehensive support for GRPO training and SFT fine-tuning workflows. It adds configuration files for Llama-3.2 and Llama-3.3-Nemotron models with HelpSteer3 reward verification, implements new dataset classes for Tulu3 preference and SFT mixture tasks, and provides a distributed HelpSteer3 environment with verification workers. A new example script orchestrates GRPO training with data setup and configuration loading.

Changes

Cohort / File(s) Summary
GRPO Training Configurations
examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml, examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
New YAML configs defining GRPO hyperparameters, loss functions, checkpointing, policy settings (DTensor, Megatron), optimizers, generation backends (vLLM), data handling, logging (WandB, TensorBoard, MLFlow), and cluster resource allocation for HelpSteer3-based training.
SFT Training Configurations
examples/configs/sft_nemotron_super_49b.yaml, examples/configs/sft_nemotron_super_49b_tulu_v3.yaml
New YAML configs for SFT training on Nemotron-49B and Tulu3-SFT-Mixture datasets with training loop controls, checkpointing, policy settings, distributed parallelism options, and logging backends.
GRPO Training Orchestration
examples/run_grpo_helpsteer3.py
New example script implementing CLI argument parsing, HelpSteer3 data processor (converts preference data to DatumSpec), data setup function (creates datasets and environments), and main orchestration calling grpo_train.
Script Updates
examples/run_sft.py, examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
SFT runner adds "max" OmegaConf resolver; custom plan script refactors static dictionary into computed get_custom_parallel_plan() function.
Tulu3 Dataset Implementations
nemo_rl/data/datasets/response_datasets/tulu3.py
New dataset classes: Tulu3PreferenceDataset and Tulu3SftMixtureDataset with data formatting utilities (to_preference_data_format, format_tulu3_sft_mixture) including train/validation splits and input validation.
Response Dataset Registry
nemo_rl/data/datasets/response_datasets/__init__.py
Adds import and load branch for Tulu3SftMixtureDataset with configurable test_size, prompt_file, and max_samples parameters.
HelpSteer3 Environment
nemo_rl/environments/helpsteer3_environment.py
New Ray remote environment with distributed HelpSteer3VerifyWorker pool for parallel response verification, score calculation combining exact-match/Jaccard similarity/length penalty, and post-processing metrics.
Infrastructure Updates
nemo_rl/distributed/ray_actor_environment_registry.py, nemo_rl/data/datasets/preference_datasets/helpsteer3.py, nemo_rl/utils/logger.py
Registers HelpSteer3Environment in actor registry; adds task_name field to HelpSteer3 preference data format; makes swanlab_enabled logger flag defensive against missing keys.
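The parallel verification described for the HelpSteer3 environment relies on splitting a batch of responses across a fixed worker pool. The helper name `chunk_list_to_workers` matches the utility referenced in the code-graph analysis below, but its exact signature is not shown in this review, so the following is an illustrative sketch of the chunking behavior, not the repository's implementation:

```python
def chunk_list_to_workers(items: list, num_workers: int) -> list[list]:
    """Split items into at most num_workers contiguous chunks of near-equal size."""
    chunks = []
    base, extra = divmod(len(items), num_workers)
    start = 0
    for i in range(num_workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # More workers than items: skip idle workers.
        chunks.append(items[start:start + size])
        start += size
    return chunks

# 7 responses across 3 workers -> chunk sizes 3, 2, 2
print(chunk_list_to_workers(list(range(7)), 3))  # → [[0, 1, 2], [3, 4], [5, 6]]
```

Each chunk would then be dispatched to one Ray remote worker, and the per-chunk scores concatenated back in order.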

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Script
    participant GRPOOrch as GRPO Orchestrator
    participant DataSetup as Data Setup
    participant Tokenizer as Tokenizer
    participant HelpSteer3Env as HelpSteer3 Environment
    participant Workers as Verify Workers
    
    User->>GRPOOrch: invoke main()
    GRPOOrch->>GRPOOrch: load config & CLI overrides
    GRPOOrch->>DataSetup: setup_data(tokenizer, config)
    
    DataSetup->>Tokenizer: initialize tokenizer
    DataSetup->>DataSetup: load HelpSteer3 preference dataset
    
    rect rgb(200, 220, 240)
        Note over DataSetup,HelpSteer3Env: Data Processing
        DataSetup->>DataSetup: helpsteer3_data_processor per datum<br/>(build chat, tokenize, compute ground_truth)
        DataSetup->>DataSetup: create AllTaskProcessedDataset<br/>(train + validation)
    end
    
    DataSetup->>HelpSteer3Env: initialize with config<br/>(num_workers, stop_strings)
    HelpSteer3Env->>Workers: spawn Ray remote workers<br/>(via SYSTEM Python)
    
    DataSetup-->>GRPOOrch: return datasets, environments, tokenizer
    
    rect rgb(240, 220, 200)
        Note over GRPOOrch,Workers: Training Loop
        GRPOOrch->>HelpSteer3Env: step(message_logs, metadata)
        HelpSteer3Env->>Workers: distribute response verification
        Workers->>Workers: verify & calculate scores<br/>(exact-match + Jaccard + length)
        Workers-->>HelpSteer3Env: scores & extracted answers
        HelpSteer3Env->>HelpSteer3Env: aggregate results, compute metrics
        HelpSteer3Env-->>GRPOOrch: EnvironmentReturn (rewards, observations)
    end
    
    GRPOOrch->>GRPOOrch: invoke grpo_train(policy, dataloaders, environments)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • HelpSteer3Environment implementation: Distributed Ray-based verification with custom scoring logic combining multiple heuristics; requires careful review of correctness of similarity calculations and worker pool management.
  • GRPO orchestration script: Complex data setup flow with multiple preprocessing steps, task spec construction, and environment initialization; verify data processor handles all edge cases and token truncation correctly.
  • Tulu3 dataset classes: Multiple validation checks and assertions across preference/SFT formatting; ensure invariants (last message from assistant, context consistency) are enforced correctly.
  • Heterogeneous file scope: Changes span configuration files, orchestration logic, dataset implementations, environment classes, and utility updates, requiring context-switching across domains.
  • New public API surface: Multiple exported classes and functions (HelpSteer3Environment, Tulu3SftMixtureDataset, helpsteer3_data_processor) that become dependencies for downstream GRPO workflows.
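The scoring logic flagged above combines exact-match, Jaccard similarity, and a length penalty. The actual weighting in `helpsteer3_environment.py` is not reproduced in this review, so the following is only a plausible sketch of how such a combination might look; the function names and the `length_penalty` coefficient are assumptions for illustration:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-tokenized word sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # Two empty responses are trivially identical.
    return len(sa & sb) / len(sa | sb)

def score_response(response: str, ground_truth: str, length_penalty: float = 0.001) -> float:
    """Illustrative combined score: exact match short-circuits to 1.0;
    otherwise Jaccard similarity minus a penalty proportional to the
    word-count gap, clamped to be non-negative."""
    if response.strip() == ground_truth.strip():
        return 1.0
    sim = jaccard_similarity(response, ground_truth)
    penalty = length_penalty * abs(len(response.split()) - len(ground_truth.split()))
    return max(0.0, sim - penalty)
```

Reviewing the real implementation against edge cases like empty responses and zero-length ground truths is exactly the "correctness of similarity calculations" concern raised above.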

Possibly related PRs

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
  • ashors1

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Test Results For Major Changes: ⚠️ Warning. PR introduces major GRPO/HelpSteer3 features but lacks test results documentation. Multiple bugs identified in review comments indicate inadequate testing before submission. Resolution: add comprehensive test results for new components, fix identified bugs in data handling and configuration, and document validation before merging.
  • Title check: ❓ Inconclusive. The title is vague and uses abbreviated technical jargon without clearly conveying the main change; 'grpo helpsteer cp tp' lacks descriptive context about what is being added or merged. Resolution: replace with a clear, descriptive title such as 'Add GRPO training support for HelpSteer3 with Llama-Nemotron-49B'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/utils/logger.py (1)

818-822: Configuration default violates coding guidelines.

Using cfg.get("swanlab_enabled", False) introduces a hidden default in code, violating the project's configuration guidelines. The TypedDict LoggerConfig at line 78 declares swanlab_enabled: bool without NotRequired, meaning it should be accessed directly as cfg["swanlab_enabled"].

As per coding guidelines:

  • "Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults"
  • "Express configuration optionality via TypedDict using typing.NotRequired"

Apply this fix:

-        if cfg.get("swanlab_enabled", False):
+        if cfg["swanlab_enabled"]:

And update the TypedDict at line 78:

-    swanlab_enabled: bool
+    swanlab_enabled: NotRequired[bool]

Then document the recommended default (e.g., false) in exemplar YAML configs.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8762f57 and 25ef396.

📒 Files selected for processing (13)
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml (1 hunks)
  • examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py (1 hunks)
  • examples/configs/sft_nemotron_super_49b.yaml (1 hunks)
  • examples/configs/sft_nemotron_super_49b_tulu_v3.yaml (1 hunks)
  • examples/run_grpo_helpsteer3.py (1 hunks)
  • examples/run_sft.py (1 hunks)
  • nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1 hunks)
  • nemo_rl/data/datasets/response_datasets/__init__.py (3 hunks)
  • nemo_rl/data/datasets/response_datasets/tulu3.py (1 hunks)
  • nemo_rl/distributed/ray_actor_environment_registry.py (1 hunks)
  • nemo_rl/environments/helpsteer3_environment.py (1 hunks)
  • nemo_rl/utils/logger.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults.
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml.

Files:

  • examples/configs/sft_nemotron_super_49b_tulu_v3.yaml
  • examples/configs/sft_nemotron_super_49b.yaml
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes/<domain>/

Files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
  • nemo_rl/utils/logger.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • examples/run_sft.py
  • nemo_rl/data/datasets/preference_datasets/helpsteer3.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/tulu3.py
  • nemo_rl/environments/helpsteer3_environment.py
  • examples/run_grpo_helpsteer3.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/utils/logger.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • nemo_rl/data/datasets/preference_datasets/helpsteer3.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/tulu3.py
  • nemo_rl/environments/helpsteer3_environment.py
🧠 Learnings (4)
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml

Applied to files:

  • examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to **/*.py : Target Python 3.12+ for all Python code in NeMo-RL

Applied to files:

  • nemo_rl/distributed/ray_actor_environment_registry.py
🧬 Code graph analysis (6)
nemo_rl/distributed/ray_actor_environment_registry.py (1)
nemo_rl/distributed/virtual_cluster.py (1)
  • PY_EXECUTABLES (43-59)
examples/run_sft.py (1)
tests/check_metrics.py (1)
  • max (30-32)
nemo_rl/data/datasets/response_datasets/__init__.py (1)
nemo_rl/data/datasets/response_datasets/tulu3.py (1)
  • Tulu3SftMixtureDataset (95-148)
nemo_rl/data/datasets/response_datasets/tulu3.py (2)
nemo_rl/data/interfaces.py (1)
  • TaskDataSpec (53-86)
nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1)
  • to_preference_data_format (22-55)
nemo_rl/environments/helpsteer3_environment.py (4)
nemo_rl/distributed/batched_data_dict.py (2)
  • BatchedDataDict (75-860)
  • chunk (199-235)
nemo_rl/distributed/virtual_cluster.py (1)
  • PY_EXECUTABLES (43-59)
nemo_rl/environments/interfaces.py (2)
  • EnvironmentInterface (52-88)
  • EnvironmentReturn (26-49)
nemo_rl/environments/utils.py (1)
  • chunk_list_to_workers (17-61)
examples/run_grpo_helpsteer3.py (10)
nemo_rl/algorithms/utils.py (1)
  • get_tokenizer (157-288)
nemo_rl/data/__init__.py (1)
  • DataConfig (18-40)
nemo_rl/data/datasets/processed_dataset.py (1)
  • AllTaskProcessedDataset (31-126)
nemo_rl/data/datasets/preference_datasets/__init__.py (1)
  • load_preference_dataset (25-78)
nemo_rl/data/interfaces.py (3)
  • DatumSpec (32-40)
  • TaskDataProcessFnCallable (89-100)
  • TaskDataSpec (53-86)
nemo_rl/distributed/ray_actor_environment_registry.py (1)
  • get_actor_python_env (50-65)
nemo_rl/distributed/virtual_cluster.py (1)
  • init_ray (85-171)
nemo_rl/environments/helpsteer3_environment.py (1)
  • HelpSteer3Environment (141-259)
nemo_rl/models/generation/__init__.py (1)
  • configure_generation_config (24-47)
nemo_rl/utils/logger.py (1)
  • get_next_experiment_dir (1311-1345)
🪛 Ruff (0.14.3)
nemo_rl/data/datasets/response_datasets/tulu3.py

87-87: Avoid specifying long messages outside the exception class

(TRY003)

nemo_rl/environments/helpsteer3_environment.py

70-70: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


81-81: Do not catch blind exception: Exception

(BLE001)


201-201: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

examples/run_grpo_helpsteer3.py

164-164: Unused function argument: seed

(ARG001)


208-208: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)


278-278: Unpacked variable cluster is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (6)
examples/run_sft.py (1)

35-35: LGTM!

The max resolver is correctly registered and used by the new YAML configs (e.g., sft_nemotron_super_49b_tulu_v3.yaml at line 57) for computing make_sequence_length_divisible_by.

examples/configs/sft_nemotron_super_49b_tulu_v3.yaml (1)

1-115: LGTM!

The configuration follows established patterns and correctly uses the max OmegaConf resolver for computing sequence length constraints. The structure aligns well with related Nemotron/GRPO configurations introduced in this PR.

examples/configs/sft_nemotron_super_49b.yaml (1)

1-134: LGTM!

The configuration is well-structured with helpful commented examples for alternative models and datasets. The use of the max resolver and the distributed training configurations align with the patterns established in related configs.

nemo_rl/data/datasets/response_datasets/__init__.py (1)

97-104: LGTM!

The integration of Tulu3SftMixtureDataset follows the established pattern for dataset loading, with appropriate parameter forwarding and console logging.

nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1)

54-54: LGTM!

Adding the task_name field provides useful task identification for GRPO workflows without altering the existing preference data selection logic.

nemo_rl/distributed/ray_actor_environment_registry.py (1)

39-39: LGTM!

The registry mapping for HelpSteer3Environment correctly uses PY_EXECUTABLES.SYSTEM, consistent with other environment actors that don't require special dependencies.

Comment on lines 189 to 191
cluster:
gpus_per_node: 8
num_nodes: 16

⚠️ Potential issue | 🟠 Major

Fix cluster node count mismatch.

The recipe advertises 4n8g, but the cluster.num_nodes value is hard-coded to 16. This will over-request hardware and breaks the naming contract for runnable snapshots. Please drop num_nodes back to 4 (or rename the file/config) so the declared topology matches the actual request.

 cluster:
   gpus_per_node: 8
-  num_nodes: 16
+  num_nodes: 4
🤖 Prompt for AI Agents
In
examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
around lines 189–191, the cluster topology is inconsistent with the
filename/advertised topology: gpus_per_node: 8 and num_nodes: 16 (file
advertises 4n8g). Change cluster.num_nodes to 4 to match the 4-node,
8-GPU-per-node topology (or alternatively rename the file/config to reflect 16
nodes), ensuring the declared topology matches the actual resource request and
naming contract.
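A lightweight way to catch this class of mismatch mechanically is to parse the `<nodes>n<gpus>g` token out of the recipe filename and compare it against the cluster block. This is a hypothetical lint check, not code from the PR:

```python
import re

def topology_matches(filename: str, cluster: dict) -> bool:
    """Return True if the '<N>n<G>g' token in a recipe filename matches the cluster config."""
    m = re.search(r"-(\d+)n(\d+)g-", filename)
    if m is None:
        return True  # No topology token in the name: nothing to validate.
    nodes, gpus = int(m.group(1)), int(m.group(2))
    return cluster["num_nodes"] == nodes and cluster["gpus_per_node"] == gpus

name = "grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml"
print(topology_matches(name, {"num_nodes": 16, "gpus_per_node": 8}))  # → False (the bug above)
print(topology_matches(name, {"num_nodes": 4, "gpus_per_node": 8}))   # → True
```

Run over every file in examples/configs/recipes/llm/, a check like this would have flagged the 4n8g-vs-16-node discrepancy before review.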

Comment on lines 54 to 56
if False:
# Enable sequence parallelism only if TP size > 1
base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))

⚠️ Potential issue | 🟡 Minor

Dead code requires clarification or removal.

The if False: block will never execute. Per coding guidelines: commented-out or dead code must include a nearby comment explaining usage and why it's disabled, otherwise it should be removed before merging.

If sequence parallelism support is planned for future work, add a clear TODO comment explaining the intent and conditions for enabling it. Otherwise, remove this block.

If keeping for future work:

+    # TODO: Enable sequence parallelism when TP size > 1
+    # Requires configuration plumbing to pass TP size to this function
     if False:
-        # Enable sequence parallelism only if TP size > 1
         base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))

Or remove if not planned:

-    if False:
-        # Enable sequence parallelism only if TP size > 1
-        base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))
-
🤖 Prompt for AI Agents
In examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py around
lines 54 to 56, there is a dead `if False:` block that never runs; either remove
the block entirely or replace it with a clear TODO comment explaining that
sequence-parallelism support is planned for future work, under what conditions
it should be enabled (e.g., when tensor-parallel size > 1 and sequence
parallelism implemented), and include a short example or link to the tracking
ticket; if you remove it, delete the three lines and any unused imports or
casts; if you keep it, change `if False:` to a commented TODO and keep the
explanatory note and required enabling conditions.
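If the block is kept, one way to plumb the missing condition is to make the plan a function of the tensor-parallel size, replacing `if False:` with a real predicate. This is a hypothetical refactor with placeholder plan entries, not the PR's actual `get_custom_parallel_plan`:

```python
def get_custom_parallel_plan(tp_size: int) -> dict[str, str]:
    """Build a TP plan, layering sequence-parallel entries only when TP size > 1."""
    # Placeholder entries; the real plan maps module patterns to ParallelStyle objects.
    base_model_tp_plan = {"layers.*.mlp.up_proj": "colwise"}
    base_model_sp_plan = {"layers.*.input_layernorm": "sequence_parallel"}
    if tp_size > 1:
        # Sequence parallelism only makes sense when the model is tensor-sharded.
        base_model_tp_plan.update(base_model_sp_plan)
    return base_model_tp_plan

print(get_custom_parallel_plan(1))  # TP entries only
print(get_custom_parallel_plan(8))  # TP entries plus SP entries
```

The caller would pass the configured TP size through, which is the "configuration plumbing" the review comment refers to.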

Comment on lines +131 to +147
for i in range(1, len(message_log)):
message_log[i]["token_ids"] = tokenizer("", return_tensors="pt")["input_ids"][0] # Empty tensor

length = sum(len(m["token_ids"]) for m in message_log)

# Create ground truth from the preferred completion for environment evaluation
ground_truth = " ".join([msg["content"] for msg in preferred_completion])
extra_env_info = {"ground_truth": ground_truth}

loss_multiplier = 1.0
if length > max_seq_length:
# Truncate if too long
for chat_message in message_log:
chat_message["token_ids"] = chat_message["token_ids"][
: min(max_seq_length // len(message_log), len(chat_message["token_ids"]))
]
loss_multiplier = 0.1 # Reduce loss for truncated sequences

⚠️ Potential issue | 🔴 Critical

Fix message token clearing and truncation math.

Two problems here corrupt the training data:

  1. Clearing per-message tokens via tokenizer("") still yields a BOS token, so every non-head message contributes stray tokens to length.
  2. When length > max_seq_length, slicing each message to max_seq_length // len(message_log) throws away most of the conversation and leaves length unrecomputed, so downstream batching still treats the sample as over-length.

Please zero the non-head tensors with new_empty(0) (or similar) and, on truncation, only clip the head tensor to max_seq_length, then recompute length to reflect the actual token count.

-    for i in range(1, len(message_log)):
-        message_log[i]["token_ids"] = tokenizer("", return_tensors="pt")["input_ids"][0]  # Empty tensor
-
-    length = sum(len(m["token_ids"]) for m in message_log)
-
-    if length > max_seq_length:
-        # Truncate if too long
-        for chat_message in message_log:
-            chat_message["token_ids"] = chat_message["token_ids"][
-                : min(max_seq_length // len(message_log), len(chat_message["token_ids"]))
-            ]
-        loss_multiplier = 0.1  # Reduce loss for truncated sequences
+    empty_tokens = message_log[0]["token_ids"].new_empty((0,))
+    for i in range(1, len(message_log)):
+        message_log[i]["token_ids"] = empty_tokens
+
+    length = sum(len(m["token_ids"]) for m in message_log)
+
+    if max_seq_length is not None and length > max_seq_length:
+        message_log[0]["token_ids"] = message_log[0]["token_ids"][:max_seq_length]
+        loss_multiplier = 0.1  # Reduce loss for truncated sequences
+        length = sum(len(m["token_ids"]) for m in message_log)
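The truncation problem is easiest to see with plain lists of token counts standing in for tensors. In this sketch (illustrative helpers, not the PR's code), slicing every message to `max_seq_length // num_messages` discards far more than the budget requires, while clipping only the head message, as the diff above suggests, uses the full window:

```python
def truncate_per_message(message_lens: list[int], max_seq_length: int) -> list[int]:
    """Buggy behavior: every message is clipped to an equal share of the budget."""
    share = max_seq_length // len(message_lens)
    return [min(share, n) for n in message_lens]

def truncate_head_only(message_lens: list[int], max_seq_length: int) -> list[int]:
    """Suggested behavior: non-head messages are already empty; clip only the head."""
    head, *rest = message_lens
    return [min(head, max_seq_length), *rest]

# Head message has 100 tokens; the two cleared messages have 0; budget is 80.
lens = [100, 0, 0]
print(sum(truncate_per_message(lens, 80)))  # → 26: budget // 3 wastes most of the window
print(sum(truncate_head_only(lens, 80)))    # → 80: the full budget is used
```

Recomputing `length` after truncation matters for the same reason: downstream batching decisions read the stale value otherwise.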

Comment on lines 52 to 64
return {
"context": context,
"completions": [
{
"rank": 0,
"completion": [{"role": "assistant", "content": chosen_response}],
},
{
"rank": 1,
"completion": [{"role": "assistant", "content": rejected_response}],
},
],
}

⚠️ Potential issue | 🔴 Critical

Include task_name in preference records.

Every entry returned by to_preference_data_format needs a task_name so AllTaskProcessedDataset can resolve the correct processor. Right now the map in Tulu3PreferenceDataset will hit the assertion (task processor not provided for ...) because the key is missing entirely. Add the "task_name": "Tulu3Preference" field to each record.

     return {
         "context": context,
         "completions": [
             {
                 "rank": 0,
                 "completion": [{"role": "assistant", "content": chosen_response}],
             },
             {
                 "rank": 1,
                 "completion": [{"role": "assistant", "content": rejected_response}],
             },
         ],
+        "task_name": "Tulu3Preference",
     }
🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/tulu3.py around lines 52 to 64, the
preference record dicts returned by to_preference_data_format are missing the
required "task_name" key; add the field "task_name": "Tulu3Preference" to every
top-level preference record (the dict that currently contains "context" and
"completions") so AllTaskProcessedDataset can resolve the processor; ensure the
key/value is included alongside "context" and "completions" in both returned
records.

Comment on lines +848 to 851
if cfg.get("swanlab_enabled", False) and self.swanlab_logger:
self.swanlab_logger.define_metric(
f"{metric_prefix}/*", step_metric=step_metric
)
Contributor

⚠️ Potential issue | 🟠 Major

Same configuration default issue.

Same violation as line 818: cfg.get("swanlab_enabled", False) should be cfg["swanlab_enabled"] after marking the field as NotRequired[bool] in the TypedDict and documenting the default in YAML configs.

Apply this fix:

-            if cfg.get("swanlab_enabled", False) and self.swanlab_logger:
+            if cfg["swanlab_enabled"] and self.swanlab_logger:

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In nemo_rl/utils/logger.py around lines 848 to 851, change the
cfg.get("swanlab_enabled", False) usage to direct indexing
cfg["swanlab_enabled"] (remove the False default) because the TypedDict field
has been marked NotRequired[bool] and the default is documented in YAML; update
the line to use cfg["swanlab_enabled"] and keep the existing self.swanlab_logger
check unchanged so the code relies on the documented default rather than a local
fallback.
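For context on the runtime tradeoff behind this suggestion, a minimal standalone sketch (plain dicts standing in for the real LoggerConfig TypedDict; keys and values are illustrative): `.get` with a local default silently masks a missing key, while direct indexing assumes the key is always populated, e.g. by a documented YAML default, and fails loudly otherwise:

```python
# Plain dicts standing in for the real LoggerConfig; hypothetical keys/values.
cfg_complete = {"wandb_enabled": True, "swanlab_enabled": False}  # default filled in from YAML
cfg_missing = {"wandb_enabled": True}  # "swanlab_enabled" absent at runtime

# A local fallback hides the fact that the key was never set:
assert cfg_missing.get("swanlab_enabled", False) is False

# Direct indexing works when the YAML default guarantees the key...
assert cfg_complete["swanlab_enabled"] is False

# ...and surfaces config drift immediately when it does not:
try:
    cfg_missing["swanlab_enabled"]
except KeyError:
    print("direct indexing raises KeyError on a missing key")
```

This is why the review pairs the indexing change with documenting the default in the YAML configs: the loud failure is only acceptable once the key is guaranteed present.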

Contributor

@yuki-97 yuki-97 left a comment

Thanks for the contribution @nv-mmanohara! I left some comments; please ping me if you need any help.

@@ -12,38 +12,73 @@
# See the License for the specific language governing permissions and
Contributor

I just realized this file and examples/custom_parallel.py would be better placed somewhere else, e.g., examples/custom_parallel_plan/.

Can you help move them, or do you mind if I do this directly on your branch?

@@ -0,0 +1,134 @@
# SFT Algorithm Configuration
Contributor

I think this file and the other SFT yaml would be better moved under examples/configs/recipes/llm and renamed to the same format, e.g. grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml. Can you help move them?

@@ -0,0 +1,305 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor

How is this file different from examples/run_grpo_math.py? If the changes are minor, can we just update that file instead of creating a new one?

Author

I agree that the code is redundant, but many configurations are hardcoded and each dataset needs a specific preprocessing strategy, so we'll probably end up duplicating files for new datasets anyway.

}


class Tulu3PreferenceDataset:
Contributor

Are Tulu3PreferenceDataset and to_preference_data_format in this file actually used? Can we remove them if not?

self.task_spec = TaskDataSpec(
task_name="Tulu3SftMixture",
prompt_file=prompt_file,
)
\ No newline at end of file
Contributor

nit: typo

def __init__(self) -> None:
pass

def verify(
Contributor

Just curious, how is this environment different from the existing environment? Could we adapt the existing environment instead of adding a new one?

Author

Same as the case for the run_grpo_helpsteer3.py file.

self.loggers.append(self.wandb_logger)

-        if cfg["swanlab_enabled"]:
+        if cfg.get("swanlab_enabled", False):
Contributor

This change is no longer needed after inheriting the yaml; the same applies to the one below.
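For illustration, a minimal sketch of what yaml inheritance buys here. The keys and the simplified deep_merge are hypothetical stand-ins for the actual config loader: the recipe only overrides what differs, so defaults like swanlab_enabled always arrive from the base config and the code needs no `.get()` fallback:

```python
# Hypothetical base/recipe configs; a simplified stand-in for yaml inheritance.
base = {
    "logger": {"wandb_enabled": False, "swanlab_enabled": False},
}
recipe = {
    "logger": {"wandb_enabled": True},  # only the delta from the base
}


def deep_merge(base_cfg: dict, overrides: dict) -> dict:
    # Recursively overlay overrides on top of the base config.
    merged = dict(base_cfg)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


cfg = deep_merge(base, recipe)
print(cfg["logger"])  # {'wandb_enabled': True, 'swanlab_enabled': False}
```

With the merged config, every documented key is guaranteed present at runtime, which is what makes the direct-indexing style above safe.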


2 participants