Mmanohara/merge grpo helpsteer cp tp #1472
base: main
Conversation
📝 Walkthrough

This PR introduces comprehensive support for GRPO training and SFT fine-tuning workflows. It adds configuration files for Llama-3.2 and Llama-3.3-Nemotron models with HelpSteer3 reward verification, implements new dataset classes for Tulu3 preference and SFT mixture tasks, and provides a distributed HelpSteer3 environment with verification workers. A new example script orchestrates GRPO training with data setup and configuration loading.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Script
    participant GRPOOrch as GRPO Orchestrator
    participant DataSetup as Data Setup
    participant Tokenizer as Tokenizer
    participant HelpSteer3Env as HelpSteer3 Environment
    participant Workers as Verify Workers

    User->>GRPOOrch: invoke main()
    GRPOOrch->>GRPOOrch: load config & CLI overrides
    GRPOOrch->>DataSetup: setup_data(tokenizer, config)
    DataSetup->>Tokenizer: initialize tokenizer
    DataSetup->>DataSetup: load HelpSteer3 preference dataset
    rect rgb(200, 220, 240)
        Note over DataSetup,HelpSteer3Env: Data Processing
        DataSetup->>DataSetup: helpsteer3_data_processor per datum<br/>(build chat, tokenize, compute ground_truth)
        DataSetup->>DataSetup: create AllTaskProcessedDataset<br/>(train + validation)
    end
    DataSetup->>HelpSteer3Env: initialize with config<br/>(num_workers, stop_strings)
    HelpSteer3Env->>Workers: spawn Ray remote workers<br/>(via SYSTEM Python)
    DataSetup-->>GRPOOrch: return datasets, environments, tokenizer
    rect rgb(240, 220, 200)
        Note over GRPOOrch,Workers: Training Loop
        GRPOOrch->>HelpSteer3Env: step(message_logs, metadata)
        HelpSteer3Env->>Workers: distribute response verification
        Workers->>Workers: verify & calculate scores<br/>(exact-match + Jaccard + length)
        Workers-->>HelpSteer3Env: scores & extracted answers
        HelpSteer3Env->>HelpSteer3Env: aggregate results, compute metrics
        HelpSteer3Env-->>GRPOOrch: EnvironmentReturn (rewards, observations)
    end
    GRPOOrch->>GRPOOrch: invoke grpo_train(policy, dataloaders, environments)
```
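For orientation, here is a rough sketch of what the "exact-match + Jaccard + length" scoring mentioned in the diagram could look like. The weights, normalization, and tokenization below are assumptions for illustration, not the actual logic in nemo_rl/environments/helpsteer3_environment.py.

```python
def score_response(response: str, ground_truth: str) -> float:
    """Blend exact-match, token-level Jaccard overlap, and a length-ratio term.

    Illustrative only: the real verify workers may weight or normalize differently.
    """
    resp_tokens = set(response.lower().split())
    gt_tokens = set(ground_truth.lower().split())

    exact = 1.0 if response.strip() == ground_truth.strip() else 0.0
    union = resp_tokens | gt_tokens
    jaccard = len(resp_tokens & gt_tokens) / len(union) if union else 0.0
    length_ratio = (
        min(len(response), len(ground_truth)) / max(len(response), len(ground_truth))
        if response and ground_truth
        else 0.0
    )
    # Assumed weighting; the PR's workers may combine these terms differently.
    return 0.5 * exact + 0.4 * jaccard + 0.1 * length_ratio
```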
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning, 1 inconclusive) · ✅ Passed checks (1 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/utils/logger.py (1)
818-822: Configuration default violates coding guidelines.

Using `cfg.get("swanlab_enabled", False)` introduces a hidden default in code, violating the project's configuration guidelines. The TypedDict `LoggerConfig` at line 78 declares `swanlab_enabled: bool` without `NotRequired`, meaning it should be accessed directly as `cfg["swanlab_enabled"]`. As per coding guidelines:

- "Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults"
- "Express configuration optionality via TypedDict using typing.NotRequired"

Apply this fix:

```diff
-if cfg.get("swanlab_enabled", False):
+if cfg["swanlab_enabled"]:
```

And update the TypedDict at line 78:

```diff
-swanlab_enabled: bool
+swanlab_enabled: NotRequired[bool]
```

Then document the recommended default (e.g., `false`) in exemplar YAML configs.
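For illustration, the end state this comment asks for might look like the sketch below; other LoggerConfig fields are omitted, and the direct access assumes the exemplar YAML documents and sets the key (the false default lives only in YAML).

```python
from typing import NotRequired, TypedDict


class LoggerConfig(TypedDict):
    # Optional at the type level; the recommended default (false) is documented in the
    # exemplar YAMLs under examples/configs/, not hard-coded in Python.
    swanlab_enabled: NotRequired[bool]


def swanlab_requested(cfg: LoggerConfig) -> bool:
    # Direct access per the guideline; assumes the exemplar YAML sets the key.
    return cfg["swanlab_enabled"]
```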
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml (1 hunks)
- examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py (1 hunks)
- examples/configs/sft_nemotron_super_49b.yaml (1 hunks)
- examples/configs/sft_nemotron_super_49b_tulu_v3.yaml (1 hunks)
- examples/run_grpo_helpsteer3.py (1 hunks)
- examples/run_sft.py (1 hunks)
- nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1 hunks)
- nemo_rl/data/datasets/response_datasets/__init__.py (3 hunks)
- nemo_rl/data/datasets/response_datasets/tulu3.py (1 hunks)
- nemo_rl/distributed/ray_actor_environment_registry.py (1 hunks)
- nemo_rl/environments/helpsteer3_environment.py (1 hunks)
- nemo_rl/utils/logger.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
examples/configs/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml
Files:
- examples/configs/sft_nemotron_super_49b_tulu_v3.yaml
- examples/configs/sft_nemotron_super_49b.yaml
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name
Files:
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml
Files:
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/**/*.{yaml,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script
Files:
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
examples/configs/recipes/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place recipe YAMLs under examples/configs/recipes/<domain>/
Files:
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
- examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
- examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py
- nemo_rl/utils/logger.py
- nemo_rl/distributed/ray_actor_environment_registry.py
- examples/run_sft.py
- nemo_rl/data/datasets/preference_datasets/helpsteer3.py
- nemo_rl/data/datasets/response_datasets/__init__.py
- nemo_rl/data/datasets/response_datasets/tulu3.py
- nemo_rl/environments/helpsteer3_environment.py
- examples/run_grpo_helpsteer3.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
- nemo_rl/utils/logger.py
- nemo_rl/distributed/ray_actor_environment_registry.py
- nemo_rl/data/datasets/preference_datasets/helpsteer3.py
- nemo_rl/data/datasets/response_datasets/__init__.py
- nemo_rl/data/datasets/response_datasets/tulu3.py
- nemo_rl/environments/helpsteer3_environment.py
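To make two of these guidelines concrete, here is a minimal self-contained sketch; the class and key names are illustrative, not code from this PR.

```python
from typing import NotRequired, TypedDict

import ray


class HelpSteerEnvConfig(TypedDict):
    """Illustrative environment config.

    num_workers: number of Ray verify workers to spawn (int >= 1); recommended
        default documented in the exemplar YAML, not hard-coded here.
    stop_strings: optional list of generation stop strings.
    """

    num_workers: int
    stop_strings: NotRequired[list[str]]


@ray.remote
class VerifyWorker:  # pragma: no cover
    """Toy worker showing '# pragma: no cover' placement for Ray remote classes."""

    def verify(self, response: str, ground_truth: str) -> float:
        # Trivial scoring stand-in; real workers do more.
        return float(response.strip() == ground_truth.strip())
```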
🧠 Learnings (4)
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.
Applied to files:
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
Applied to files:
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
- examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : LLM recipe YAML filenames must follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml
Applied to files:
examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to **/*.py : Target Python 3.12+ for all Python code in NeMo-RL
Applied to files:
nemo_rl/distributed/ray_actor_environment_registry.py
🧬 Code graph analysis (6)

nemo_rl/distributed/ray_actor_environment_registry.py (1)
- nemo_rl/distributed/virtual_cluster.py (1): PY_EXECUTABLES (43-59)

examples/run_sft.py (1)
- tests/check_metrics.py (1): max (30-32)

nemo_rl/data/datasets/response_datasets/__init__.py (1)
- nemo_rl/data/datasets/response_datasets/tulu3.py (1): Tulu3SftMixtureDataset (95-148)

nemo_rl/data/datasets/response_datasets/tulu3.py (2)
- nemo_rl/data/interfaces.py (1): TaskDataSpec (53-86)
- nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1): to_preference_data_format (22-55)

nemo_rl/environments/helpsteer3_environment.py (4)
- nemo_rl/distributed/batched_data_dict.py (2): BatchedDataDict (75-860), chunk (199-235)
- nemo_rl/distributed/virtual_cluster.py (1): PY_EXECUTABLES (43-59)
- nemo_rl/environments/interfaces.py (2): EnvironmentInterface (52-88), EnvironmentReturn (26-49)
- nemo_rl/environments/utils.py (1): chunk_list_to_workers (17-61)

examples/run_grpo_helpsteer3.py (10)
- nemo_rl/algorithms/utils.py (1): get_tokenizer (157-288)
- nemo_rl/data/__init__.py (1): DataConfig (18-40)
- nemo_rl/data/datasets/processed_dataset.py (1): AllTaskProcessedDataset (31-126)
- nemo_rl/data/datasets/preference_datasets/__init__.py (1): load_preference_dataset (25-78)
- nemo_rl/data/interfaces.py (3): DatumSpec (32-40), TaskDataProcessFnCallable (89-100), TaskDataSpec (53-86)
- nemo_rl/distributed/ray_actor_environment_registry.py (1): get_actor_python_env (50-65)
- nemo_rl/distributed/virtual_cluster.py (1): init_ray (85-171)
- nemo_rl/environments/helpsteer3_environment.py (1): HelpSteer3Environment (141-259)
- nemo_rl/models/generation/__init__.py (1): configure_generation_config (24-47)
- nemo_rl/utils/logger.py (1): get_next_experiment_dir (1311-1345)
🪛 Ruff (0.14.3)
nemo_rl/data/datasets/response_datasets/tulu3.py
87-87: Avoid specifying long messages outside the exception class
(TRY003)
nemo_rl/environments/helpsteer3_environment.py
70-70: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
81-81: Do not catch blind exception: Exception
(BLE001)
201-201: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
examples/run_grpo_helpsteer3.py
164-164: Unused function argument: seed
(ARG001)
208-208: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
278-278: Unpacked variable cluster is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
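Hypothetical shapes for fixes to the B905 and BLE001 findings above; the function names are made up for illustration, not taken from the PR.

```python
def pair_scores(responses: list[str], truths: list[str]) -> list[tuple[str, str]]:
    # B905: make zip fail loudly on mismatched lengths instead of silently truncating.
    return list(zip(responses, truths, strict=True))


def parse_score(raw: str) -> float:
    # BLE001: catch only the exception the parse can actually raise,
    # and keep the main logic in the else branch per the coding guidelines.
    try:
        value = float(raw)
    except ValueError:
        return 0.0
    else:
        return value
```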
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (6)
examples/run_sft.py (1)

35-35: LGTM! The `max` resolver is correctly registered and used by the new YAML configs (e.g., `sft_nemotron_super_49b_tulu_v3.yaml` at line 57) for computing `make_sequence_length_divisible_by` (see the resolver sketch after these comments).

examples/configs/sft_nemotron_super_49b_tulu_v3.yaml (1)

1-115: LGTM! The configuration follows established patterns and correctly uses the `max` OmegaConf resolver for computing sequence length constraints. The structure aligns well with related Nemotron/GRPO configurations introduced in this PR.

examples/configs/sft_nemotron_super_49b.yaml (1)

1-134: LGTM! The configuration is well-structured with helpful commented examples for alternative models and datasets. The use of the `max` resolver and the distributed training configurations align with the patterns established in related configs.

nemo_rl/data/datasets/response_datasets/__init__.py (1)

97-104: LGTM! The integration of `Tulu3SftMixtureDataset` follows the established pattern for dataset loading, with appropriate parameter forwarding and console logging.

nemo_rl/data/datasets/preference_datasets/helpsteer3.py (1)

54-54: LGTM! Adding the `task_name` field provides useful task identification for GRPO workflows without altering the existing preference data selection logic.

nemo_rl/distributed/ray_actor_environment_registry.py (1)

39-39: LGTM! The registry mapping for `HelpSteer3Environment` correctly uses `PY_EXECUTABLES.SYSTEM`, consistent with other environment actors that don't require special dependencies.
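For readers unfamiliar with OmegaConf resolvers, here is a minimal sketch of the mechanism referenced in the run_sft.py comment; the key names and values are illustrative, not copied from the PR's YAML.

```python
from omegaconf import OmegaConf

# Register a "max" resolver so YAML can compute values at load time,
# similar in spirit to what run_sft.py does for make_sequence_length_divisible_by.
OmegaConf.register_new_resolver("max", max, replace=True)

cfg = OmegaConf.create(
    {
        "policy": {
            "tensor_parallel_size": 8,
            "context_parallel_size": 2,
            # Take the larger of the two parallelism degrees (illustrative expression).
            "make_sequence_length_divisible_by": "${max:${policy.tensor_parallel_size},${policy.context_parallel_size}}",
        }
    }
)
print(cfg.policy.make_sequence_length_divisible_by)  # -> 8
```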
```yaml
cluster:
  gpus_per_node: 8
  num_nodes: 16
```
Fix cluster node count mismatch.
The recipe advertises 4n8g, but the cluster.num_nodes value is hard-coded to 16. This will over-request hardware and breaks the naming contract for runnable snapshots. Please drop num_nodes back to 4 (or rename the file/config) so the declared topology matches the actual request.
```diff
 cluster:
   gpus_per_node: 8
-  num_nodes: 16
+  num_nodes: 4
```
🤖 Prompt for AI Agents
In
examples/configs/recipes/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-4n8g-fsdp2tp8.yaml
around lines 189–191, the cluster topology is inconsistent with the
filename/advertised topology: gpus_per_node: 8 and num_nodes: 16 (file
advertises 4n8g). Change cluster.num_nodes to 4 to match the 4-node,
8-GPU-per-node topology (or alternatively rename the file/config to reflect 16
nodes), ensuring the declared topology matches the actual resource request and
naming contract.
```python
if False:
    # Enable sequence parallelism only if TP size > 1
    base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))
```
Dead code requires clarification or removal.
The if False: block will never execute. Per coding guidelines: commented-out or dead code must include a nearby comment explaining usage and why it's disabled, otherwise it should be removed before merging.
If sequence parallelism support is planned for future work, add a clear TODO comment explaining the intent and conditions for enabling it. Otherwise, remove this block.
If keeping for future work:
```diff
+# TODO: Enable sequence parallelism when TP size > 1
+# Requires configuration plumbing to pass TP size to this function
 if False:
-    # Enable sequence parallelism only if TP size > 1
     base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))
```

Or remove if not planned:

```diff
-if False:
-    # Enable sequence parallelism only if TP size > 1
-    base_model_tp_plan.update(cast(dict[str, ParallelStyle], base_model_sp_plan))
```
🤖 Prompt for AI Agents
In examples/configs/recipes/llm/llama_nemotron_super_49b_custom_plan.py around
lines 54 to 56, there is a dead `if False:` block that never runs; either remove
the block entirely or replace it with a clear TODO comment explaining that
sequence-parallelism support is planned for future work, under what conditions
it should be enabled (e.g., when tensor-parallel size > 1 and sequence
parallelism implemented), and include a short example or link to the tracking
ticket; if you remove it, delete the three lines and any unused imports or
casts; if you keep it, change `if False:` to a commented TODO and keep the
explanatory note and required enabling conditions.
```python
for i in range(1, len(message_log)):
    message_log[i]["token_ids"] = tokenizer("", return_tensors="pt")["input_ids"][0]  # Empty tensor

length = sum(len(m["token_ids"]) for m in message_log)

# Create ground truth from the preferred completion for environment evaluation
ground_truth = " ".join([msg["content"] for msg in preferred_completion])
extra_env_info = {"ground_truth": ground_truth}

loss_multiplier = 1.0
if length > max_seq_length:
    # Truncate if too long
    for chat_message in message_log:
        chat_message["token_ids"] = chat_message["token_ids"][
            : min(max_seq_length // len(message_log), len(chat_message["token_ids"]))
        ]
    loss_multiplier = 0.1  # Reduce loss for truncated sequences
```
Fix message token clearing and truncation math.

Two problems here corrupt the training data:

- Clearing per-message tokens via `tokenizer("")` still yields a BOS token, so every non-head message contributes stray tokens to `length`.
- When `length > max_seq_length`, slicing each message to `max_seq_length // len(message_log)` throws away most of the conversation and leaves `length` unrecomputed, so downstream batching still treats the sample as over-length.

Please zero the non-head tensors with `new_empty(0)` (or similar) and, on truncation, only clip the head tensor to `max_seq_length`, then recompute `length` to reflect the actual token count.
```diff
-for i in range(1, len(message_log)):
-    message_log[i]["token_ids"] = tokenizer("", return_tensors="pt")["input_ids"][0]  # Empty tensor
-
-length = sum(len(m["token_ids"]) for m in message_log)
-
-if length > max_seq_length:
-    # Truncate if too long
-    for chat_message in message_log:
-        chat_message["token_ids"] = chat_message["token_ids"][
-            : min(max_seq_length // len(message_log), len(chat_message["token_ids"]))
-        ]
-    loss_multiplier = 0.1  # Reduce loss for truncated sequences
+empty_tokens = message_log[0]["token_ids"].new_empty((0,))
+for i in range(1, len(message_log)):
+    message_log[i]["token_ids"] = empty_tokens
+
+length = sum(len(m["token_ids"]) for m in message_log)
+
+if max_seq_length is not None and length > max_seq_length:
+    message_log[0]["token_ids"] = message_log[0]["token_ids"][:max_seq_length]
+    loss_multiplier = 0.1  # Reduce loss for truncated sequences
+    length = sum(len(m["token_ids"]) for m in message_log)
```

```python
return {
    "context": context,
    "completions": [
        {
            "rank": 0,
            "completion": [{"role": "assistant", "content": chosen_response}],
        },
        {
            "rank": 1,
            "completion": [{"role": "assistant", "content": rejected_response}],
        },
    ],
}
```
Include task_name in preference records.
Every entry returned by to_preference_data_format needs a task_name so AllTaskProcessedDataset can resolve the correct processor. Right now the map in Tulu3PreferenceDataset will hit the assertion (task processor not provided for ...) because the key is missing entirely. Add the "task_name": "Tulu3Preference" field to each record.
```diff
     return {
         "context": context,
         "completions": [
             {
                 "rank": 0,
                 "completion": [{"role": "assistant", "content": chosen_response}],
             },
             {
                 "rank": 1,
                 "completion": [{"role": "assistant", "content": rejected_response}],
             },
         ],
+        "task_name": "Tulu3Preference",
     }
```
🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/tulu3.py around lines 52 to 64, the
preference record dicts returned by to_preference_data_format are missing the
required "task_name" key; add the field "task_name": "Tulu3Preference" to every
top-level preference record (the dict that currently contains "context" and
"completions") so AllTaskProcessedDataset can resolve the processor; ensure the
key/value is included alongside "context" and "completions" in both returned
records.
```python
if cfg.get("swanlab_enabled", False) and self.swanlab_logger:
    self.swanlab_logger.define_metric(
        f"{metric_prefix}/*", step_metric=step_metric
    )
```
Same configuration default issue.
Same violation as line 818: cfg.get("swanlab_enabled", False) should be cfg["swanlab_enabled"] after marking the field as NotRequired[bool] in the TypedDict and documenting the default in YAML configs.
Apply this fix:
```diff
-if cfg.get("swanlab_enabled", False) and self.swanlab_logger:
+if cfg["swanlab_enabled"] and self.swanlab_logger:
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In nemo_rl/utils/logger.py around lines 848 to 851, change the
cfg.get("swanlab_enabled", False) usage to direct indexing
cfg["swanlab_enabled"] (remove the False default) because the TypedDict field
has been marked NotRequired[bool] and the default is documented in YAML; update
the line to use cfg["swanlab_enabled"] and keep the existing self.swanlab_logger
check unchanged so the code relies on the documented default rather than a local
fallback.
yuki-97 left a comment
thanks for the contribution @nv-mmanohara! left some comments, please ping me if you need any help.
examples/configs/recipes/llm/grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml (2 resolved comments)
```diff
@@ -12,38 +12,73 @@
 # See the License for the specific language governing permissions and
```
I just realized this file and examples/custom_parallel.py would be better placed somewhere else, e.g., examples/custom_parallel_plan/.
Can you help move them, or do you mind me doing this on your branch directly?
```diff
@@ -0,0 +1,134 @@
+# SFT Algorithm Configuration
```
I think this file and the other SFT YAML would be better moved under examples/configs/recipes/llm and renamed to the same format as grpo-helpsteer3-llama-3.2-1b-1n8g-fsdp2tp1.yaml. Can you help move them?
```diff
@@ -0,0 +1,305 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
how is this file different from examples/run_grpo_math.py? if there are minor changes, can we just update that file instead of creating a new one?
I agree that the code is redundant, but many configurations are hardcoded and each dataset needs a specific preprocessing strategy, so we'll probably end up duplicating files for new datasets.
```python
}


class Tulu3PreferenceDataset:
```
are Tulu3PreferenceDataset and to_preference_data_format in the file useful? can we remove them if not?
```python
        self.task_spec = TaskDataSpec(
            task_name="Tulu3SftMixture",
            prompt_file=prompt_file,
        )
```
(No newline at end of file)
nit: typo
```python
    def __init__(self) -> None:
        pass

    def verify(
```
Just curious, how is this environment different from the existing environments? Can we do something with an existing environment instead of adding a new one?
Same as the case for run_grpo_helpsteer3.py file.
```diff
         self.loggers.append(self.wandb_logger)

-        if cfg["swanlab_enabled"]:
+        if cfg.get("swanlab_enabled", False):
```
this change is no longer needed after inheriting the yaml. also applies to the one below.
What does this PR do?
GRPO support for HelpSteer3 on LlamaNemotron 49B.
Issues
List issues that this PR closes (syntax):
Usage
```python
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes