
Conversation

Contributor

youngeunkwon0405 commented Nov 10, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added max_num_in_flight_batches_in_generation configuration parameter to control the number of in-flight prompt batches during generation, enabling fine-tuning of throughput versus off-policyness tradeoffs.
  • Documentation

    • Added guidance on using the new parameter, including recommended settings for maximizing throughput and managing training efficiency.

youngeunkwon0405 self-assigned this Nov 10, 2025
youngeunkwon0405 requested review from a team as code owners November 10, 2025 22:52
github-actions bot added the documentation (Improvements or additions to documentation) label Nov 10, 2025
Contributor

coderabbitai bot commented Nov 10, 2025

📝 Walkthrough

This PR introduces a new configuration option max_num_in_flight_batches_in_generation to the async GRPO algorithm, enabling explicit control over the maximum number of in-flight prompt batches during generation. The change includes documentation, configuration schema updates, and implementation modifications to use this parameter instead of deriving it directly from max_trajectory_age_steps.

Changes

  • Documentation Updates (docs/guides/async-grpo.md): Added a new section on controlling max in-flight batches, including configuration details, the valid range constraint (1 ≤ value ≤ max_trajectory_age_steps), throughput guidance, and off-policyness trade-offs.
  • Configuration Schema & Examples (examples/configs/grpo_math_1B.yaml): Added the max_num_in_flight_batches_in_generation field under grpo.async_grpo, with a default that references max_trajectory_age_steps and documentation comments explaining the valid range, the effect on the in-flight prompt calculation, and the parameter interaction.
  • Implementation (nemo_rl/algorithms/async_utils.py, nemo_rl/algorithms/grpo.py): Updated the async GRPO configuration to use the new max_num_in_flight_batches_in_generation parameter when computing in-flight prompt limits; the multiplier in the async utilities is now derived from this explicit field rather than directly from max_trajectory_age_steps (a sketch of this calculation follows below).
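
For intuition, here is a minimal sketch of the in-flight limit calculation described above. Only the relationship (in-flight prompts = in-flight batches * prompts per step) and the field names come from this PR; the function, config shape, and variable names are illustrative assumptions.

```python
# Hypothetical sketch only: illustrates the multiplier change described above.
# The config dict shape and function name are assumptions, not the PR's code.
def max_in_flight_prompts(async_grpo_cfg: dict, num_prompts_per_step: int) -> int:
    # Previously the multiplier was max_trajectory_age_steps itself; the new
    # field allows a smaller value (down to 1) to reduce average trajectory age
    # at a possible cost in generation throughput.
    num_batches = async_grpo_cfg.get(
        "max_num_in_flight_batches_in_generation",
        async_grpo_cfg["max_trajectory_age_steps"],
    )
    return num_batches * num_prompts_per_step
```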

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Implementation changes are straightforward parameter substitution without complex logic modifications
  • Configuration schema addition is a direct field extension with no validation logic changes
  • Changes are homogeneous and scoped to a single feature across consistent file patterns

Suggested labels

documentation, asyncRL

Suggested reviewers

  • parthchadha
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes (⚠️ Warning): The PR introduces a major async GRPO feature affecting parallelism but lacks test results, performance metrics, or convergence verification. Resolution: add comprehensive testing information, including test results, performance benchmarks, convergence verification, and references to any new tests added to validate the feature.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately describes the main change: adding Pipeline-RL-style regulation of in-flight prompts, which is the core feature across all modified files.

Contributor

coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
nemo_rl/algorithms/async_utils.py (1)

281-292: Add runtime validation for max_num_in_flight_batches_in_generation configuration bounds.

Verification confirms that no runtime validation exists for the documented constraint 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps (defined in nemo_rl/algorithms/grpo.py:113). While the 'or 1' fallback at nemo_rl/algorithms/async_utils.py:291 protects the lower bound, the upper bound is unchecked, which could lead to excessive in-flight requests and memory issues.

Add a validation check in nemo_rl/algorithms/async_utils.py around line 284-291 to assert that max_num_in_flight_batches_in_generation does not exceed max_trajectory_age_steps.
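
For reference, a minimal sketch of what such a bounds check could look like (the function name and config access pattern are assumptions, not the PR's actual code):

```python
# Hypothetical sketch of the suggested runtime validation; the config dict
# shape is assumed from the documented constraint in grpo.py.
def validate_in_flight_batches(async_grpo_cfg: dict) -> int:
    max_age = async_grpo_cfg["max_trajectory_age_steps"]
    # Mirror the existing `or 1` lower-bound fallback, then enforce the upper bound.
    num_batches = async_grpo_cfg.get("max_num_in_flight_batches_in_generation") or 1
    if not 1 <= num_batches <= max_age:
        raise ValueError(
            f"max_num_in_flight_batches_in_generation={num_batches} must be within "
            f"[1, max_trajectory_age_steps={max_age}]"
        )
    return num_batches
```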

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a035bc and b96d847.

📒 Files selected for processing (4)
  • docs/guides/async-grpo.md (3 hunks)
  • examples/configs/grpo_math_1B.yaml (1 hunks)
  • nemo_rl/algorithms/async_utils.py (1 hunks)
  • nemo_rl/algorithms/grpo.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/async_utils.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/async_utils.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/grpo_math_1B.yaml
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When a markdown doc under docs/**/*.md is added or renamed, update docs/index.md to include it in the appropriate section

Files:

  • docs/guides/async-grpo.md
🧠 Learnings (3)
📓 Common learnings
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • nemo_rl/algorithms/grpo.py
  • examples/configs/grpo_math_1B.yaml
  • docs/guides/async-grpo.md
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code

Applied to files:

  • nemo_rl/algorithms/grpo.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR

Signed-off-by: Youngeun Kwon <[email protected]>
youngeunkwon0405 added the CI:L1 (Run doctests, unit tests, and functional tests) label Nov 11, 2025
# Allowed values are 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps
# Maximum number of in-flight prompts will be max_num_in_flight_batches_in_generation * num_prompts_per_step
# A lower max_num_in_flight_batches_in_generation can reduce the average trajectory age, but may also reduce inference throughput.
max_num_in_flight_batches_in_generation: NotRequired[int]
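
For context, a minimal sketch of how this field might sit in the async GRPO config TypedDict (the class name and sibling field are assumptions for illustration; only the new key and its comments come from the diff above):

```python
from typing import NotRequired, TypedDict


class AsyncGRPOConfig(TypedDict):  # hypothetical class name for illustration
    max_trajectory_age_steps: int
    # Allowed values are 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps.
    # The maximum number of in-flight prompts will be
    # max_num_in_flight_batches_in_generation * num_prompts_per_step.
    max_num_in_flight_batches_in_generation: NotRequired[int]
```
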
Contributor


I am not 100% convinced about the advantage of this flag for our async implementation. We would always want to aim for max-throughput rollouts so as to fully utilize the rollout GPUs. If the user indeed intends to control the off-policy factor, they should use the max trajectory age.

Contributor Author


This PR is to test the benefit of the Pipeline-RL-style implementation.

A potential target case might be like this:

  • A user wants at most 2-off for most of the samples, but there are a few very long-generating samples that cause a straggler effect, which requires 8-off to hide the latency.

In the above case, in the current ToT, if we choose 2-off, it will cause long-exposed generation; if we choose 8-off, it will make everything 8-off.

With this PR, setting max_trajectory_age_steps to a large number (e.g., 8) while setting max_num_in_flight_batches_in_generation=2 might achieve both goals.
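
For concreteness, a sketch of that configuration as the relevant async GRPO settings (values taken from the comment above; the exact YAML nesting in examples/configs/grpo_math_1B.yaml is not reproduced here):

```python
# Scenario sketch: allow trajectories up to 8 steps old so stragglers can be
# hidden, but keep only 2 prompt batches in flight so most samples stay ~2-off.
async_grpo_overrides = {
    "max_trajectory_age_steps": 8,
    "max_num_in_flight_batches_in_generation": 2,
}
```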

But in general, I agree with your comment. Controlling the average age and throughput in this way is too complex for the users. I am okay with holding this PR until we get strong proof that this feature gives any benefit.

Contributor


I agree with the analysis, but we can fix this in a better way than introducing more flags. Our current implementation is very strict, i.e., a trajectory, while it is enqueued, knows exactly when in the future it will be used. We need to relax this condition so that the decision of what goes into a training batch is FIFO whenever the trajectory age fits within the budget.

Contributor Author


I agree that the FIFO restriction limits both performance and our ability to minimize the average trajectory age. We may need more discussion on how to improve this.


Labels

  • CI:L1 Run doctests, unit tests, and functional tests
  • documentation Improvements or additions to documentation
