
Conversation

Contributor

youngeunkwon0405 commented Nov 10, 2025

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added max_num_in_flight_batches_in_generation configuration parameter to control the number of in-flight prompt batches during generation, enabling fine-tuning of throughput versus off-policyness tradeoffs.
  • Documentation

    • Added guidance on using the new parameter, including recommended settings for maximizing throughput and managing training efficiency.

youngeunkwon0405 self-assigned this Nov 10, 2025
youngeunkwon0405 requested review from a team as code owners November 10, 2025 22:52
github-actions bot added the documentation (Improvements or additions to documentation) label Nov 10, 2025
Contributor

coderabbitai bot commented Nov 10, 2025

📝 Walkthrough

This PR introduces a new configuration option max_num_in_flight_batches_in_generation to the async GRPO algorithm, enabling explicit control over the maximum number of in-flight prompt batches during generation. The change includes documentation, configuration schema updates, and implementation modifications to use this parameter instead of deriving it directly from max_trajectory_age_steps.

Changes

  • Documentation Updates (docs/guides/async-grpo.md): Added a new section on controlling max in-flight batches, including configuration details, the valid range constraint (1 ≤ value ≤ max_trajectory_age_steps), throughput guidance, and off-policyness trade-offs.
  • Configuration Schema & Examples (examples/configs/grpo_math_1B.yaml): Added the max_num_in_flight_batches_in_generation field under grpo.async_grpo, with a default that references max_trajectory_age_steps and documentation comments explaining the valid range, the effect on the in-flight prompt calculation, and the parameter interaction.
  • Implementation (nemo_rl/algorithms/async_utils.py, nemo_rl/algorithms/grpo.py): Updated the async GRPO configuration to use the new max_num_in_flight_batches_in_generation parameter when computing in-flight prompt limits; the multiplier in the async utilities is now derived from this explicit field rather than directly from max_trajectory_age_steps (a sketch of this calculation follows below).
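
For intuition, here is a minimal sketch of the in-flight limit calculation described above. Only the relationship (in-flight prompts = in-flight batches * prompts per step) and the field names come from this PR; the function, config shape, and variable names are illustrative assumptions.

```python
# Hypothetical sketch only: illustrates the multiplier change described above.
# The config dict shape and function name are assumptions, not the PR's code.
def max_in_flight_prompts(async_grpo_cfg: dict, num_prompts_per_step: int) -> int:
    # Previously the multiplier was max_trajectory_age_steps itself; the new
    # field allows a smaller value (down to 1) to reduce average trajectory age
    # at a possible cost in generation throughput.
    num_batches = async_grpo_cfg.get(
        "max_num_in_flight_batches_in_generation",
        async_grpo_cfg["max_trajectory_age_steps"],
    )
    return num_batches * num_prompts_per_step
```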

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Implementation changes are straightforward parameter substitution without complex logic modifications
  • Configuration schema addition is a direct field extension with no validation logic changes
  • Changes are homogeneous and scoped to a single feature across consistent file patterns

Suggested labels

documentation, asyncRL

Suggested reviewers

  • parthchadha
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes (⚠️ Warning): The PR introduces a major async GRPO feature affecting parallelism but lacks test results, performance metrics, or convergence verification. Resolution: add comprehensive testing information, including test results, performance benchmarks, convergence verification, and references to any new tests added to validate the feature.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately describes the main change: adding Pipeline-RL-style regulation of in-flight prompts, which is the core feature across all modified files.

Contributor

coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
nemo_rl/algorithms/async_utils.py (1)

281-292: Add runtime validation for max_num_in_flight_batches_in_generation configuration bounds.

Verification confirms that no runtime validation exists for the documented constraint 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps (defined in nemo_rl/algorithms/grpo.py:113). While the 'or 1' fallback at nemo_rl/algorithms/async_utils.py:291 protects the lower bound, the upper bound is unchecked, which could lead to excessive in-flight requests and memory issues.

Add a validation check in nemo_rl/algorithms/async_utils.py around line 284-291 to assert that max_num_in_flight_batches_in_generation does not exceed max_trajectory_age_steps.
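
For reference, a minimal sketch of what such a bounds check could look like (the function name and config access pattern are assumptions, not the PR's actual code):

```python
# Hypothetical sketch of the suggested runtime validation; the config dict
# shape is assumed from the documented constraint in grpo.py.
def validate_in_flight_batches(async_grpo_cfg: dict) -> int:
    max_age = async_grpo_cfg["max_trajectory_age_steps"]
    # Mirror the existing `or 1` lower-bound fallback, then enforce the upper bound.
    num_batches = async_grpo_cfg.get("max_num_in_flight_batches_in_generation") or 1
    if not 1 <= num_batches <= max_age:
        raise ValueError(
            f"max_num_in_flight_batches_in_generation={num_batches} must be within "
            f"[1, max_trajectory_age_steps={max_age}]"
        )
    return num_batches
```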

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a035bc and b96d847.

📒 Files selected for processing (4)
  • docs/guides/async-grpo.md (3 hunks)
  • examples/configs/grpo_math_1B.yaml (1 hunks)
  • nemo_rl/algorithms/async_utils.py (1 hunks)
  • nemo_rl/algorithms/grpo.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/async_utils.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/async_utils.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/grpo_math_1B.yaml
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When a markdown doc under docs/**/*.md is added or renamed, update docs/index.md to include it in the appropriate section

Files:

  • docs/guides/async-grpo.md
🧠 Learnings (3)
📓 Common learnings
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • nemo_rl/algorithms/grpo.py
  • examples/configs/grpo_math_1B.yaml
  • docs/guides/async-grpo.md
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code

Applied to files:

  • nemo_rl/algorithms/grpo.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR

Signed-off-by: Youngeun Kwon <[email protected]>
youngeunkwon0405 added the CI:L1 (Run doctests, unit tests, and functional tests) label Nov 11, 2025
# Allowed values are 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps
# Maximum number of in-flight prompts will be max_num_in_flight_batches_in_generation * num_prompts_per_step
# A lower max_num_in_flight_batches_in_generation can reduce the average trajectory age, but may also reduce inference throughput.
max_num_in_flight_batches_in_generation: NotRequired[int]
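
For context, a minimal sketch of how this field might sit in the async GRPO config TypedDict (the class name and sibling field are assumptions for illustration; only the new key and its comments come from the diff above):

```python
from typing import NotRequired, TypedDict


class AsyncGRPOConfig(TypedDict):  # hypothetical class name for illustration
    max_trajectory_age_steps: int
    # Allowed values are 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps.
    # The maximum number of in-flight prompts will be
    # max_num_in_flight_batches_in_generation * num_prompts_per_step.
    max_num_in_flight_batches_in_generation: NotRequired[int]
```
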
Contributor


I am not 100% convinced about the advantage of this flag for our async implementation. We would always want to aim for max-throughput rollouts so as to fully utilize the rollout GPUs. If the user indeed intends to control the off-policy factor, they should use the max trajectory age.

Contributor Author


This PR is to test the benefit of the Pipeline-RL-style implementation.

A potential target case might be like this:

  • A user wants at most 2-off for most of the samples, but there are a few very long-generating samples that cause a straggler effect, which requires 8-off to hide the latency.

In the above case, in the current ToT, if we choose 2-off, it will cause long-exposed generation; if we choose 8-off, it will make everything 8-off.

With this PR, setting max_trajectory_age_steps to a large number (e.g., 8) while setting max_num_in_flight_batches_in_generation=2 might achieve both goals.
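
For concreteness, a sketch of that configuration as the relevant async GRPO settings (values taken from the comment above; the exact YAML nesting in examples/configs/grpo_math_1B.yaml is not reproduced here):

```python
# Scenario sketch: allow trajectories up to 8 steps old so stragglers can be
# hidden, but keep only 2 prompt batches in flight so most samples stay ~2-off.
async_grpo_overrides = {
    "max_trajectory_age_steps": 8,
    "max_num_in_flight_batches_in_generation": 2,
}
```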

But in general, I agree with your comment. Controlling the average age and throughput in this way is too complex for the users. I am okay with holding this PR until we get strong proof that this feature gives any benefit.

Contributor


I agree with the analysis, but we can fix this in a better way than introducing more flags. Our current implementation is very strict, i.e., a trajectory, while it is enqueued, knows exactly when in the future it will be used. We need to relax this condition so that the decision of what goes into a training batch is FIFO whenever the trajectory age fits within the budget.

Contributor Author


I agree that the FIFO restriction limits both performance and our ability to minimize the average trajectory age. We may need more discussion on how to improve this.


Labels

  • CI:L1 Run doctests, unit tests, and functional tests
  • documentation Improvements or additions to documentation
