feat: pipeline-rl style # of inflight prompt regulation #1499
Conversation
📝 Walkthrough
This PR introduces a new configuration option.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (1)
nemo_rl/algorithms/async_utils.py (1)
281-292: Add runtime validation for `max_num_in_flight_batches_in_generation` configuration bounds. Verification confirms that no runtime validation exists for the documented constraint `1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps` (defined in `nemo_rl/algorithms/grpo.py:113`). While the `or 1` fallback at `nemo_rl/algorithms/async_utils.py:291` protects the lower bound, the upper bound is unchecked, which could lead to excessive in-flight requests and memory issues.
Add a validation check in `nemo_rl/algorithms/async_utils.py` around lines 284-291 to assert that `max_num_in_flight_batches_in_generation` does not exceed `max_trajectory_age_steps`.
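A minimal sketch of the suggested bounds check. The helper name, the dict-based config access, and the fallback behavior are illustrative assumptions, not the actual NeMo-RL code; only the config key names come from the review above.

```python
def validate_in_flight_config(cfg: dict) -> int:
    """Return a validated number of in-flight batches (hypothetical helper).

    Mirrors the `or 1` fallback noted in the review for the lower bound,
    and adds the missing upper-bound check against max_trajectory_age_steps.
    """
    max_age = cfg["max_trajectory_age_steps"]
    n_in_flight = cfg.get("max_num_in_flight_batches_in_generation") or 1
    if not (1 <= n_in_flight <= max_age):
        raise ValueError(
            f"max_num_in_flight_batches_in_generation={n_in_flight} must be "
            f"within [1, max_trajectory_age_steps={max_age}]"
        )
    return n_in_flight
```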
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/guides/async-grpo.md (3 hunks)
- examples/configs/grpo_math_1B.yaml (1 hunk)
- nemo_rl/algorithms/async_utils.py (1 hunk)
- nemo_rl/algorithms/grpo.py (1 hunk)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/algorithms/grpo.py, nemo_rl/algorithms/async_utils.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/algorithms/grpo.py, nemo_rl/algorithms/async_utils.py
examples/configs/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml
Files:
examples/configs/grpo_math_1B.yaml
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When a markdown doc under docs/**/*.md is added or renamed, update docs/index.md to include it in the appropriate section
Files:
docs/guides/async-grpo.md
🧠 Learnings (3)
📓 Common learnings
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.
Applied to files:
nemo_rl/algorithms/grpo.py, examples/configs/grpo_math_1B.yaml, docs/guides/async-grpo.md
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to nemo_rl/**/*.py : When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
Applied to files:
nemo_rl/algorithms/grpo.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
Signed-off-by: Youngeun Kwon <[email protected]>
```python
# Allowed values are 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps
# Maximum number of in-flight prompts will be max_num_in_flight_batches_in_generation * num_prompts_per_step
# By having a lower max_num_in_flight_batches_in_generation, we could reduce the avg trajectory age, but might also reduce the inference throughput.
max_num_in_flight_batches_in_generation: NotRequired[int]
```
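To make the comment above concrete, here is the arithmetic it describes as a tiny sketch; the numeric values are hypothetical examples, not defaults from any shipped config.

```python
def max_in_flight_prompts(num_in_flight_batches: int,
                          num_prompts_per_step: int) -> int:
    """Generation backlog cap implied by the two config values."""
    return num_in_flight_batches * num_prompts_per_step

# With 2 in-flight batches of 32 prompts each, at most 64 prompts
# are in flight at once, regardless of the trajectory-age budget.
print(max_in_flight_prompts(2, 32))
```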
I am not 100% convinced about the advantage of this flag for our async implementation. We would always want to aim for max-throughput rollouts so as to fully utilize the rollout GPUs. If the user indeed intends to control the off-policy factor, they should use the max trajectory age.
This PR is to test the benefit of the Pipeline-RL-style implementation.
A potential target case might be like this:
- A user wants max 2-off for most samples, but a few very long-generating samples cause a straggler effect, requiring 8-off to hide the latency.
In the above case, with the current ToT, choosing 2-off causes long-exposed generation, while choosing 8-off makes everything 8-off.
With this PR, setting max_trajectory_age_steps to a large number (e.g., 8) while setting max_num_in_flight_batches_in_generation=2 might achieve both goals.
But in general, I agree with your comment. Controlling the average age and throughput in this way is too complex for the users. I am okay with holding this PR until we get strong proof that this feature gives any benefit.
I agree with the analysis, but we can fix it in a better way than introducing more flags. Our current implementation is very strict, i.e., a trajectory, while it is enqueued, knows exactly when in the future it will be used. We need to relax this condition so that the decision of what goes into a training batch is FIFO whenever the trajectory age fits within the budget.
I agree that the FIFO restriction is limiting both performance and the capability of minimizing the average age of the trajectory. We may need more discussions on how to improve this.
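The relaxed policy discussed in this thread could be sketched as follows. This is purely illustrative of the idea (oldest-first consumption bounded by an age budget); the function, its queue layout, and the discard behavior for over-age entries are all assumptions, not NeMo-RL code.

```python
from collections import deque

def take_training_batch(queue: deque, current_step: int, batch_size: int,
                        max_age_steps: int) -> list:
    """Pop up to batch_size trajectories whose age fits the budget.

    Each queue entry is (produced_at_step, trajectory). Trajectories are
    consumed FIFO, but entries older than max_age_steps training steps
    are discarded instead of being trained on.
    """
    batch = []
    while queue and len(batch) < batch_size:
        produced_at, traj = queue[0]
        age = current_step - produced_at
        queue.popleft()
        if age > max_age_steps:
            continue  # too stale for the off-policy budget: drop it
        batch.append(traj)
    return batch
```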
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Added the `max_num_in_flight_batches_in_generation` configuration parameter to control the number of in-flight prompt batches during generation, enabling fine-tuning of throughput versus off-policyness tradeoffs.
Documentation