feat: pipeline-rl style # of inflight prompt regulation #1499
@@ -108,6 +108,12 @@ class AsyncGRPOConfig(TypedDict):
    in_flight_weight_updates: NotRequired[bool]
    # Recomputes the KV cache after the in-flight weight updates.
    recompute_kv_cache_after_weight_updates: NotRequired[bool]
    # Maximum number of in-flight prompt batches in generation.
    # Required to enable pipeline-rl style async-grpo training.
    # Allowed values: 1 <= max_num_in_flight_batches_in_generation <= max_trajectory_age_steps
    # The maximum number of in-flight prompts is max_num_in_flight_batches_in_generation * num_prompts_per_step.
    # A lower value reduces the average trajectory age, but may also reduce inference throughput.
    max_num_in_flight_batches_in_generation: NotRequired[int]
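The relationship between the config values and the in-flight prompt bound can be sketched as below. This is a hypothetical helper, not a NeMo-RL API; only the three config key names come from the diff above.

```python
def max_in_flight_prompts(cfg: dict) -> int:
    """Upper bound on prompts concurrently in generation, per the
    comments in AsyncGRPOConfig. Hypothetical helper for illustration."""
    batches = cfg.get("max_num_in_flight_batches_in_generation", 1)
    age_budget = cfg["max_trajectory_age_steps"]
    # The flag is only valid within 1 <= value <= max_trajectory_age_steps.
    if not (1 <= batches <= age_budget):
        raise ValueError(
            "max_num_in_flight_batches_in_generation must satisfy "
            "1 <= value <= max_trajectory_age_steps"
        )
    # Total in-flight prompts = batches in flight * prompts per batch.
    return batches * cfg["num_prompts_per_step"]
```

For example, with `num_prompts_per_step=64`, `max_trajectory_age_steps=8`, and the flag set to 2, at most 128 prompts are in generation at once.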
Contributor
I am not 100% convinced about the advantage of this flag for our async implementation. We would always want to aim for maximum-throughput rollouts so as to fully utilize the rollout GPUs. If the user indeed intends to control the off-policy factor, they should use the max trajectory age.
Contributor
Author
This PR is to test the benefit of the Pipeline-RL-style implementation. A potential target case might be like this:
In the above case, in the current ToT, if we choose 2-off it will cause long-exposed generation; if we choose 8-off it will make everything 8-off. With this PR, that trade-off can be controlled by setting `max_num_in_flight_batches_in_generation`. But in general, I agree with your comment: controlling the average age and the throughput in this way is too complex for users. I am okay with holding this PR until we get strong proof that this feature gives any benefit.
Contributor
I agree with the analysis, but we can fix it in a better way than introducing more flags. Our current implementation is very strict: a trajectory, while it is enqueued, knows exactly when in the future it will be used. We need to relax this condition so that the decision of what goes into a training batch is FIFO whenever the trajectory age fits within the budget.
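The relaxed, age-budgeted FIFO batching idea suggested here could look roughly like the sketch below. All names (`take_training_batch`, the `generated_at_step` field) are hypothetical illustrations, not NeMo-RL APIs.

```python
from collections import deque

def take_training_batch(buffer: deque, current_step: int,
                        batch_size: int, max_age_steps: int) -> list:
    """Pop trajectories in FIFO order, keeping any whose age fits the
    budget instead of pre-assigning each trajectory to a future step."""
    batch = []
    while buffer and len(batch) < batch_size:
        traj = buffer.popleft()
        if current_step - traj["generated_at_step"] <= max_age_steps:
            batch.append(traj)
        # else: trajectory exceeded the age budget; drop it
    return batch
```

Under this scheme a trajectory's consumer is decided only at dequeue time, so rollouts can run at full throughput while the age budget is still enforced.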
Contributor
Author
I agree that the FIFO restriction is limiting both performance and the ability to minimize the average trajectory age. We may need more discussion on how to improve this.