@maitchison
Contributor

Motivation

GRPO-style centering introduces bias into the gradient estimate: subtracting the group mean, which includes each sample's own reward, scales the expected gradient by a factor of $$\frac{G-1}{G}$$, where $$G$$ is the group size. While the effect is generally small and can be absorbed into the learning rate, it would be preferable for the learning rate not to depend on group size in this way.

References

Changes

  • Added a correction to the advantages to adjust for the bias introduced by GRPO reward centering.
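
As a rough illustration of the kind of correction described here, the sketch below rescales group-centered advantages by $$\frac{G}{G-1}$$. It is not the code from this PR: the function names `centered_advantages` and `leave_one_out_advantages` and the `correct_bias` flag are hypothetical, and rewards are assumed to be arranged with the group along the last axis. It only covers mean centering, not any division by the group standard deviation.

```python
import numpy as np


def centered_advantages(rewards, correct_bias=True):
    """Group-mean-centered advantages, optionally rescaled by G / (G - 1).

    Hypothetical helper for illustration; `rewards` has shape (..., G),
    where G is the group size.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    G = rewards.shape[-1]
    # GRPO-style centering: subtract the mean of the whole group,
    # which includes each sample's own reward.
    adv = rewards - rewards.mean(axis=-1, keepdims=True)
    if correct_bias and G > 1:
        # Undo the (G - 1) / G shrinkage caused by the self-inclusive mean.
        adv = adv * G / (G - 1)
    return adv


def leave_one_out_advantages(rewards):
    """Advantages against the mean of the *other* G - 1 rewards in the group.

    Requires G > 1.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    G = rewards.shape[-1]
    total = rewards.sum(axis=-1, keepdims=True)
    return rewards - (total - rewards) / (G - 1)


if __name__ == "__main__":
    r = np.array([[1.0, 0.0, 0.0, 1.0]])
    print(centered_advantages(r, correct_bias=False))
    print(centered_advantages(r))
    print(leave_one_out_advantages(r))
```

With the rescaling applied, the centered advantages coincide with the leave-one-out advantages, whose baseline does not depend on the sample's own reward. That is one way to see why the factor is $$\frac{G}{G-1}$$.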

@Tiiiger
Collaborator

Tiiiger commented Dec 2, 2025

Hi @maitchison, thanks for the contribution! I don't think this technique is super common these days, so we will not have it in the official cookbook. Feel free to add it in your experiments though.

@maitchison
Contributor Author

OK, no worries, thanks for taking a look. I've now closed the PR.

@maitchison maitchison closed this Dec 2, 2025