Correct for gradient bias in GRPO style reward centering. #135

maitchison · 2025-11-27T23:50:53Z

Motivation

GRPO-style centering introduces bias into the gradient estimate: the estimate is scaled by a factor of $$\frac{G-1}{G}$$, where $$G$$ is the group size. While the resulting bias is generally small and can be absorbed into the learning rate, it would be preferable not to have the learning rate depend on group size in this way.
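
A short sketch of where the $$\frac{G-1}{G}$$ factor comes from, assuming the $$G$$ rewards $$r_1, \dots, r_G$$ in a group come from independent samples of the same policy (the notation here is illustrative and not taken from the PR diff). Centering on the group mean can be rewritten in terms of the leave-one-out mean:

$$
r_i - \frac{1}{G}\sum_{j=1}^{G} r_j
= \frac{G-1}{G}\, r_i - \frac{1}{G}\sum_{j \neq i} r_j
= \frac{G-1}{G}\left(r_i - \frac{1}{G-1}\sum_{j \neq i} r_j\right)
$$

The leave-one-out baseline on the right does not depend on sample $$i$$, so it leaves the policy gradient estimate unbiased. The mean-centered advantage is therefore that unbiased advantage shrunk by $$\frac{G-1}{G}$$, and multiplying by $$\frac{G}{G-1}$$ undoes the shrinkage.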

References

See https://arxiv.org/pdf/2503.20783 (Pg.14)

Changes

Added a correction to the advantages to adjust for the bias introduced by GRPO reward centering.
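
A minimal sketch of what such a correction could look like, assuming the advantages are plain mean-centered rewards within one group. The function names below are hypothetical and are not taken from the PR diff; the leave-one-out function is included only to check that the rescaled advantages match it.

```python
def centered_advantages(rewards):
    """GRPO-style centering: subtract the group mean from each reward."""
    group_mean = sum(rewards) / len(rewards)
    return [r - group_mean for r in rewards]


def debiased_advantages(rewards):
    """Rescale centered advantages by G / (G - 1) (hypothetical helper)."""
    group_size = len(rewards)
    if group_size < 2:
        raise ValueError("group size must be at least 2 to apply the correction")
    scale = group_size / (group_size - 1)
    return [a * scale for a in centered_advantages(rewards)]


def leave_one_out_advantages(rewards):
    """Reference: each reward minus the mean of the other rewards in the group."""
    group_size = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (group_size - 1) for r in rewards]
```

With this scaling, the corrected advantages coincide with leave-one-out advantages for any group of two or more rewards, so the effective step size no longer shifts with the group size.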

Tiiiger · 2025-12-02T02:13:28Z

hi @maitchison thanks for the contribution! but I don't think this technique is super common these days so we will not have it in the official cookbook. Feel free to add it in your experiments though.

maitchison · 2025-12-02T03:35:22Z

OK, no worries, thanks for taking a look. I've now closed the PR.

Matthew Aitchison added 6 commits November 28, 2025 12:40

debias the gradient

239cf53

correct indentation

1725945

use correct ratio

690f925

fix typo

0f838c0

fix formatting for ruff

4016afc

another ruff formatting fix

d6c8541

maitchison closed this Dec 2, 2025
