sdan commented Nov 16, 2025

When a trajectory group has size 1, centering the advantage within that group gives reward - reward = 0, so the advantage of its single trajectory is always zero.

The fix is to use the batch mean as the baseline instead when the group size is 1.

Also added advantage stats logging (mean/std/min/max).
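
A minimal sketch of the idea described above, not the code from this PR. The function names `compute_advantages` and `advantage_stats` and the list-of-lists input layout are assumptions made for illustration, and the sketch assumes a non-empty batch. Groups with more than one trajectory are centered on their own mean; singleton groups fall back to the mean over all rewards in the batch.

```python
from statistics import mean, pstdev


def compute_advantages(group_rewards):
    """Center rewards per group, with a batch-mean fallback for singleton groups.

    group_rewards: list of lists of floats, one inner list per trajectory
    group (hypothetical layout, chosen for this sketch).
    """
    # Mean over every reward in the batch, regardless of group.
    batch_mean = mean(r for group in group_rewards for r in group)

    advantages = []
    for group in group_rewards:
        # A singleton group centered on itself always yields 0,
        # so use the batch mean as the baseline in that case.
        baseline = batch_mean if len(group) == 1 else mean(group)
        advantages.append([r - baseline for r in group])
    return advantages


def advantage_stats(advantages):
    """Summary statistics (mean/std/min/max) over all advantages in the batch."""
    flat = [a for group in advantages for a in group]
    return {
        "mean": mean(flat),
        "std": pstdev(flat),  # population standard deviation
        "min": min(flat),
        "max": max(flat),
    }
```

Note that if the whole batch consists of a single trajectory, the batch mean equals that reward and the advantage is still zero; the fallback only helps when other rewards are present in the batch.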

Tiiiger (Collaborator) commented Nov 18, 2025

Hi @sdan, thank you for your contribution! But I don't think this is a common/standard enough technique to be merged into the recipe. Please fork if needed!

Tiiiger closed this Nov 18, 2025