Hi,
I read the RLOO paper from Cohere, which claims that PPO (clipping and importance sampling) is unnecessary for RLHF and that plain policy gradient with multiple samples can do the trick.
When reading the code (referenced in the paper), it seems that the PPO loss is indeed used:
`trl/trl/trainer/rloo_trainer.py`, line 392 (commit 0238d96)
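
To make the distinction concrete, here is a minimal sketch of the two losses being contrasted, assuming k completions per prompt and a single sequence-level reward per completion. The function names and tensor shapes are illustrative only and are not taken from the TRL implementation:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: [k, batch]; leave-one-out baseline for each sample is the
    # mean reward of the other k - 1 samples for the same prompt.
    k = rewards.size(0)
    baseline = (rewards.sum(dim=0, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

def plain_pg_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Plain REINFORCE-style policy gradient: no importance ratio, no clipping.
    return -(advantages.detach() * logprobs).mean()

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    # PPO clipped surrogate: importance ratio between current and old policy,
    # clipped to [1 - eps, 1 + eps].
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The question is which of these two forms the RLOO trainer actually optimizes at the referenced line.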