Hi,
I read the RLOO paper from Cohere, which claims that PPO (clipping and importance sampling) is unnecessary for RLHF and that plain policy gradient with multiple samples can do the trick.
When reading the code (referenced in the paper), it seems that the PPO loss is indeed used:
`trl/trl/trainer/rloo_trainer.py`, line 392 (commit 0238d96)
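
To make the distinction concrete, here is a minimal sketch of the two losses being contrasted, assuming k completions per prompt and a single sequence-level reward per completion. The function names and tensor shapes are illustrative only and are not taken from the TRL implementation:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: [k, batch]; leave-one-out baseline for each sample is the
    # mean reward of the other k - 1 samples for the same prompt.
    k = rewards.size(0)
    baseline = (rewards.sum(dim=0, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

def plain_pg_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Plain REINFORCE-style policy gradient: no importance ratio, no clipping.
    return -(advantages.detach() * logprobs).mean()

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    # PPO clipped surrogate: importance ratio between current and old policy,
    # clipped to [1 - eps, 1 + eps].
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The question is which of these two forms the RLOO trainer actually optimizes at the referenced line.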