Method description

Constrained Generative Policy Optimization (CGPO) was introduced by Meta in a recent paper (https://arxiv.org/pdf/2409.20370). It seems to outperform PPO and DPO and is specifically designed to address standard RLHF limitations (reward hacking, lack of accurate alignment guidance, ...).

They introduced three RLHF optimizers: 1. Calibrated-Regularized Policy Gradient (CRPG), 2. Constrained Online DPO (CODPO), and 3. Calibrated-Regularized Reward Ranking Fine-tuning (CRRAFT).

The implementation will likely require the following steps:

I suppose HF is interested in supporting CGPO, but just to be sure, WDYT @lewtun @kashif? Any interest in supporting CGPO?

Contribution

I can work on this. I have already started in #2155. Next steps might be to add the mixture of judges and the CGPO parent class.
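To make the scope a bit more concrete, here is a rough sketch of how I picture the constraint gate (mixture of judges) and a CRPG-style calibrated-reward update fitting together. This is only my reading of the paper, and every name below (`ConstraintJudge`, `mixture_of_judges`, `cgpo_style_step`, `calibrated_reward`, ...) is a hypothetical placeholder, not an existing TRL API:

```python
# Rough sketch of the core CGPO idea as I read the paper: sample completions,
# gate them through a mixture of constraint judges, and update the policy with a
# calibrated reward on the admissible samples (closest in spirit to CRPG).
# Every name here is hypothetical -- nothing below exists in TRL today.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConstraintJudge:
    """One judge in the mixture: decides whether a completion satisfies its constraint."""
    name: str
    check: Callable[[str, str], bool]  # (prompt, completion) -> passes?


def mixture_of_judges(judges: List[ConstraintJudge], prompt: str, completion: str) -> bool:
    # A completion is admissible only if every judge in the mixture accepts it.
    return all(judge.check(prompt, completion) for judge in judges)


def cgpo_style_step(policy, optimizer, prompts, judges, calibrated_reward):
    """One hypothetical online step: generate, filter by constraints, then do a
    simple policy-gradient update weighted by the calibrated reward.

    `policy` is assumed to expose generate() and logprob(); both are placeholders.
    """
    completions = [policy.generate(p) for p in prompts]
    admissible = [
        (p, c, calibrated_reward(p, c))
        for p, c in zip(prompts, completions)
        if mixture_of_judges(judges, p, c)
    ]
    if not admissible:
        return  # every completion violated a constraint; skip this batch
    loss = -sum(r * policy.logprob(p, c) for p, c, r in admissible) / len(admissible)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```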
Hey @gaetanlop I think this would be a really cool addition to the library and one that is likely to be useful to the community now that online methods are becoming more common!
The best part is that they used open datasets in the paper, so we should be able to verify the implementation by trying to reproduce similar training curves.
I suggest starting with small models from Llama or Qwen2.5 and later we can scale up on the HF cluster if needed :)
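Something like the following could serve as a local smoke-test setup before we touch the cluster; the model choice is just an example, and none of this is tied to whatever the final trainer API ends up looking like:

```python
# Possible smoke-test setup: a small Qwen2.5 policy plus a frozen reference copy,
# enough to check that reward/constraint curves move in the right direction
# before scaling up. The model name is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
ref_policy = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference for KL regularization
ref_policy.requires_grad_(False)
```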
Hello @lewtun, thank you for your reply. Indeed, we should be able to verify that everything is correct by trying to reproduce their results. The judges seem to use pretty large models though (Llama 70B), so it might be complicated to run without your cluster (I don't think small models will do the job as judges).
I have implemented the reward part and the mixture of judges in separate PRs; the rest should normally go into a single PR. I can also close the two other PRs and put everything in one PR if you prefer.
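In the meantime, we can exercise the judge interface locally by pointing the same code at a much smaller instruct model and only switching to the 70B judges on the cluster. A rough sketch, where the prompt format and YES/NO parsing are my own assumptions rather than the paper's setup:

```python
# Sketch of an LLM-as-judge constraint check. The paper's judges are ~70B models;
# for local debugging the same interface can point at a small instruct model and
# only switch to the large judge on the cluster. The prompt format and YES/NO
# parsing below are assumptions, not the paper's setup.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")


def passes_constraint(prompt: str, completion: str, constraint: str) -> bool:
    """Ask the judge model whether the completion satisfies the constraint."""
    query = (
        f"Constraint: {constraint}\n"
        f"Prompt: {prompt}\n"
        f"Response: {completion}\n"
        "Does the response satisfy the constraint? Answer YES or NO.\n"
        "Answer:"
    )
    generated = judge(query, max_new_tokens=5)[0]["generated_text"]
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return "YES" in generated[len(query):].upper()
```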