[CGPO] Add support for Constrained Generative Policy Optimization #2156

Open
gaetanlop opened this issue Oct 2, 2024 · 2 comments
Labels: ✨ enhancement (New feature or request)

@gaetanlop (Contributor) commented Oct 2, 2024

Method description

Constrained Generative Policy Optimization (CGPO) was introduced by Meta in a recent paper (https://arxiv.org/pdf/2409.20370). It seems to outperform PPO and DPO and is specifically designed to address standard RLHF limitations such as reward hacking and the lack of accurate alignment guidance across multiple objectives.

They introduced three RLHF optimizers: Calibrated-Regularized Policy Gradient (CRPG), Constrained Online DPO (CODPO), and Calibrated-Regularized Reward Ranking Fine-tuning (CRRAFT).
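
To make this concrete, here is a rough sketch of how a CRPG-style update could combine a calibrated reward with a mixture-of-judges constraint gate. This is not the paper's exact objective and not TRL code; all names, shapes, and the calibration/gating details below are assumptions for illustration only.

```python
import torch


def crpg_style_loss(
    policy_logprobs: torch.Tensor,   # (batch,) summed log-probs of the sampled completions
    rewards: torch.Tensor,           # (batch,) reward-model scores of the sampled completions
    baseline_rewards: torch.Tensor,  # (batch,) reward-model scores of baseline (e.g. SFT) completions
    constraint_mask: torch.Tensor,   # (batch,) 1.0 if every judge in the mixture accepts the sample
) -> torch.Tensor:
    """Hypothetical CRPG-style loss: calibrated reward x constraint gate x policy gradient."""
    # Calibrate the reward against the baseline completion, squashing the
    # difference into (0, 1) so rewards are comparable across prompts.
    calibrated = torch.sigmoid(rewards - baseline_rewards)

    # Zero out samples that violate any constraint (mixture-of-judges gate),
    # then take a REINFORCE-style gradient on the remaining samples.
    gated = constraint_mask * calibrated
    return -(gated.detach() * policy_logprobs).mean()


# Toy usage with random tensors, just to show the expected shapes.
batch = 4
loss = crpg_style_loss(
    policy_logprobs=torch.randn(batch, requires_grad=True),
    rewards=torch.randn(batch),
    baseline_rewards=torch.randn(batch),
    constraint_mask=torch.tensor([1.0, 0.0, 1.0, 1.0]),
)
loss.backward()
```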

The implementation will likely require the following steps:

- [ ] Add calibrated reward modelling support
- [ ] Add the mixture of judges
- [ ] Add the CGPO trainer class

I suppose HF is interested in supporting CGPO, but just to be sure: WDYT @lewtun @kashif? Any interest in supporting it?

Contribution

I can work on this. I have already started in #2155. Next steps might be to add the mixture of judges and the CGPO parent class.
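
For illustration, the mixture of judges could look something like the sketch below. None of these class or function names come from TRL or the paper; the actual interface will be defined in the PRs.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence


class Judge(Protocol):
    """A single constraint checker: returns True if the completion is acceptable."""

    def __call__(self, prompt: str, completion: str) -> bool: ...


@dataclass
class MixtureOfJudges:
    """A sample passes only if every judge (rule-based or LLM-based) accepts it."""

    judges: Sequence[Judge]

    def __call__(self, prompt: str, completion: str) -> bool:
        return all(judge(prompt, completion) for judge in self.judges)


# Example rule-based judge: reject empty completions.
def non_empty_judge(prompt: str, completion: str) -> bool:
    return len(completion.strip()) > 0


# Example LLM-based judge: wraps any "ask the model a yes/no question" callable,
# e.g. a call to a large hosted judge model.
@dataclass
class LLMJudge:
    ask: Callable[[str], str]

    def __call__(self, prompt: str, completion: str) -> bool:
        verdict = self.ask(
            f"Prompt:\n{prompt}\n\nCompletion:\n{completion}\n\n"
            "Does the completion satisfy the constraint? Answer yes or no."
        )
        return verdict.strip().lower().startswith("yes")


mixture = MixtureOfJudges(judges=[non_empty_judge, LLMJudge(ask=lambda q: "yes")])
assert mixture("2+2?", "4")
```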

@lewtun (Member) commented Oct 2, 2024

Hey @gaetanlop, I think this would be a really cool addition to the library, and one that is likely to be useful to the community now that online methods are becoming more common!

The best part is that they used open datasets in the paper, so we should be able to verify the implementation by trying to reproduce training curves like the ones below:

[Screenshot: training curves from the CGPO paper]

I suggest starting with small models from Llama or Qwen2.5 and later we can scale up on the HF cluster if needed :)

@gaetanlop (Contributor, Author) commented Oct 3, 2024

Hello @lewtun, thank you for your reply. Indeed, we should be able to verify that everything is correct by trying to reproduce their results. The judges seem to use pretty large models though (Llama 70B), so it might be complicated to run without your cluster (I don't think small models will do the job as judges).

I have implemented the reward part and the mixture of judges in separate PRs; the rest should fit in a single PR. I can also close the two other PRs and put everything into a single PR if you prefer.
