Method description

Constrained Generative Policy Optimization (CGPO) was introduced by Meta in a recent paper (https://arxiv.org/pdf/2409.20370). It seems to outperform PPO and DPO and is specifically designed to address standard RLHF limitations (reward hacking, lack of accurate alignment guidance, ...).

They introduced three RLHF optimizers: 1. Calibrated-Regularized Policy Gradient (CRPG), 2. Constrained Online DPO (CODPO), and 3. Calibrated-Regularized Reward Ranking Fine-tuning (CRRAFT).

The implementation will likely require the following steps:

I suppose HF is interested in supporting CGPO, but just to be sure, WDYT @lewtun @kashif? Any interest in supporting CGPO?

Contribution

I can work on this. I have already started in #2155. Next steps might be to add the mixture of judges and the CGPO parent class.
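To make the scope a bit more concrete, here is a rough sketch of how I picture the constraint gate (mixture of judges) and a CRPG-style calibrated-reward update fitting together. This is only my reading of the paper, and every name below (`ConstraintJudge`, `mixture_of_judges`, `cgpo_style_step`, `calibrated_reward`, ...) is a hypothetical placeholder, not an existing TRL API:

```python
# Rough sketch of the core CGPO idea as I read the paper: sample completions,
# gate them through a mixture of constraint judges, and update the policy with a
# calibrated reward on the admissible samples (closest in spirit to CRPG).
# Every name here is hypothetical -- nothing below exists in TRL today.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConstraintJudge:
    """One judge in the mixture: decides whether a completion satisfies its constraint."""
    name: str
    check: Callable[[str, str], bool]  # (prompt, completion) -> passes?


def mixture_of_judges(judges: List[ConstraintJudge], prompt: str, completion: str) -> bool:
    # A completion is admissible only if every judge in the mixture accepts it.
    return all(judge.check(prompt, completion) for judge in judges)


def cgpo_style_step(policy, optimizer, prompts, judges, calibrated_reward):
    """One hypothetical online step: generate, filter by constraints, then do a
    simple policy-gradient update weighted by the calibrated reward.

    `policy` is assumed to expose generate() and logprob(); both are placeholders.
    """
    completions = [policy.generate(p) for p in prompts]
    admissible = [
        (p, c, calibrated_reward(p, c))
        for p, c in zip(prompts, completions)
        if mixture_of_judges(judges, p, c)
    ]
    if not admissible:
        return  # every completion violated a constraint; skip this batch
    loss = -sum(r * policy.logprob(p, c) for p, c, r in admissible) / len(admissible)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```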
Hey @gaetanlop I think this would be a really cool addition to the library and one that is likely to be useful to the community now that online methods are becoming more common!
The best part is that they used open datasets in the paper, so we should be able to verify the implementation by trying to reproduce similar training curves.
I suggest starting with small models from Llama or Qwen2.5 and later we can scale up on the HF cluster if needed :)
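Something like the following could serve as a local smoke-test setup before we touch the cluster; the model choice is just an example, and none of this is tied to whatever the final trainer API ends up looking like:

```python
# Possible smoke-test setup: a small Qwen2.5 policy plus a frozen reference copy,
# enough to check that reward/constraint curves move in the right direction
# before scaling up. The model name is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
ref_policy = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference for KL regularization
ref_policy.requires_grad_(False)
```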
Hello @lewtun, thank you for your reply. Indeed, we should be able to verify that everything is correct by trying to reproduce their results. The judges seem to use pretty large models though (Llama 70B), so it might be complicated to run without your cluster (I don't think small models will do the job as judges).
I have implemented the reward part and the mixture of judges in separate PRs; the rest should normally go into a single PR. I can also close the two other PRs and put everything in one PR if you prefer.
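In the meantime, we can exercise the judge interface locally by pointing the same code at a much smaller instruct model and only switching to the 70B judges on the cluster. A rough sketch, where the prompt format and YES/NO parsing are my own assumptions rather than the paper's setup:

```python
# Sketch of an LLM-as-judge constraint check. The paper's judges are ~70B models;
# for local debugging the same interface can point at a small instruct model and
# only switch to the large judge on the cluster. The prompt format and YES/NO
# parsing below are assumptions, not the paper's setup.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")


def passes_constraint(prompt: str, completion: str, constraint: str) -> bool:
    """Ask the judge model whether the completion satisfies the constraint."""
    query = (
        f"Constraint: {constraint}\n"
        f"Prompt: {prompt}\n"
        f"Response: {completion}\n"
        "Does the response satisfy the constraint? Answer YES or NO.\n"
        "Answer:"
    )
    generated = judge(query, max_new_tokens=5)[0]["generated_text"]
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return "YES" in generated[len(query):].upper()
```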