Ask a question #21
Hey, the figure indeed summarizes the process from a high-level perspective, but the details in the paper are what is really happening: we do fine-tune the whole LLM. To be precise, given the description returned by the environment and the goal, we construct a prompt (this is hardcoded). We then give this prompt to the LLM and compute the log probabilities of each possible action following this prompt. This is the policy (hence the LLM), and we sample actions according to these log probabilities. After collecting N steps, we compute the PPO loss and fine-tune the whole LLM according to it.
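Below is a minimal sketch of the action-scoring step described above, assuming a Hugging Face causal LM. GPT-2 is used purely as a placeholder model, and the helper name `action_log_probs`, the example prompt, and the action list are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch: score a finite set of candidate actions with an LLM.
# Model choice (GPT-2), helper names, and the example action list are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def action_log_probs(prompt: str, actions: list[str]) -> torch.Tensor:
    """Return log pi(a | prompt) over a finite set of candidate action strings."""
    scores = []
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    for action in actions:
        ids = tokenizer(prompt + " " + action, return_tensors="pt").input_ids
        logits = model(ids).logits  # gradients flow through here during the PPO update
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
        token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # Sum the log-probabilities of the action tokens only (those after the prompt).
        scores.append(token_lp[:, prompt_len - 1:].sum())
    # Renormalize over the finite action set to obtain the policy distribution.
    return F.log_softmax(torch.stack(scores), dim=0)

# Sampling an action according to this policy:
log_pi = action_log_probs(
    "Goal: go to the red door. Observation: ...",
    ["go forward", "turn left", "turn right"],
)
action_idx = torch.distributions.Categorical(logits=log_pi).sample()
```

After N such environment steps, these log-probabilities enter the standard PPO clipped objective (together with value and advantage estimates), and the resulting gradients update all of the LLM's weights, not just a head on top of frozen outputs.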
Thanks!
Hi, details concerning computational resources can be found at the end of Appendix E of our paper: https://arxiv.org/abs/2302.02662. We did not report the number of tokens, and there is no dataset when using GLAM (i.e., online RL).
Excuse me, I've recently been studying your team's excellent work and I have a question:
My understanding of the whole process is that it first goes through step (a), then step (b), which generates the prompt, followed by the LLM in part (c) outputting results (such as which action to take next) from the prompt. In the final step (d), the system calls the PPO algorithm separately for policy generation and compares it with the output of the LLM. So I think what PPO is fine-tuning is actually the output of the LLM, but the description in the paper seems to indicate that the LLM itself is fine-tuned with PPO. This is where I'm unsure. Would you mind clarifying this for me?
Thank you very much!