Ask a question #21
Hey, the figure indeed summarizes the process from a high-level perspective, but the details in the paper are what is really happening: we do fine-tune the whole LLM. To be precise, given the description returned by the environment and the goal, we construct a prompt (this is hardcoded). We then give this prompt to the LLM and compute the log probabilities of each possible action following this prompt. This is the policy (hence the LLM), and we sample actions according to these log probabilities. After collecting N steps, we compute the PPO loss and fine-tune the whole LLM according to it.
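Below is a minimal sketch of the action-scoring step described above, assuming a Hugging Face causal LM. GPT-2 is used purely as a placeholder model, and the helper name `action_log_probs`, the example prompt, and the action list are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch: score a finite set of candidate actions with an LLM.
# Model choice (GPT-2), helper names, and the example action list are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def action_log_probs(prompt: str, actions: list[str]) -> torch.Tensor:
    """Return log pi(a | prompt) over a finite set of candidate action strings."""
    scores = []
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    for action in actions:
        ids = tokenizer(prompt + " " + action, return_tensors="pt").input_ids
        logits = model(ids).logits  # gradients flow through here during the PPO update
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
        token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # Sum the log-probabilities of the action tokens only (those after the prompt).
        scores.append(token_lp[:, prompt_len - 1:].sum())
    # Renormalize over the finite action set to obtain the policy distribution.
    return F.log_softmax(torch.stack(scores), dim=0)

# Sampling an action according to this policy:
log_pi = action_log_probs(
    "Goal: go to the red door. Observation: ...",
    ["go forward", "turn left", "turn right"],
)
action_idx = torch.distributions.Categorical(logits=log_pi).sample()
```

After N such environment steps, these log-probabilities enter the standard PPO clipped objective (together with value and advantage estimates), and the resulting gradients update all of the LLM's weights, not just a head on top of frozen outputs.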
Thanks!
Hi, details concerning computational resources can be found at the end of Appendix E of our paper: https://arxiv.org/abs/2302.02662. We did not report the number of tokens, and there is no dataset when using GLAM (i.e., online RL).
Excuse me, I've recently been studying your team's excellent work and I have a question:
My understanding of the whole process is that it first goes through step (a), then step (b), which generates the prompt, followed by the LLM in part (c) outputting results (such as which action to take next) from the prompt. In the final step (d), the system calls the PPO algorithm separately for policy generation and compares it with the output of the LLM. So I think what PPO is fine-tuning is actually the output of the LLM, but the description in the paper seems to indicate that the LLM itself is fine-tuned with PPO. This is where I'm unsure. Would you mind clarifying this for me?
Thank you very much!