Asking about the details of the RLHF algorithm #19

Open
WilfChen opened this issue Mar 3, 2023 · 2 comments

Comments

@WilfChen

WilfChen commented Mar 3, 2023

Usually the PPO algorithm collects one episode of data and computes the discounted return / advantage / GAE over the whole episode to update the Critic.
In a sentiment-analysis or dialogue task, what counts as an episode?

@HarderThenHarder
Owner

HarderThenHarder commented Mar 3, 2023

Hi, we usually treat one sentence as one episode.

Take a dialogue system as an example: we view generating a sentence as a sequential decision-making task, where generating each token counts as one RL step.

At each step, the previously generated tokens (the history state) serve as the observation; the action is picking one specific token from the vocabulary, so the action space is the vocabulary size (vocab_size).

Following this logic, generating a complete sentence yields a complete trajectory containing the action taken at every step. The total reward is the score given to that sentence, and from it we can compute the discounted reward for each step's action (i.e., a reward for every token generated in the sequence).
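For concreteness, here is a minimal sketch of that idea (not the repo's actual code): it assumes a single sentence-level score from a reward model and a hypothetical discount factor `gamma`, and spreads that score backwards over the tokens as discounted returns.

```python
from typing import List

def per_token_discounted_rewards(num_tokens: int,
                                 sentence_reward: float,
                                 gamma: float = 0.99) -> List[float]:
    """Treat each generated token as one step of the trajectory.

    Only the final step receives the sentence-level reward; earlier steps
    get 0. Propagating the return backwards gives every token
    G_t = gamma^(T-1-t) * sentence_reward.
    """
    # Per-step rewards: zero everywhere except the last token.
    rewards = [0.0] * (num_tokens - 1) + [sentence_reward]

    # Backward pass computing the discounted return at every step.
    returns = [0.0] * num_tokens
    running = 0.0
    for t in reversed(range(num_tokens)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 5-token sentence that the reward model scores 1.0.
print(per_token_discounted_rewards(5, 1.0))
# -> [0.9606, 0.9703, 0.9801, 0.99, 1.0] (approximately)
```

These per-token returns are what the Critic is regressed against (or what GAE is built from) in the PPO update; the exact implementation in this repo may differ.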

Hope this answers your question :)

@WilfChen
Author

WilfChen commented Mar 3, 2023

That's a very clear explanation, thank you very much.
