What is the difference between reward_model_type=='vm' and reward_model_type=='prm'? #20

Open
shawnye2000 opened this issue Jan 10, 2025 · 1 comment

No description provided.

@shawnye2000 changed the title from "What is the difference for reward_model_type=='vm' and reward_model_type=='prm'?" to "What is the difference between reward_model_type=='vm' and reward_model_type=='prm'?" on Jan 10, 2025

Xingxiangrui commented Jan 15, 2025

+1 for this

In my opinion, the VM phase can be roughly regarded as a training phase. In this phase, the MCTS (Monte Carlo Tree Search) algorithm mainly explores, running simulations to find the best path. It can be seen as a process of exploration and labeling.

The PRM phase, on the other hand, is more like a pure inference phase: the algorithm only needs to explore the paths, without selecting among them.

This may not be entirely accurate. Additions and discussion are welcome!
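
A minimal sketch of the reading above, not the repository's actual code. Every name here (`evaluate`, `rollout_value`, `step_scores`, `simulate`, `score_step`) is hypothetical and only illustrates the guess that `'vm'` relies on simulation to label nodes while `'prm'` just scores the steps of an explored path. The real behavior of `reward_model_type` may differ.

```python
from typing import Callable, List, Union

# Hypothetical callables, assumed for illustration only:
#   simulate(path)            -> outcome of one simulated completion of `path`
#   score_step(prefix, step)  -> score for `step` given the steps before it
Simulate = Callable[[List[str]], float]
ScoreStep = Callable[[List[str], str], float]


def rollout_value(path: List[str], simulate: Simulate, n_sim: int = 8) -> float:
    """'vm' reading: label a node by averaging the outcomes of n_sim simulations."""
    return sum(simulate(path) for _ in range(n_sim)) / n_sim


def step_scores(path: List[str], score_step: ScoreStep) -> List[float]:
    """'prm' reading: score each step of an explored path, with no simulation."""
    return [score_step(path[:i], step) for i, step in enumerate(path)]


def evaluate(
    path: List[str],
    reward_model_type: str,
    simulate: Simulate,
    score_step: ScoreStep,
) -> Union[float, List[float]]:
    """Dispatch on reward_model_type the way the comment above guesses it works."""
    if reward_model_type == "vm":
        return rollout_value(path, simulate)
    if reward_model_type == "prm":
        return step_scores(path, score_step)
    raise ValueError(f"unknown reward_model_type: {reward_model_type!r}")
```

Under this reading, `'vm'` yields one value per node (a label produced by exploration), while `'prm'` yields one score per step of a path that has already been explored.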
