What is the difference between reward_model_type=='vm' and reward_model_type=='prm'? #20

shawnye2000 · 2025-01-10T09:17:50Z

No description provided.

Xingxiangrui · 2025-01-15T09:20:58Z

+1 for this

In my opinion, the VM phase can be roughly regarded as the training phase. In this phase, the MCTS (Monte Carlo Tree Search) algorithm mainly explores and runs simulations to find the best path, so it can be seen as a process of exploration and labeling.

The PRM phase, on the other hand, looks more like pure inference. In this phase, the algorithm only needs to expand the paths and does not select among them.
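
To make the distinction above concrete, here is a minimal toy sketch of how I read the two modes. It is not taken from this repository: `toy_score`, `search_vm`, `explore_prm`, and the constants `BRANCH` and `DEPTH` are all hypothetical names, and the score function is a stand-in for a learned model.

```python
import math
import random

# Hypothetical toy tree: every node has BRANCH children, paths have DEPTH steps.
BRANCH, DEPTH = 2, 3


def toy_score(path):
    """Stand-in for a learned model's score in [0, 1] (illustrative only)."""
    return sum(path) / (len(path) * (BRANCH - 1)) if path else 0.0


def search_vm(n_sim=50, c=1.4, seed=0):
    """'vm'-style reading: MCTS with selection, simulation, and backup."""
    rng = random.Random(seed)
    visits, value = {(): 0}, {(): 0.0}
    for _ in range(n_sim):
        path = ()
        # Selection / expansion: descend by UCT until a new node is added.
        while len(path) < DEPTH:
            children = [path + (a,) for a in range(BRANCH)]
            unvisited = [ch for ch in children if ch not in visits]
            if unvisited:
                path = rng.choice(unvisited)
                visits[path], value[path] = 0, 0.0
                break
            parent = path
            path = max(
                children,
                key=lambda ch: value[ch] / visits[ch]
                + c * math.sqrt(math.log(visits[parent]) / visits[ch]),
            )
        # Simulation: random rollout from the reached node to a full path.
        rollout = path
        while len(rollout) < DEPTH:
            rollout += (rng.randrange(BRANCH),)
        reward = toy_score(rollout)
        # Backup: credit the reward to every prefix of the reached node.
        for i in range(len(path) + 1):
            visits[path[:i]] += 1
            value[path[:i]] += reward
    # Path selection: follow the most-visited child at each level.
    best = ()
    while len(best) < DEPTH:
        children = [best + (a,) for a in range(BRANCH) if best + (a,) in visits]
        if not children:
            break
        best = max(children, key=lambda ch: visits[ch])
    return best


def explore_prm():
    """'prm'-style reading: expand every path and score each step, no selection."""
    paths = [()]
    for _ in range(DEPTH):
        paths = [p + (a,) for p in paths for a in range(BRANCH)]
    # One score per step prefix, as a process reward model would provide.
    return {p: [toy_score(p[:i]) for i in range(1, DEPTH + 1)] for p in paths}
```

In this sketch `search_vm()` returns a single chosen path, while `explore_prm()` returns every path together with its per-step scores and leaves any choice to the caller.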

This may not be entirely accurate. Additions and discussion are welcome!

shawnye2000 changed the title ~~What is the difference for reward_model_type=='vm' and reward_model_type=='prm'?~~ What is the difference between reward_model_type=='vm' and reward_model_type=='prm'? Jan 10, 2025
