About critic-guide search in InternLM2.5-StepProver #42

LuoKaiGSW · 2024-10-25T09:14:58Z

Thank you for your work on InternLM2.5-StepProver. I have two questions:

The difference between critic-guide search and BFS lies in using the critic model's score to guide the search process. However, this model uses state information as input, so if the sampling number is 32, each step needs to interact with the Lean compiler up to 32 times, which is a significant time cost. In contrast, BFS only needs to interact with Lean once per step. Is this understanding correct?
For BFS, if an error occurs at a certain step, the search will terminate. But for critic-guide search, to obtain the critic model score, the prerequisite is that the compilation succeeds to get the state information. Therefore, the tactics that result in errors will be discarded. This setup seems somewhat unfair, doesn't it?

wzj423 · 2024-10-26T02:13:08Z

The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.
It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

bhwangfy · 2024-10-28T02:37:56Z

Following the question, may I ask what is the template of the prompt for the critic model? I wrote a version according to what the paper mentions, but not sure if it is proper:

prompt = "Which state is closer to 'no goals'?"
answer1 = "no goals"
chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer1},
    ]
critic_model.get_score(tokenizer, chat)

LuoKaiGSW · 2024-10-28T07:07:09Z

The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.

It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

Thank you for your reply. I have tried the two points mentioned in the paper:

Using your open-source critic model to score and guide BFS, which indeed achieved positive results on minif2f.
Adding the previous proof to the prompt and deleting some data that seemed problematic to me, but this actually led to a performance decline. Of course, it is possible that I mistakenly deleted some data, so I would like to ask about the following example of the rot_add_assoc theorem. It appears to be proved at step id=2, showing no_goals, so I discarded steps id=3-7. Is this the correct approach?

LuoKaiGSW · 2024-10-28T07:10:16Z

Following the question, may I ask what is the template of the prompt for the critic model? I wrote a version according to what the paper mentions, but not sure if it is proper:
prompt = "Which state is closer to 'no goals'?"
answer1 = "no goals"
chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer1},
    ]
critic_model.get_score(tokenizer, chat)

Yes, I think it is right.

wzj423 · 2024-10-29T01:35:38Z

The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.

It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

Thank you for your reply. I have tried the two points mentioned in the paper:

Using your open-source critic model to score and guide BFS, which indeed achieved positive results on minif2f.

Adding the previous proof to the prompt and deleting some data that seemed problematic to me, but this actually led to a performance decline. Of course, it is possible that I mistakenly deleted some data, so I would like to ask about the following example of the rot_add_assoc theorem. It appears to be proved at step id=2, showing no_goals, so I discarded steps id=3-7. Is this the correct approach?

The search algorithm stops immediately when "no goals" is met. Regarding the previous proof issue, here are some related discussions: Lean Zulip Chat. It is designed to alleviate the "swinging back and forth" issue. However, we haven't conducted experiments on other base models, so whether this phenomenon is model/approach-agnostic is still unknown.

LuoKaiGSW · 2024-10-30T03:38:13Z

The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.

It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

Thank you for your reply. I have tried the two points mentioned in the paper:

Using your open-source critic model to score and guide BFS, which indeed achieved positive results on minif2f.

Adding the previous proof to the prompt and deleting some data that seemed problematic to me, but this actually led to a performance decline. Of course, it is possible that I mistakenly deleted some data, so I would like to ask about the following example of the rot_add_assoc theorem. It appears to be proved at step id=2, showing no_goals, so I discarded steps id=3-7. Is this the correct approach?

The search algorithm stops immediately when "no goals" is met. Regarding the previous proof issue, here are some related discussions: Lean Zulip Chat. It is designed to alleviate the "swinging back and forth" issue. However, we haven't conducted experiments on other base models, so whether this phenomenon is model/approach-agnostic is still unknown.

Thank you for your reply. I will conduct more experiments to verify this conclusion.

wzj423 closed this as completed Oct 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About critic-guide search in InternLM2.5-StepProver #42

About critic-guide search in InternLM2.5-StepProver #42

LuoKaiGSW commented Oct 25, 2024

wzj423 commented Oct 26, 2024

bhwangfy commented Oct 28, 2024 •

edited

Loading

LuoKaiGSW commented Oct 28, 2024

LuoKaiGSW commented Oct 28, 2024

wzj423 commented Oct 29, 2024

LuoKaiGSW commented Oct 30, 2024

About critic-guide search in InternLM2.5-StepProver #42

About critic-guide search in InternLM2.5-StepProver #42

Comments

LuoKaiGSW commented Oct 25, 2024

wzj423 commented Oct 26, 2024

bhwangfy commented Oct 28, 2024 • edited Loading

LuoKaiGSW commented Oct 28, 2024

LuoKaiGSW commented Oct 28, 2024

wzj423 commented Oct 29, 2024

LuoKaiGSW commented Oct 30, 2024

bhwangfy commented Oct 28, 2024 •

edited

Loading