Skip to content

About critic-guide search in InternLM2.5-StepProver #42

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
LuoKaiGSW opened this issue Oct 25, 2024 · 6 comments
Closed

About critic-guide search in InternLM2.5-StepProver #42

LuoKaiGSW opened this issue Oct 25, 2024 · 6 comments

Comments

@LuoKaiGSW
Copy link

Thank you for your work on InternLM2.5-StepProver. I have two questions:

  1. The difference between critic-guide search and BFS lies in using the critic model's score to guide the search process. However, this model uses state information as input, so if the sampling number is 32, each step needs to interact with the Lean compiler up to 32 times, which is a significant time cost. In contrast, BFS only needs to interact with Lean once per step. Is this understanding correct?
  2. For BFS, if an error occurs at a certain step, the search will terminate. But for critic-guide search, to obtain the critic model score, the prerequisite is that the compilation succeeds to get the state information. Therefore, the tactics that result in errors will be discarded. This setup seems somewhat unfair, doesn't it?
@wzj423
Copy link
Collaborator

wzj423 commented Oct 26, 2024

  1. The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.

  2. It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

@bhwangfy
Copy link

bhwangfy commented Oct 28, 2024

Following the question, may I ask what is the template of the prompt for the critic model? I wrote a version according to what the paper mentions, but not sure if it is proper:

prompt = "Which state is closer to 'no goals'?"
answer1 = "no goals"
chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer1},
    ]
critic_model.get_score(tokenizer, chat)

@LuoKaiGSW
Copy link
Author

  1. The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.
  2. It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

Thank you for your reply. I have tried the two points mentioned in the paper:

  1. Using your open-source critic model to score and guide BFS, which indeed achieved positive results on minif2f.
  2. Adding the previous proof to the prompt and deleting some data that seemed problematic to me, but this actually led to a performance decline. Of course, it is possible that I mistakenly deleted some data, so I would like to ask about the following example of the rot_add_assoc theorem. It appears to be proved at step id=2, showing no_goals, so I discarded steps id=3-7. Is this the correct approach?
DX-20241028@2x

@LuoKaiGSW
Copy link
Author

Following the question, may I ask what is the template of the prompt for the critic model? I wrote a version according to what the paper mentions, but not sure if it is proper:

prompt = "Which state is closer to 'no goals'?"
answer1 = "no goals"
chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer1},
    ]
critic_model.get_score(tokenizer, chat)

Yes, I think it is right.

@wzj423
Copy link
Collaborator

wzj423 commented Oct 29, 2024

  1. The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.
  2. It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

Thank you for your reply. I have tried the two points mentioned in the paper:

  1. Using your open-source critic model to score and guide BFS, which indeed achieved positive results on minif2f.
  2. Adding the previous proof to the prompt and deleting some data that seemed problematic to me, but this actually led to a performance decline. Of course, it is possible that I mistakenly deleted some data, so I would like to ask about the following example of the rot_add_assoc theorem. It appears to be proved at step id=2, showing no_goals, so I discarded steps id=3-7. Is this the correct approach?
DX-20241028@2x

The search algorithm stops immediately when "no goals" is met. Regarding the previous proof issue, here are some related discussions: Lean Zulip Chat. It is designed to alleviate the "swinging back and forth" issue. However, we haven't conducted experiments on other base models, so whether this phenomenon is model/approach-agnostic is still unknown.

@LuoKaiGSW
Copy link
Author

  1. The sampling number is 32 for both Best-First and Critic-Guide. This represents the number of tactic candidates generated (they are "candidates" since they might not work), each of which is then passed to the Lean REPL to obtain the next Lean state. BF then ranks these states by averaging log probability scores, while CG passes these Lean states to the critique model for scoring. Thus, there is no difference in how they interact with Lean.
  2. It depends. Since Lean has a C++ kernel, any exceptions thrown by C++ will instantly terminate the Lean process, regardless of CG/BF. Exceptions thrown by Lean itself indicate that certain tactic candidates cannot work. Such candidates will be discarded, of course. Again, the process is essentially the same for CG and BF.

Thank you for your reply. I have tried the two points mentioned in the paper:

  1. Using your open-source critic model to score and guide BFS, which indeed achieved positive results on minif2f.
  2. Adding the previous proof to the prompt and deleting some data that seemed problematic to me, but this actually led to a performance decline. Of course, it is possible that I mistakenly deleted some data, so I would like to ask about the following example of the rot_add_assoc theorem. It appears to be proved at step id=2, showing no_goals, so I discarded steps id=3-7. Is this the correct approach?
DX-20241028@2x

The search algorithm stops immediately when "no goals" is met. Regarding the previous proof issue, here are some related discussions: Lean Zulip Chat. It is designed to alleviate the "swinging back and forth" issue. However, we haven't conducted experiments on other base models, so whether this phenomenon is model/approach-agnostic is still unknown.

Thank you for your reply. I will conduct more experiments to verify this conclusion.

@wzj423 wzj423 closed this as completed Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants