
[Feature]: Retry on failure functionality #2221

Open · neubig opened this issue Jun 3, 2024 · 11 comments
@neubig (Contributor) commented Jun 3, 2024

What problem or use case are you trying to solve?

Sometimes models fail to do their job correctly, and we would benefit from being able to start over from the beginning. There are a few examples of this in the agent literature:

  • Aider recently introduced a harness for SWE-bench evaluation that allows for retries when tests and linting don't pass.
  • @Jiayi-Pan has work on Evaluation and Refinement for web agents that uses a reward model to judge when a web task has failed, a reset mechanism to return to the beginning, and a method for improving prompts based on Reflexion.
  • @niansong1996 has a method, LEVER, that uses a learned verifier to rerank code generation results based on their execution.

Describe the UX of the solution you'd like

Ideally, this would be designed in a general way, so that different strategies could be implemented behind a shared interface. For instance:

from abc import ABC, abstractmethod


class ResetStrategy(ABC):

    @abstractmethod
    def initialize_state(self):
        """Take note of the initial state that should be reset to."""
        ...

    @abstractmethod
    def verify(self) -> bool:
        """Check whether the task has succeeded (e.g. tests pass) or the agent has reached a failure state."""
        ...

    @abstractmethod
    def reset(self):
        """Perform some sort of reset."""
        ...

    @abstractmethod
    def message_on_reset(self) -> str:
        """Create a message to the agent upon reset (e.g. a task with a Reflexion-style prompt)."""
        ...

Then, when using OpenDevin, we could choose an option that says "retry N times when you get stuck", and select the strategy that is used to do so.
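For concreteness, here is a minimal sketch of what that option could look like as a driver loop over the interface above. Note that run_with_retries, the agent.run() and agent.add_message() calls, and the max_retries parameter are all hypothetical illustrations, not existing OpenDevin APIs:

def run_with_retries(agent, strategy: ResetStrategy, max_retries: int = 3) -> bool:
    """Hypothetical driver: run the agent, verify the outcome, reset on failure."""
    strategy.initialize_state()
    for attempt in range(max_retries + 1):
        agent.run()  # hypothetical: one full attempt at the task
        if strategy.verify():  # success, stop retrying
            return True
        if attempt < max_retries:
            strategy.reset()  # roll back to the recorded initial state
            agent.add_message(strategy.message_on_reset())  # hypothetical agent API
    return False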

Do you have thoughts on the technical implementation?

The actual reset strategies would vary based on the task. For instance (a sketch of the Aider-style strategy follows this list):

  • AiderResetStrategy (code reference):
    • initialize_state: save the current git commit of the repository as commit_id
    • verify: tests and linting pass
    • reset: git checkout commit_id
    • message_on_reset: no-op
  • EvalRefineResetStrategy (code reference):
    • initialize_state: save the current web page as initial_page
    • verify: the reward model is positive
    • reset: goto(initial_page)
    • message_on_reset: a Reflexion-style prompt
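To make the interface concrete, here is a minimal sketch of the Aider-style strategy. The repo_dir parameter and the use of `make test` as the test-and-lint command are illustrative assumptions, not OpenDevin code:

import subprocess


class AiderResetStrategy(ResetStrategy):
    """Sketch of a git-based reset strategy, following the outline above."""

    def __init__(self, repo_dir: str):
        self.repo_dir = repo_dir  # assumption: the agent works inside a git repo
        self.commit_id = None

    def initialize_state(self):
        # Record the commit that reset() should return to.
        self.commit_id = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=self.repo_dir, text=True
        ).strip()

    def verify(self) -> bool:
        # Illustrative stand-in for "tests and linting pass"; a real
        # project would run its own test and lint commands here.
        result = subprocess.run(["make", "test"], cwd=self.repo_dir)
        return result.returncode == 0

    def reset(self):
        # git checkout commit_id, as outlined above.
        subprocess.run(
            ["git", "checkout", self.commit_id], cwd=self.repo_dir, check=True
        )

    def message_on_reset(self) -> str:
        # No-op for this strategy.
        return ""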

This could be integrated into OpenDevin itself, allowing for retries in the main app as well.

Additional context:

@neubig neubig added the enhancement New feature or request label Jun 3, 2024
@neubig neubig changed the title [Feature]: Aider-inspired retries in SWE-Bench evaluation [Feature]: Retry on failure functionality Jun 4, 2024
@Jiayi-Pan (Contributor)

Thanks for creating the issue!
Although I don't have much spare bandwidth at the moment, I am definitely interested in bringing EvalRefineResetStrategy and the retry functionality into OpenDevin. I will keep an eye on this issue and contribute once I have the time.

@mamoodi mamoodi added the medium effort Estimated medium effort label Jul 6, 2024

@github-actions github-actions bot added the Stale Inactive for 30 days label Aug 10, 2024
@xingyaoww xingyaoww removed the Stale Inactive for 30 days label Aug 13, 2024

@github-actions github-actions bot added the Stale Inactive for 30 days label Sep 13, 2024
@enyst enyst removed the Stale Inactive for 30 days label Sep 13, 2024
@Vaishakh-SM (Contributor)

Hi!
Is anyone working on this?

@neubig (Contributor, Author) commented Oct 5, 2024

Hey @Vaishakh-SM, I think nobody is working on this, but @xingyaoww was thinking about adding multiple runs to evaluation. I think that would be a parallel effort though, because it would involve running multiple times and picking the best one, as opposed to restarting when the first try didn't work.

If you'd be interested in taking a look, it'd be welcome!

@Vaishakh-SM (Contributor)

This seems like an interesting problem!

I'll take a look and get back to this sometime this week.


@github-actions github-actions bot added the Stale Inactive for 30 days label Nov 11, 2024
@xingyaoww xingyaoww removed the Stale Inactive for 30 days label Nov 11, 2024
@mamoodi (Collaborator) commented Dec 5, 2024

@neubig this is a really old issue. Just want to make sure: we haven't implemented this yet, right?

@neubig (Contributor, Author) commented Dec 5, 2024

Yep, @xingyaoww is working on a critic that could help implement this.


@github-actions github-actions bot added the Stale Inactive for 30 days label Jan 27, 2025
@xingyaoww xingyaoww removed the Stale Inactive for 30 days label Jan 27, 2025
@xingyaoww xingyaoww modified the milestones: 2025-01, 2025-02 Jan 31, 2025
@rbren rbren modified the milestones: 2025-02, 2025-03 Feb 14, 2025
@manzke commented Feb 25, 2025

Having used OpenHands on a whole project now, I can say that a retry mechanism in general would be nice. I've seen several failures where file contents could not be replaced, or, far more often, could not be generated because of model rate limits. The bigger your codebase gets, the more tokens are used and the more often you hit those rate limits.
This led to several cases where files had been deleted but could not be regenerated. Happy to share more insights.

Labels: enhancement (New feature or request), medium effort (Estimated medium effort)
Projects: Status: In Progress
Development: No branches or pull requests
8 participants