Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue #5076: Integration test github action #5077

Merged
merged 38 commits into from
Nov 27, 2024
Merged

Conversation

openhands-agent
Copy link
Contributor

@openhands-agent openhands-agent commented Nov 16, 2024

This PR proposes re-addition/fixes to integration tests:

  • refactor them out of the eval-this workflow
  • running nightly on Haiku and Deepseek, and on label integration-test
  • the reason for both Haiku and Deepseek is mainly:
    • they're cheap LLMs
    • Haiku has native function calling
    • Deepseek doesn't - which allows us to see that/how both types work.
  • tested on Integration tests runs enyst/playground#9
  • fixes to the tests.

Example latest run last night:

image

Please note that Haiku and Deepseek would need to be set up, if we want to do this here. Deepseek has been set up in the past, but it's not working (key depleted?).
Just please let me know which LLMs or how do you think we could we do something like this here. Meanwhile, you can see them chilling just fine in my playground. 😅


Original openhands-agent message

This pull request fixes #5076.

This issue has been successfully resolved. The PR effectively:

  1. Separated the large eval-runner workflow into two distinct workflows:

    • A new integration-runner.yml for integration tests
    • The original eval-runner.yml now focused solely on SWE-Bench evaluation
  2. Properly configured the triggers for the new integration-runner workflow to only execute:

    • When a PR is labeled with 'integration-test'
    • When manually triggered via workflow dispatch
  3. Successfully moved all necessary prerequisites and configuration sections related to integration testing to the new workflow file while maintaining the functionality.

The solution meets all requirements specified in the original issue, particularly the important condition that the action should only run when labeled or manually triggered. The separation of concerns will make the workflows more maintainable and focused on their specific tasks.

Recommended for review and merge as this represents a clean separation of the evaluation workflows without changing their core functionality.

Automatic fix generated by OpenHands 🙌


To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:e5b5bf0-nikolaik   --name openhands-app-e5b5bf0   docker.all-hands.dev/all-hands-ai/openhands:e5b5bf0

Copy link
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@enyst
Copy link
Collaborator

enyst commented Nov 22, 2024

@openhands-agent The last run-integration-tests job, which was added by this PR, has failed. Use the github API to retrieve and read the job logs, and fix it.

Copy link
Contributor

OpenHands started fixing the pr! You can monitor the progress here.

Copy link
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Copy link
Contributor

Trigger by: Pull Request (integration-test label on PR #5077)
Commit: 5a902c4
Integration Tests Evaluation Report
Success rate: 0.00% (0/6)

instance_id success reason
t02_add_bash_hello False Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t06_github_pr_browsing False The answer is not found in any message. Total messages: 0. Messages: []
t04_git_staging False Failed to check for "nothing to commit, working tree clean": On branch master
No commits yet
Changes to be committed:
(use "git rm --cached ..." to unstage)
new file: hello.py.
t05_simple_browsing False The answer is not found in any message. Total messages: 0. Messages: []
t01_fix_simple_typo False File not fixed: This is a stupid typoo.
Really?
No mor typos!
Enjoy!
t03_jupyter_write_file False Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.

You can download the full evaluation outputs here.

@mamoodi
Copy link
Collaborator

mamoodi commented Nov 25, 2024

I think enyst has eyes on this.

@enyst
Copy link
Collaborator

enyst commented Nov 25, 2024

Yes! Browsing works too, now:

enyst#8 (comment)

@enyst enyst marked this pull request as ready for review November 26, 2024 23:36
Copy link
Collaborator

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mamoodi
Copy link
Collaborator

mamoodi commented Nov 27, 2024

Looks good! Just had a question for my own curiosity. Do you know the cost associated with running these integration tests daily for haiku and deepseek?

@mamoodi
Copy link
Collaborator

mamoodi commented Nov 27, 2024

Also did you get Haiku and Deepseek setup? Referring to the description:
"Please note that Haiku and Deepseek would need to be set up, if we want to do this here. Deepseek has been set up in the past, but it's not working (key depleted?)."

Copy link
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Copy link
Contributor

Trigger by: Pull Request (integration-test label on PR #5077)
Commit: f8cc71d
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 0.00% (0/6)

instance_id success reason
t06_github_pr_browsing False The answer is not found in any message. Total messages: 1.
t01_fix_simple_typo False File not fixed: This is a stupid typoo.
Really?
No mor typos!
Enjoy!
t05_simple_browsing False The answer is not found in any message. Total messages: 1.
t03_jupyter_write_file False Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t02_add_bash_hello False Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t04_git_staging False Failed to check for "nothing to commit, working tree clean": On branch master
No commits yet
Changes to be committed:
(use "git rm --cached ..." to unstage)
new file: hello.py.

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id success reason
t05_simple_browsing False The answer is not found in any message. Total messages: 1.
t01_fix_simple_typo False File not fixed: This is a stupid typoo.
Really?
No mor typos!
Enjoy!
t02_add_bash_hello False Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t06_github_pr_browsing False The answer is not found in any message. Total messages: 1.
t04_git_staging False Failed to check for "nothing to commit, working tree clean": On branch master
No commits yet
Changes to be committed:
(use "git rm --cached ..." to unstage)
new file: hello.py.
t03_jupyter_write_file False Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

Copy link
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Copy link
Contributor

Trigger by: Pull Request (integration-test label on PR #5077)
Commit: f4553f5
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 100.00% (6/6)

instance_id success reason
t03_jupyter_write_file True
t01_fix_simple_typo True
t02_add_bash_hello True
t05_simple_browsing True
t06_github_pr_browsing True
t04_git_staging True

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id success reason
t06_github_pr_browsing False The answer is not found in any message. Total messages: 1.
t03_jupyter_write_file False Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t02_add_bash_hello False Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t01_fix_simple_typo False File not fixed: This is a stupid typoo.
Really?
No mor typos!
Enjoy!
t05_simple_browsing False The answer is not found in any message. Total messages: 1.
t04_git_staging False Failed to check for "nothing to commit, working tree clean": On branch master
No commits yet
Changes to be committed:
(use "git rm --cached ..." to unstage)
new file: hello.py.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

@enyst
Copy link
Collaborator

enyst commented Nov 27, 2024

Looks good! Just had a question for my own curiosity. Do you know the cost associated with running these integration tests daily for haiku and deepseek?

They're 6 tasks, each limited to 10 iterations max. 2/6 go on the internet. Nevertheless, the costs I'm seeing with Haiku are USD 0.02-0.05 per task.

Deepseek is lower.

I will fix this to display them all by default. I think it looks super reasonable, though?

@enyst enyst merged commit f0ca223 into main Nov 27, 2024
23 checks passed
@enyst enyst deleted the openhands-fix-issue-5076 branch November 27, 2024 20:31
Copy link
Contributor

Trigger by: Nightly Scheduled Run
Commit: 4374b4a
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 100.00% (6/6)

instance_id success reason
t04_git_staging True
t03_jupyter_write_file True
t01_fix_simple_typo True
t06_github_pr_browsing True
t02_add_bash_hello True
t05_simple_browsing True

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id success reason
t06_github_pr_browsing False The answer is not found in any message. Total messages: 1.
t02_add_bash_hello False Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t01_fix_simple_typo False File not fixed: This is a stupid typoo.
Really?
No mor typos!
Enjoy!
t03_jupyter_write_file False Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t05_simple_browsing False The answer is not found in any message. Total messages: 1.
t04_git_staging False Failed to check for "nothing to commit, working tree clean": On branch master
No commits yet
Changes to be committed:
(use "git rm --cached ..." to unstage)
new file: hello.py.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

Copy link
Contributor

Trigger by: Nightly Scheduled Run
Commit: 59532c9
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 100.00% (6/6)

instance_id success reason
t04_git_staging True
t03_jupyter_write_file True
t01_fix_simple_typo True
t06_github_pr_browsing True
t05_simple_browsing True
t02_add_bash_hello True

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id success reason
t05_simple_browsing False The answer is not found in any message. Total messages: 1.
t02_add_bash_hello False Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t06_github_pr_browsing False The answer is not found in any message. Total messages: 1.
t03_jupyter_write_file False Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t01_fix_simple_typo False File not fixed: This is a stupid typoo.
Really?
No mor typos!
Enjoy!
t04_git_staging False Failed to check for "nothing to commit, working tree clean": On branch master
No commits yet
Changes to be committed:
(use "git rm --cached ..." to unstage)
new file: hello.py.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integration test github action
4 participants