Fix issue #5076: Integration test github action #5077

openhands-agent · 2024-11-16T00:39:40Z

This PR proposes re-adding the integration tests, with fixes:

- refactor them out of the eval-this workflow
- run them nightly on Haiku and Deepseek, and whenever the integration-test label is added
- the reasons for using both Haiku and Deepseek are mainly:
  - they're cheap LLMs
  - Haiku has native function calling
  - Deepseek doesn't, which lets us see that, and how, both types work
- tested in enyst/playground#9 (Integration tests runs)
- fixes to the tests themselves

Example latest run last night:

Please note that Haiku and Deepseek would need to be set up if we want to do this here. Deepseek has been set up in the past, but it's not working (key depleted?).
Please just let me know which LLMs to use, or how you think we could do something like this here. Meanwhile, you can see them chilling just fine in my playground. 😅

Original openhands-agent message

This pull request fixes #5076.

This issue has been successfully resolved. The PR effectively:

- Separated the large eval-runner workflow into two distinct workflows:
  - A new integration-runner.yml for integration tests
  - The original eval-runner.yml, now focused solely on SWE-Bench evaluation
- Properly configured the triggers for the new integration-runner workflow to only execute:
  - When a PR is labeled with 'integration-test'
  - When manually triggered via workflow dispatch
- Successfully moved all necessary prerequisites and configuration sections related to integration testing to the new workflow file while maintaining the functionality.

The solution meets all requirements specified in the original issue, particularly the important condition that the action should only run when labeled or manually triggered. The separation of concerns will make the workflows more maintainable and focused on their specific tasks.
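
The trigger rules described above can be sketched as a small decision function. This is only an illustration, not code from integration-runner.yml: the function name and its parameters are hypothetical, the event names are assumed to be the standard GitHub Actions ones (`pull_request`, `workflow_dispatch`, `schedule`), and the nightly case comes from the PR description rather than from the agent's summary.

```python
from typing import Optional

# The label name is taken from the PR text; the helper itself is hypothetical.
INTEGRATION_LABEL = "integration-test"


def should_run_integration_tests(
    event_name: str,
    action: Optional[str] = None,
    label: Optional[str] = None,
) -> bool:
    """Return True when an event matches one of the triggers described above."""
    if event_name == "workflow_dispatch":
        # Manual trigger from the Actions UI or API.
        return True
    if event_name == "schedule":
        # Nightly run, as proposed in the PR description.
        return True
    if event_name == "pull_request":
        # Only the moment the integration-test label is applied counts;
        # ordinary pushes to the PR do not start the tests.
        return action == "labeled" and label == INTEGRATION_LABEL
    return False
```

Under these assumptions, a plain push to a PR branch (`pull_request` with action `synchronize`) does not start the tests, which is the condition the original issue asked for.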

Recommended for review and merge as this represents a clean separation of the evaluation workflows without changing their core functionality.

Automatic fix generated by OpenHands 🙌

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:e5b5bf0-nikolaik \
  --name openhands-app-e5b5bf0 \
  docker.all-hands.dev/all-hands-ai/openhands:e5b5bf0

github-actions · 2024-11-16T00:58:04Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

enyst · 2024-11-22T21:06:52Z

@openhands-agent The last run-integration-tests job, which was added by this PR, has failed. Use the github API to retrieve and read the job logs, and fix it.

github-actions · 2024-11-22T21:08:02Z

OpenHands started fixing the pr! You can monitor the progress here.

github-actions · 2024-11-23T02:15:49Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2024-11-23T02:20:00Z

Trigger by: Pull Request (integration-test label on PR #5077)
Commit: 5a902c4
Integration Tests Evaluation Report
Success rate: 0.00% (0/6)

instance_id	success	reason
t02_add_bash_hello	False	Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t06_github_pr_browsing	False	The answer is not found in any message. Total messages: 0. Messages: []
t04_git_staging	False	Failed to check for "nothing to commit, working tree clean": On branch master

		No commits yet

		Changes to be committed:
		(use "git rm --cached ..." to unstage)
		new file: hello.py.
t05_simple_browsing	False	The answer is not found in any message. Total messages: 0. Messages: []
t01_fix_simple_typo	False	File not fixed: This is a stupid typoo.
		Really?
		No mor typos!
		Enjoy!
t03_jupyter_write_file	False	Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.

You can download the full evaluation outputs here.
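
For readers unfamiliar with the report format, the "Success rate" line can be reproduced from per-instance results with a few lines of Python. This is a minimal sketch under the assumption that each instance reduces to a boolean; the function name is hypothetical and this is not the code the workflow actually uses.

```python
from typing import Mapping


def success_rate_line(results: Mapping[str, bool]) -> str:
    """Format a summary line in the same shape as the report header."""
    passed = sum(1 for ok in results.values() if ok)
    total = len(results)
    # Guard against an empty result set to avoid dividing by zero.
    rate = 100.0 * passed / total if total else 0.0
    return f"Success rate: {rate:.2f}% ({passed}/{total})"


# Instance ids copied from the report above; all six failed in this run.
first_run = {
    "t01_fix_simple_typo": False,
    "t02_add_bash_hello": False,
    "t03_jupyter_write_file": False,
    "t04_git_staging": False,
    "t05_simple_browsing": False,
    "t06_github_pr_browsing": False,
}
print(success_rate_line(first_run))
```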

mamoodi · 2024-11-25T21:27:52Z

I think enyst has eyes on this.

enyst · 2024-11-25T21:32:03Z

Yes! Browsing works too, now:

enyst#8 (comment)

xingyaoww

LGTM!

mamoodi · 2024-11-27T17:24:28Z

Looks good! Just had a question for my own curiosity. Do you know the cost associated with running these integration tests daily for haiku and deepseek?

mamoodi · 2024-11-27T17:27:20Z

Also, did you get Haiku and Deepseek set up? Referring to the description:
"Please note that Haiku and Deepseek would need to be set up, if we want to do this here. Deepseek has been set up in the past, but it's not working (key depleted?)."

github-actions · 2024-11-27T18:13:03Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2024-11-27T18:18:38Z

Trigger by: Pull Request (integration-test label on PR #5077)
Commit: f8cc71d
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 0.00% (0/6)

instance_id	success	reason
t06_github_pr_browsing	False	The answer is not found in any message. Total messages: 1.
t01_fix_simple_typo	False	File not fixed: This is a stupid typoo.
		Really?
		No mor typos!
		Enjoy!
t05_simple_browsing	False	The answer is not found in any message. Total messages: 1.
t03_jupyter_write_file	False	Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t02_add_bash_hello	False	Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t04_git_staging	False	Failed to check for "nothing to commit, working tree clean": On branch master

		No commits yet

		Changes to be committed:
		(use "git rm --cached ..." to unstage)
		new file: hello.py.

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id	success	reason
t05_simple_browsing	False	The answer is not found in any message. Total messages: 1.
t01_fix_simple_typo	False	File not fixed: This is a stupid typoo.
		Really?
		No mor typos!
		Enjoy!
t02_add_bash_hello	False	Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t06_github_pr_browsing	False	The answer is not found in any message. Total messages: 1.
t04_git_staging	False	Failed to check for "nothing to commit, working tree clean": On branch master

		No commits yet

		Changes to be committed:
		(use "git rm --cached ..." to unstage)
		new file: hello.py.
t03_jupyter_write_file	False	Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

github-actions · 2024-11-27T19:32:56Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2024-11-27T19:39:08Z

Trigger by: Pull Request (integration-test label on PR #5077)
Commit: f4553f5
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 100.00% (6/6)

instance_id	success	reason
t03_jupyter_write_file	True
t01_fix_simple_typo	True
t02_add_bash_hello	True
t05_simple_browsing	True
t06_github_pr_browsing	True
t04_git_staging	True

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id	success	reason
t06_github_pr_browsing	False	The answer is not found in any message. Total messages: 1.
t03_jupyter_write_file	False	Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t02_add_bash_hello	False	Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t01_fix_simple_typo	False	File not fixed: This is a stupid typoo.
		Really?
		No mor typos!
		Enjoy!
t05_simple_browsing	False	The answer is not found in any message. Total messages: 1.
t04_git_staging	False	Failed to check for "nothing to commit, working tree clean": On branch master

		No commits yet

		Changes to be committed:
		(use "git rm --cached ..." to unstage)
		new file: hello.py.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

enyst · 2024-11-27T19:54:23Z

"Looks good! Just had a question for my own curiosity. Do you know the cost associated with running these integration tests daily for haiku and deepseek?"

There are 6 tasks, each limited to 10 iterations max, and 2 of the 6 go on the internet. Even so, the costs I'm seeing with Haiku are USD 0.02-0.05 per task.

Deepseek is lower.

I will fix this to display them all by default. I think it looks super reasonable, though?
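
As a rough worked example of the Haiku figures quoted above (6 tasks at USD 0.02-0.05 each), the per-run and per-month ranges can be computed as follows. The 30-runs-per-month figure is an assumption (one nightly run per day); label-triggered runs would add to it, and no Deepseek numbers are given in the thread beyond "lower".

```python
from typing import Tuple

TASKS_PER_RUN = 6                  # from the comment above
COST_PER_TASK_USD = (0.02, 0.05)   # observed Haiku range, from the comment above
RUNS_PER_MONTH = 30                # assumption: one nightly run per day


def cost_range(tasks: int, per_task: Tuple[float, float], runs: int = 1) -> Tuple[float, float]:
    """Return the (low, high) total cost in USD, rounded to cents."""
    low, high = per_task
    return (round(tasks * low * runs, 2), round(tasks * high * runs, 2))


per_run = cost_range(TASKS_PER_RUN, COST_PER_TASK_USD)
per_month = cost_range(TASKS_PER_RUN, COST_PER_TASK_USD, RUNS_PER_MONTH)
print(f"Haiku per run:   USD {per_run[0]:.2f} to {per_run[1]:.2f}")
print(f"Haiku per month: USD {per_month[0]:.2f} to {per_month[1]:.2f}")
```

Under these assumptions a nightly Haiku run costs roughly USD 0.12 to 0.30, or about USD 3.60 to 9.00 per month.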

github-actions · 2024-11-27T22:40:57Z

Trigger by: Nightly Scheduled Run
Commit: 4374b4a
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 100.00% (6/6)

instance_id	success	reason
t04_git_staging	True
t03_jupyter_write_file	True
t01_fix_simple_typo	True
t06_github_pr_browsing	True
t02_add_bash_hello	True
t05_simple_browsing	True

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id	success	reason
t06_github_pr_browsing	False	The answer is not found in any message. Total messages: 1.
t02_add_bash_hello	False	Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t01_fix_simple_typo	False	File not fixed: This is a stupid typoo.
		Really?
		No mor typos!
		Enjoy!
t03_jupyter_write_file	False	Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t05_simple_browsing	False	The answer is not found in any message. Total messages: 1.
t04_git_staging	False	Failed to check for "nothing to commit, working tree clean": On branch master

		No commits yet

		Changes to be committed:
		(use "git rm --cached ..." to unstage)
		new file: hello.py.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

github-actions · 2024-11-28T22:41:16Z

Trigger by: Nightly Scheduled Run
Commit: 59532c9
Integration Tests Report (Haiku)
Haiku LLM Test Results:
Success rate: 100.00% (6/6)

instance_id	success	reason
t04_git_staging	True
t03_jupyter_write_file	True
t01_fix_simple_typo	True
t06_github_pr_browsing	True
t05_simple_browsing	True
t02_add_bash_hello	True

Integration Tests Report (DeepSeek)
DeepSeek LLM Test Results:
Success rate: 0.00% (0/6)

instance_id	success	reason
t05_simple_browsing	False	The answer is not found in any message. Total messages: 1.
t02_add_bash_hello	False	Failed to cat /workspace/hello.sh: cat: /workspace/hello.sh: No such file or directory.
t06_github_pr_browsing	False	The answer is not found in any message. Total messages: 1.
t03_jupyter_write_file	False	Failed to cat /workspace/test.txt: cat: /workspace/test.txt: No such file or directory.
t01_fix_simple_typo	False	File not fixed: This is a stupid typoo.
		Really?
		No mor typos!
		Enjoy!
t04_git_staging	False	Failed to check for "nothing to commit, working tree clean": On branch master

		No commits yet

		Changes to be committed:
		(use "git rm --cached ..." to unstage)
		new file: hello.py.

Download evaluation outputs (includes both Haiku and DeepSeek results): Download

Fix issue #5076: Integration test github action

eaf3057

github-actions bot mentioned this pull request Nov 16, 2024

Integration test github action #5076

Closed

enyst added the integration-test label Nov 16, 2024

enyst added 3 commits November 23, 2024 03:03

Update integration-runner.yml

14fe4c6

Update integration-runner.yml

b415ad2

Merge branch 'main' into openhands-fix-issue-5076

59be01d

enyst added integration-test and removed integration-test labels Nov 23, 2024

enyst added 4 commits November 23, 2024 15:31

update variables

0fd1ddf

use haiku

bc3f136

use base url

73e8837

fix report name

7af3518

enyst mentioned this pull request Nov 23, 2024

[Bug]: eval-this workflow is not working #5106

Open

1 task

openhands-agent and others added 12 commits November 25, 2024 15:39

Fix pr #8: Integration tests (openhands fix issue 5076)

dcd4681

Revert "Fix pr #8: Integration tests (openhands fix issue 5076)"

1a24a94

This reverts commit dcd4681.

Fix pr #8: Integration tests (openhands fix issue 5076)

5e5eb0f

use haiku explicitly, in results too

1f90867

Merge branch 'int/openhands-fix-issue-5076' of github.com:enyst/playground into int/openhands-fix-issue-5076

4406794

remove duplicate

fa9e651

Merge branch 'main' of github.com:All-Hands-AI/OpenHands into int/openhands-fix-issue-5076

1d848e3

Update .github/workflows/integration-runner.yml

7e7200e

Revert "Update .github/workflows/integration-runner.yml"

96ef986

This reverts commit 7e7200e.

funny space

7c2db5b

Fix pr #8: Integration tests (openhands fix issue 5076)

76df32e

artifact fix

7895120

enyst added 5 commits November 25, 2024 22:37

keep details in logs, not github comment

1956f06

tweak schedule

b5c2519

lint-y

0c22181

Merge branch 'main' of github.com:All-Hands-AI/OpenHands into int/openhands-fix-issue-5076

bc4c9a2

clean up

605a24f

enyst marked this pull request as ready for review November 26, 2024 23:36

enyst requested review from xingyaoww, neubig, rbren and mamoodi November 26, 2024 23:54

xingyaoww approved these changes Nov 27, 2024

set up llms

e5b5bf0

enyst added integration-test and removed integration-test labels Nov 27, 2024

mamoodi approved these changes Nov 27, 2024

enyst merged commit f0ca223 into main Nov 27, 2024
23 checks passed

enyst deleted the openhands-fix-issue-5076 branch November 27, 2024 20:31
