Fix issue #5076: Integration test github action #5077
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
@openhands-agent The last run-integration-tests job, which was added by this PR, has failed. Use the GitHub API to retrieve and read the job logs, and fix it.
Trigger by: Pull Request (integration-test label on PR #5077)
You can download the full evaluation outputs here.
I think enyst has eyes on this.
Yes! Browsing works too, now:
LGTM!
Looks good! Just had a question for my own curiosity. Do you know the cost associated with running these integration tests daily for Haiku and DeepSeek?
Also, did you get Haiku and DeepSeek set up? Referring to the description:
Trigger by: Pull Request (integration-test label on PR #5077)
Integration Tests Report (DeepSeek)
Download evaluation outputs (includes both Haiku and DeepSeek results): Download
There are 6 tasks, each limited to a maximum of 10 iterations, and 2 of the 6 go on the internet. Even so, the costs I'm seeing with Haiku are USD 0.02-0.05 per task. DeepSeek is lower. I will fix this to display them all by default. I think it looks super reasonable, though?
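As a rough back-of-the-envelope check of the Haiku figures, here is a minimal sketch that scales the quoted per-task cost to a full run and to a month of nightly runs. The task count and per-task range come from the comment above. The 30-runs-per-month figure and the `cost_range` helper are assumptions made for illustration only, and DeepSeek is left out because no number was given for it.

```python
# Figures quoted in the comment above.
TASKS_PER_RUN = 6
HAIKU_COST_PER_TASK_USD = (0.02, 0.05)  # (low, high) observed per task

# Assumption, not from the thread: one nightly run per day over a 30-day month.
RUNS_PER_MONTH = 30


def cost_range(per_task, tasks, runs=1):
    """Scale a (low, high) per-task cost to `tasks` tasks over `runs` runs."""
    low, high = per_task
    return (round(low * tasks * runs, 2), round(high * tasks * runs, 2))


per_run = cost_range(HAIKU_COST_PER_TASK_USD, TASKS_PER_RUN)
per_month = cost_range(HAIKU_COST_PER_TASK_USD, TASKS_PER_RUN, RUNS_PER_MONTH)

print(f"Haiku, per run:   USD {per_run[0]:.2f}-{per_run[1]:.2f}")
print(f"Haiku, per month: USD {per_month[0]:.2f}-{per_month[1]:.2f}")
```

Runs triggered by the integration-test label on a PR would come on top of the nightly estimate.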
Trigger by: Nightly Scheduled Run
Integration Tests Report (DeepSeek)
Download evaluation outputs (includes both Haiku and DeepSeek results): Download
This PR proposes re-addition/fixes to integration tests:
eval-this workflow
integration-test
Example latest run last night:
Please note that Haiku and DeepSeek would need to be set up if we want to do this here. DeepSeek has been set up in the past, but it's not working (key depleted?).
Please just let me know which LLMs to use, or how you think we could do something like this here. Meanwhile, you can see them chilling just fine in my playground. 😅
Original openhands-agent message
This pull request fixes #5076.
This issue has been successfully resolved. The PR effectively:
Separated the large eval-runner workflow into two distinct workflows:
integration-runner.yml for integration tests
eval-runner.yml, now focused solely on SWE-Bench evaluation
Properly configured the triggers for the new integration-runner workflow to only execute:
Successfully moved all necessary prerequisites and configuration sections related to integration testing to the new workflow file while maintaining the functionality.
The solution meets all requirements specified in the original issue, particularly the important condition that the action should only run when labeled or manually triggered. The separation of concerns will make the workflows more maintainable and focused on their specific tasks.
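The trigger condition described here can be sketched as a small predicate. This is only an illustration of the gating logic, not the contents of integration-runner.yml, which are not shown in this thread. The function name `should_run_integration_tests` is hypothetical. The event names are GitHub's standard `pull_request`, `workflow_dispatch`, and `schedule` events, and the scheduled case is included because the reports above mention a nightly scheduled run.

```python
from typing import Optional

# Label named in the "Trigger by" lines of the test reports.
INTEGRATION_LABEL = "integration-test"


def should_run_integration_tests(
    event_name: str,
    action: Optional[str] = None,
    label: Optional[str] = None,
) -> bool:
    """Return True if an event should start the integration tests.

    Hypothetical helper: models "run only when labeled, manually
    triggered, or on the nightly schedule".
    """
    # Manual trigger or nightly scheduled run.
    if event_name in ("workflow_dispatch", "schedule"):
        return True
    # Pull request events count only when the integration label is applied.
    if event_name == "pull_request":
        return action == "labeled" and label == INTEGRATION_LABEL
    # Everything else (pushes, PR syncs, other labels) is ignored.
    return False


print(should_run_integration_tests("pull_request", "labeled", "integration-test"))
print(should_run_integration_tests("pull_request", "synchronize"))
```

Under this sketch, pushing new commits to a PR does not start a run. Only applying the label, triggering the workflow manually, or the schedule does.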
Recommended for review and merge as this represents a clean separation of the evaluation workflows without changing their core functionality.
Automatic fix generated by OpenHands 🙌
To run this PR locally, use the following command: