
Fix issue #5076: Integration test github action #5077

Merged · 38 commits · Nov 27, 2024

Commits
eaf3057
Fix issue #5076: Integration test github action
openhands-agent Nov 16, 2024
14fe4c6
Update integration-runner.yml
enyst Nov 23, 2024
b415ad2
Update integration-runner.yml
enyst Nov 23, 2024
59be01d
Merge branch 'main' into openhands-fix-issue-5076
enyst Nov 23, 2024
0fd1ddf
update variables
enyst Nov 23, 2024
bc3f136
use haiku
enyst Nov 23, 2024
73e8837
use base url
enyst Nov 23, 2024
7af3518
fix report name
enyst Nov 23, 2024
dcd4681
Fix pr #8: Integration tests (openhands fix issue 5076)
openhands-agent Nov 25, 2024
1a24a94
Revert "Fix pr #8: Integration tests (openhands fix issue 5076)"
enyst Nov 25, 2024
5e5eb0f
Fix pr #8: Integration tests (openhands fix issue 5076)
openhands-agent Nov 25, 2024
1f90867
use haiku explicitly, in results too
enyst Nov 25, 2024
4406794
Merge branch 'int/openhands-fix-issue-5076' of github.com:enyst/playg…
enyst Nov 25, 2024
fa9e651
remove duplicate
enyst Nov 25, 2024
1d848e3
Merge branch 'main' of github.com:All-Hands-AI/OpenHands into int/ope…
enyst Nov 25, 2024
7e7200e
Update .github/workflows/integration-runner.yml
enyst Nov 25, 2024
96ef986
Revert "Update .github/workflows/integration-runner.yml"
enyst Nov 25, 2024
7c2db5b
funny space
enyst Nov 25, 2024
76df32e
Fix pr #8: Integration tests (openhands fix issue 5076)
openhands-agent Nov 25, 2024
7895120
artifact fix
enyst Nov 25, 2024
4e178d5
clean up remote runtimes
enyst Nov 25, 2024
fa25445
clean up runtimes more aggressively - a bit unexpected though
enyst Nov 25, 2024
4ceda73
Fix pr #8: Integration tests (openhands fix issue 5076)
openhands-agent Nov 25, 2024
194a1fb
fix type issue that was preventing checking results
enyst Nov 25, 2024
57d5906
try with waiting time
enyst Nov 25, 2024
cafedcb
add eval notes
enyst Nov 25, 2024
f935f0d
increase timeouts
enyst Nov 25, 2024
34a30ee
try with CI local builds
enyst Nov 25, 2024
d48fac0
fix eval output
enyst Nov 25, 2024
d4a21d0
set debug
enyst Nov 25, 2024
e391604
fix tests!
enyst Nov 25, 2024
6ff6fe2
fix outputs
enyst Nov 25, 2024
1956f06
keep details in logs, not github comment
enyst Nov 25, 2024
b5c2519
tweak schedule
enyst Nov 25, 2024
0c22181
lint-y
enyst Nov 25, 2024
bc4c9a2
Merge branch 'main' of github.com:All-Hands-AI/OpenHands into int/ope…
enyst Nov 25, 2024
605a24f
clean up
enyst Nov 26, 2024
e5b5bf0
set up llms
enyst Nov 27, 2024
23 changes: 1 addition & 22 deletions .github/workflows/eval-runner.yml
@@ -1,4 +1,4 @@
-name: Run Evaluation
+name: Run SWE-Bench Evaluation

on:
pull_request:
@@ -58,24 +58,6 @@ jobs:
echo "api_key = \"$DEEPSEEK_API_KEY\"" >> config.toml
echo "temperature = 0.0" >> config.toml
- name: Run integration test evaluation
env:
ALLHANDS_API_KEY: ${{ secrets.ALLHANDS_EVAL_RUNTIME_API_KEY }}
RUNTIME: remote
SANDBOX_REMOTE_RUNTIME_API_URL: https://runtime.eval.all-hands.dev
EVAL_DOCKER_IMAGE_PREFIX: us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images

run: |
poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD CodeActAgent '' $N_PROCESSES
# get evaluation report
REPORT_FILE=$(find evaluation/evaluation_outputs/outputs/integration_tests/CodeActAgent/deepseek-chat_maxiter_10_N* -name "report.md" -type f | head -n 1)
echo "REPORT_FILE: $REPORT_FILE"
echo "INTEGRATION_TEST_REPORT<<EOF" >> $GITHUB_ENV
cat $REPORT_FILE >> $GITHUB_ENV
echo >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV
- name: Run SWE-Bench evaluation
env:
ALLHANDS_API_KEY: ${{ secrets.ALLHANDS_EVAL_RUNTIME_API_KEY }}
@@ -143,9 +125,6 @@ jobs:
**SWE-Bench Evaluation Report**
${{ env.SWEBENCH_REPORT }}
---
**Integration Tests Evaluation Report**
${{ env.INTEGRATION_TEST_REPORT }}
---
You can download the full evaluation outputs [here](${{ env.ARTIFACT_URL }}).
- name: Post to a Slack channel
158 changes: 158 additions & 0 deletions .github/workflows/integration-runner.yml
@@ -0,0 +1,158 @@
name: Run Integration Tests

on:
pull_request:
types: [labeled]
workflow_dispatch:
inputs:
reason:
description: 'Reason for manual trigger'
required: true
default: ''
schedule:
- cron: '30 22 * * *' # Runs at 10:30pm UTC every day

env:
N_PROCESSES: 10 # Global configuration for number of parallel processes for evaluation

jobs:
run-integration-tests:
if: github.event.label.name == 'integration-test' || github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
runs-on: ubuntu-latest
permissions:
contents: "read"
id-token: "write"
pull-requests: "write"
issues: "write"
strategy:
matrix:
python-version: ["3.12"]
steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Install poetry via pipx
run: pipx install poetry

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: "poetry"

- name: Comment on PR if 'integration-test' label is present
if: github.event_name == 'pull_request' && github.event.label.name == 'integration-test'
uses: KeisukeYamashita/create-comment@v1
with:
unique: false
comment: |
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

- name: Install Python dependencies using Poetry
run: poetry install --without evaluation,llama-index

- name: Configure config.toml for testing with Haiku
env:
LLM_MODEL: "litellm_proxy/claude-3-5-haiku-20241022"
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
run: |
echo "[llm.eval]" > config.toml
echo "model = \"$LLM_MODEL\"" >> config.toml
echo "api_key = \"$LLM_API_KEY\"" >> config.toml
echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
echo "temperature = 0.0" >> config.toml

- name: Build environment
run: make build

- name: Run integration test evaluation for Haiku
env:
SANDBOX_FORCE_REBUILD_RUNTIME: True
run: |
poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD CodeActAgent '' $N_PROCESSES '' 'haiku_run'

# get integration tests report
REPORT_FILE_HAIKU=$(find evaluation/evaluation_outputs/outputs/integration_tests/CodeActAgent/*haiku*_maxiter_10_N* -name "report.md" -type f | head -n 1)
echo "REPORT_FILE: $REPORT_FILE_HAIKU"
echo "INTEGRATION_TEST_REPORT_HAIKU<<EOF" >> $GITHUB_ENV
cat $REPORT_FILE_HAIKU >> $GITHUB_ENV
echo >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV

- name: Wait a little bit
run: sleep 10

- name: Configure config.toml for testing with DeepSeek
env:
LLM_MODEL: "litellm_proxy/deepseek-chat"
LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
run: |
echo "[llm.eval]" > config.toml
echo "model = \"$LLM_MODEL\"" >> config.toml
echo "api_key = \"$LLM_API_KEY\"" >> config.toml
echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
echo "temperature = 0.0" >> config.toml

- name: Run integration test evaluation for DeepSeek
env:
SANDBOX_FORCE_REBUILD_RUNTIME: True
run: |
poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD CodeActAgent '' $N_PROCESSES '' 'deepseek_run'

# get integration tests report
REPORT_FILE_DEEPSEEK=$(find evaluation/evaluation_outputs/outputs/integration_tests/CodeActAgent/deepseek*_maxiter_10_N* -name "report.md" -type f | head -n 1)
echo "REPORT_FILE: $REPORT_FILE_DEEPSEEK"
echo "INTEGRATION_TEST_REPORT_DEEPSEEK<<EOF" >> $GITHUB_ENV
cat $REPORT_FILE_DEEPSEEK >> $GITHUB_ENV
echo >> $GITHUB_ENV
echo "EOF" >> $GITHUB_ENV

- name: Create archive of evaluation outputs
run: |
TIMESTAMP=$(date +'%y-%m-%d-%H-%M')
cd evaluation/evaluation_outputs/outputs # Change to the outputs directory
tar -czvf ../../../integration_tests_${TIMESTAMP}.tar.gz integration_tests/CodeActAgent/* # Only include the actual result directories

- name: Upload evaluation results as artifact
uses: actions/upload-artifact@v4
id: upload_results_artifact
with:
name: integration-test-outputs-${{ github.run_id }}-${{ github.run_attempt }}
path: integration_tests_*.tar.gz

- name: Get artifact URLs
run: |
echo "ARTIFACT_URL=${{ steps.upload_results_artifact.outputs.artifact-url }}" >> $GITHUB_ENV

- name: Set timestamp and trigger reason
run: |
echo "TIMESTAMP=$(date +'%Y-%m-%d-%H-%M')" >> $GITHUB_ENV
if [[ "${{ github.event_name }}" == "pull_request" ]]; then
echo "TRIGGER_REASON=pr-${{ github.event.pull_request.number }}" >> $GITHUB_ENV
elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
echo "TRIGGER_REASON=manual-${{ github.event.inputs.reason }}" >> $GITHUB_ENV
else
echo "TRIGGER_REASON=nightly-scheduled" >> $GITHUB_ENV
fi

- name: Comment with results and artifact link
id: create_comment
uses: KeisukeYamashita/create-comment@v1
with:
# if triggered by PR, use PR number, otherwise use 5077 as fallback issue number for manual triggers
number: ${{ github.event_name == 'pull_request' && github.event.pull_request.number || 5077 }}
unique: false
comment: |
Trigger by: ${{ github.event_name == 'pull_request' && format('Pull Request (integration-test label on PR #{0})', github.event.pull_request.number) || (github.event_name == 'workflow_dispatch' && format('Manual Trigger: {0}', github.event.inputs.reason)) || 'Nightly Scheduled Run' }}
Commit: ${{ github.sha }}
**Integration Tests Report (Haiku)**
Haiku LLM Test Results:
${{ env.INTEGRATION_TEST_REPORT_HAIKU }}
---
**Integration Tests Report (DeepSeek)**
DeepSeek LLM Test Results:
${{ env.INTEGRATION_TEST_REPORT_DEEPSEEK }}
---
Download evaluation outputs (includes both Haiku and DeepSeek results): [Download](${{ steps.upload_results_artifact.outputs.artifact-url }})
17 changes: 14 additions & 3 deletions evaluation/integration_tests/run_infer.py
@@ -48,13 +48,19 @@ def get_config(
# use default base_container_image
enable_auto_lint=True,
use_host_network=False,
-timeout=100,
+timeout=300,
# Add platform to the sandbox config to solve issue 4401
platform='linux/amd64',
api_key=os.environ.get('ALLHANDS_API_KEY', None),
remote_runtime_api_url=os.environ.get('SANDBOX_REMOTE_RUNTIME_API_URL'),
keep_runtime_alive=False,
remote_runtime_init_timeout=3600,
),
# do not mount workspace
workspace_base=None,
workspace_mount_path=None,
# debug
debug=True,
)
config.set_llm_config(
update_llm_config_for_completions_logging(
@@ -129,7 +135,12 @@ def process_instance(
# # result evaluation
# # =============================================

-histories = [event_to_dict(event) for event in state.history]
+histories = state.history

# some basic check
logger.info(f'Total events in history: {len(histories)}')
assert len(histories) > 0, 'History should not be empty'

test_result: TestResult = test_class.verify_result(runtime, histories)
metrics = state.metrics.get() if state.metrics else None

@@ -139,7 +150,7 @@
instance=instance.to_dict(),
instruction=instruction,
metadata=metadata,
-history=histories,
+history=[event_to_dict(event) for event in histories],
metrics=metrics,
error=state.last_error if state and state.last_error else None,
test_result=test_result.model_dump(),
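Note on the run_infer.py change above: verification now receives the raw Event objects, and serialization to dicts happens only when the output record is written. A minimal, self-contained sketch of this "keep objects, serialize late" pattern — FakeEvent and its field are hypothetical stand-ins, not OpenHands APIs:

```python
# Illustrative sketch only: FakeEvent stands in for OpenHands' Event, and
# asdict() for event_to_dict(); the diff above applies the same idea.
from dataclasses import asdict, dataclass


@dataclass
class FakeEvent:
    content: str


def verify(events: list[FakeEvent]) -> bool:
    # verification works on live objects, so attribute/isinstance checks still apply
    return any('OpenHands is all you need!' in e.content for e in events)


history = [FakeEvent('browsing...'), FakeEvent('The answer is OpenHands is all you need!')]
assert len(history) > 0, 'History should not be empty'
print('success:', verify(history))
# only the persisted output record needs plain dicts
print('history:', [asdict(e) for e in history])
```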
34 changes: 23 additions & 11 deletions evaluation/integration_tests/tests/t05_simple_browsing.py
@@ -108,6 +108,8 @@ def initialize_runtime(cls, runtime: Runtime) -> None:

@classmethod
def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
from openhands.core.logger import openhands_logger as logger

# check if the "The answer is OpenHands is all you need!" is in any message
message_actions = [
event
@@ -116,19 +118,29 @@ def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
event, (MessageAction, AgentFinishAction, AgentDelegateObservation)
)
]
logger.debug(f'Total message-like events: {len(message_actions)}')

for event in message_actions:
if isinstance(event, AgentDelegateObservation):
content = event.content
elif isinstance(event, AgentFinishAction):
content = event.outputs.get('content', '')
elif isinstance(event, MessageAction):
content = event.content
else:
raise ValueError(f'Unknown event type: {type(event)}')
try:
if isinstance(event, AgentDelegateObservation):
content = event.content
elif isinstance(event, AgentFinishAction):
content = event.outputs.get('content', '')
elif isinstance(event, MessageAction):
content = event.content
else:
logger.warning(f'Unexpected event type: {type(event)}')
continue

if 'OpenHands is all you need!' in content:
return TestResult(success=True)
if 'OpenHands is all you need!' in content:
return TestResult(success=True)
except Exception as e:
logger.error(f'Error processing event: {e}')

logger.debug(
f'Total messages: {len(message_actions)}. Messages: {message_actions}'
)
return TestResult(
success=False,
-reason=f'The answer is not found in any message. Total messages: {len(message_actions)}. Messages: {message_actions}',
+reason=f'The answer is not found in any message. Total messages: {len(message_actions)}.',
)
46 changes: 29 additions & 17 deletions evaluation/integration_tests/tests/t06_github_pr_browsing.py
@@ -14,31 +14,43 @@ def initialize_runtime(cls, runtime: Runtime) -> None:

@classmethod
def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
# check if the "The answer is OpenHands is all you need!" is in any message
from openhands.core.logger import openhands_logger as logger

# check if the license information is in any message
message_actions = [
event
for event in histories
if isinstance(
event, (MessageAction, AgentFinishAction, AgentDelegateObservation)
)
]
logger.info(f'Total message-like events: {len(message_actions)}')

for event in message_actions:
if isinstance(event, AgentDelegateObservation):
content = event.content
elif isinstance(event, AgentFinishAction):
content = event.outputs.get('content', '')
elif isinstance(event, MessageAction):
content = event.content
else:
raise ValueError(f'Unknown event type: {type(event)}')

if (
'non-commercial' in content
or 'MIT' in content
or 'Apache 2.0' in content
):
return TestResult(success=True)
try:
if isinstance(event, AgentDelegateObservation):
content = event.content
elif isinstance(event, AgentFinishAction):
content = event.outputs.get('content', '')
elif isinstance(event, MessageAction):
content = event.content
else:
logger.warning(f'Unexpected event type: {type(event)}')
continue

if (
'non-commercial' in content
or 'MIT' in content
or 'Apache 2.0' in content
):
return TestResult(success=True)
except Exception as e:
logger.error(f'Error processing event: {e}')

logger.debug(
f'Total messages: {len(message_actions)}. Messages: {message_actions}'
)
return TestResult(
success=False,
-reason=f'The answer is not found in any message. Total messages: {len(message_actions)}. Messages: {message_actions}',
+reason=f'The answer is not found in any message. Total messages: {len(message_actions)}.',
)
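Both browsing tests now repeat the same per-event-type content extraction. If that logic were ever factored out, a shared helper might look roughly like the sketch below; this is a hypothetical refactor, not part of the PR, and the event-class import paths are assumed to mirror what the tests already import (only the logger import is taken verbatim from the diff).

```python
# Hypothetical shared helper, not part of this PR. Import paths for the event
# classes are assumed to match the existing tests' imports.
from openhands.core.logger import openhands_logger as logger
from openhands.events.action import AgentFinishAction, MessageAction
from openhands.events.observation import AgentDelegateObservation


def event_content(event) -> str | None:
    """Return the searchable text of a message-like event, or None if unsupported."""
    try:
        if isinstance(event, AgentDelegateObservation):
            return event.content
        if isinstance(event, AgentFinishAction):
            return event.outputs.get('content', '')
        if isinstance(event, MessageAction):
            return event.content
        logger.warning(f'Unexpected event type: {type(event)}')
    except Exception as e:  # a malformed event should not abort verification
        logger.error(f'Error processing event: {e}')
    return None
```

With a helper like this, each test's loop reduces to checking `event_content(event)` for its expected phrase.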