feat(prow-job): add analyze-install-failure command with metal support #79
92862c1 to fc2ff94
- **Metal Jobs**: Identified by "metal" in the job name. These jobs automatically invoke the specialized `prow-job-analyze-metal-install-failure` skill.
- **Metal Artifacts**: Metal jobs analyze dev-scripts logs, libvirt console logs, sosreport, and squid logs
- **Artifacts Location**: All downloaded artifacts are cached in `.work/prow-job-analyze-install-failure/{build_id}/` for faster re-analysis
- **gcloud Requirement**: Requires gcloud CLI to be installed to access GCS buckets
Is this correct? The buckets are public so you can curl.
Claude seems to do better with the CLI tools (it can search all subdirectories for files more easily), and it's what others are using for GCS buckets in ai-helpers.
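For reference, the two access routes discussed in this thread can be sketched in Python. This is only an illustration: the helper names are hypothetical, the bucket name `test-platform-results` is taken from the gcsweb link elsewhere in this conversation, and it assumes public objects are served under `https://storage.googleapis.com/`. It is not code from the PR.

```python
from urllib.parse import quote

# Assumption: bucket name as it appears in the gcsweb URL in this conversation.
BUCKET = "test-platform-results"


def public_url(object_path: str, bucket: str = BUCKET) -> str:
    """HTTPS URL for one public object (the 'you can curl' route)."""
    # quote() leaves '/' unescaped by default, so the path layout is preserved.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_path)}"


def gcloud_ls_command(prefix: str, bucket: str = BUCKET) -> list[str]:
    """argv for a recursive listing via the gcloud CLI (the route the skill asks for)."""
    return ["gcloud", "storage", "ls", "--recursive", f"gs://{bucket}/{prefix}"]
```

The URL route fetches one known object at a time, while the recursive listing is what makes searching all subdirectories straightforward, which matches the reasoning given in the reply above.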
plugins/prow-job/skills/prow-job-analyze-install-failure/SKILL.md
This looks cool! I'd be really keen to see some real-life examples of this being used :D With a long skill like this, I'm always curious about how changes to the prompt, markdown headings, etc. impact output. This might be a decent example of a problem where we care about evals? Although we'll usually have a human in the loop, so not as much as something like API review, I suppose.
Add command to analyze installation failures in Prow CI jobs with dedicated metal skill for bare metal debugging. Main skill handles installer logs and log bundles, while metal skill analyzes dev-scripts logs, libvirt console logs, sosreport, and squid proxy logs.

Squashed from 2 commits:
- 7c45e35 feat(prow-job): add analyze-install-failure command
- 00f828c docs(prow-job): create dedicated metal install failure skill
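The split between the main skill and the metal skill, together with the cache location quoted in the reviewed notes, can be sketched roughly as follows. The function names are hypothetical, and the case-insensitive substring check is an assumption: the notes only say that metal jobs are identified by "metal" in the job name.

```python
from pathlib import Path

# Cache root as quoted in the reviewed SKILL.md notes.
CACHE_ROOT = Path(".work/prow-job-analyze-install-failure")


def is_metal_job(job_name: str) -> bool:
    # Assumption: plain case-insensitive substring match on the job name.
    return "metal" in job_name.lower()


def select_skill(job_name: str) -> str:
    """Pick the skill name that would handle this job."""
    if is_metal_job(job_name):
        return "prow-job-analyze-metal-install-failure"
    return "prow-job-analyze-install-failure"


def cache_dir(build_id: str) -> Path:
    """Per-build directory where downloaded artifacts are cached."""
    return CACHE_ROOT / build_id
```

Keying the cache on the build ID is what lets a second analysis of the same run skip the download step.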
7dd840c to 9ce5f1b
4. **Download Installer Logs**: Get `.openshift_install*.log` files that contain the installation timeline

5. **Download Log Bundle**: Get `log-bundle-*.tar` containing:
Shall we also mention downloading all the files in the cloudapi_output-* folder? Sometimes there are useful logs to check there.
Hm, I don't see anything in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn/1983669156272148480/artifacts/e2e-azure-ovn/ipi-install-install/artifacts/clusterapi_output-1761783362/ that would be helpful. The installer log already contains CAPI failures if they happen.
I'd rather just let a CAPI expert on the installer team add more to these skills after it merges, if they think it's useful.
Those manifests are also included in the gather log bundle, so I think we are covered here.
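As a rough illustration of the two download steps quoted in the diff above, the file patterns `.openshift_install*.log` and `log-bundle-*.tar` can be applied to a listing of artifact paths as shown here. The function name and the dictionary keys are hypothetical, and the real skill may select files differently.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Patterns copied from the reviewed SKILL.md steps; the keys are illustrative labels.
PATTERNS = {
    "installer_logs": ".openshift_install*.log",
    "log_bundles": "log-bundle-*.tar",
}


def pick_artifacts(paths):
    """Group artifact paths by which pattern their file name matches."""
    picked = {kind: [] for kind in PATTERNS}
    for path in paths:
        name = PurePosixPath(path).name  # match on the file name only
        for kind, pattern in PATTERNS.items():
            if fnmatch(name, pattern):
                picked[kind].append(path)
    return picked
```

Anything that matches neither pattern, such as files under a `clusterapi_output-*` directory, is simply left out of the result.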
/lgtm
@patrickdillon: changing LGTM is restricted to collaborators
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon, stbenjam
What this PR does / why we need it:
Analyze OpenShift installation failures in Prow CI jobs by examining installer logs, log bundles, and sosreports. I'm trying to write down everything I know about debugging installer failures, in effect stbenjam-install-debugging-as-a-service.
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
Checklist: