@stbenjam
Member

What this PR does / why we need it:

Adds a command that analyzes OpenShift installation failures in Prow CI jobs by examining installer logs, log bundles, and sosreports. This is an attempt to write down everything I know about debugging installer failures, i.e. stbenjam-install-debugging-as-a-service.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2025
@stbenjam stbenjam force-pushed the install-failures branch 2 times, most recently from 92862c1 to fc2ff94 Compare October 29, 2025 12:13
- **Metal Jobs**: Identified by "metal" in the job name. These jobs automatically invoke the specialized `prow-job-analyze-metal-install-failure` skill.
- **Metal Artifacts**: Metal jobs analyze dev-scripts logs, libvirt console logs, sosreport, and squid logs
- **Artifacts Location**: All downloaded artifacts are cached in `.work/prow-job-analyze-install-failure/{build_id}/` for faster re-analysis
- **gcloud Requirement**: Requires gcloud CLI to be installed to access GCS buckets

Is this correct? The buckets are public so you can curl.

Member Author

@stbenjam stbenjam Oct 29, 2025

Claude seems to do better with the CLI tools (it can search all subdirectories for files more easily), and it's what others are using for GCS buckets in ai-helpers.
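To make the curl-versus-CLI trade-off in this thread concrete, here is a minimal sketch that turns a gcsweb link into the two addressing forms under discussion. It assumes the `/gcs/<bucket>/<path>` layout seen in the gcsweb link posted later in this conversation, and that the bucket is publicly readable as the reviewer says. The function names are hypothetical and are not part of the skill.

```python
from urllib.parse import urlparse

# Path prefix used by gcsweb links (assumption based on the link in this thread).
GCSWEB_PREFIX = "/gcs/"


def split_gcsweb_url(url: str) -> tuple:
    """Split a gcsweb URL into (bucket, object_path)."""
    path = urlparse(url).path
    if not path.startswith(GCSWEB_PREFIX):
        raise ValueError(f"not a gcsweb URL: {url}")
    bucket, _, object_path = path[len(GCSWEB_PREFIX):].partition("/")
    return bucket, object_path


def gs_uri(url: str) -> str:
    """gs:// form, the kind of address the gcloud CLI works with."""
    bucket, object_path = split_gcsweb_url(url)
    return f"gs://{bucket}/{object_path}"


def public_https_url(url: str) -> str:
    """Plain HTTPS form for one object in a publicly readable bucket."""
    bucket, object_path = split_gcsweb_url(url)
    return f"https://storage.googleapis.com/{bucket}/{object_path}"
```

The HTTPS form addresses a single object, so it suits curl when the exact file path is already known. The `gs://` form can be handed to a recursive listing such as `gcloud storage ls --recursive`, which matches the point above about searching subdirectories.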

@theobarberbany
Contributor

This looks cool! I'd be really keen to see some real life examples of this being used :D

With a long skill like this, I'm always curious about how changes to the prompt, markdown headings, etc. impact the output. This might be a decent example of a problem where we care about evals? Although we'll usually have a human in the loop, so perhaps not as much as for something like API review.

Add command to analyze installation failures in Prow CI jobs with dedicated
metal skill for bare metal debugging. Main skill handles installer logs and
log bundles, while metal skill analyzes dev-scripts logs, libvirt console
logs, sosreport, and squid proxy logs.

Squashed from 2 commits:
- 7c45e35 feat(prow-job): add analyze-install-failure command
- 00f828c docs(prow-job): create dedicated metal install failure skill
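The split between the main skill and the metal skill, together with the cache location quoted in the diff excerpt earlier in this conversation, can be sketched roughly as follows. This is an illustration only: the helper names are hypothetical, the main skill's name is assumed from the cache directory name, and the substring check is case-sensitive because the PR text does not say otherwise.

```python
from pathlib import Path

# Cache root quoted in the PR's notes: .work/prow-job-analyze-install-failure/{build_id}/
WORK_ROOT = Path(".work/prow-job-analyze-install-failure")

METAL_SKILL = "prow-job-analyze-metal-install-failure"  # named in the PR
MAIN_SKILL = "prow-job-analyze-install-failure"         # assumed name, not confirmed


def is_metal_job(job_name: str) -> bool:
    """Metal jobs are identified by "metal" in the job name."""
    return "metal" in job_name


def pick_skill(job_name: str) -> str:
    """Route metal jobs to the dedicated metal skill, everything else to the main one."""
    return METAL_SKILL if is_metal_job(job_name) else MAIN_SKILL


def cache_dir(build_id: str) -> Path:
    """Directory where downloaded artifacts for one build are cached."""
    return WORK_ROOT / build_id
```

Keying the cache on the build ID is what makes re-analysis cheap: a second run for the same build can check whether `cache_dir(build_id)` already exists before downloading anything.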

4. **Download Installer Logs**: Get `.openshift_install*.log` files that contain the installation timeline

5. **Download Log Bundle**: Get `log-bundle-*.tar` containing:

@jianlinliu jianlinliu Oct 30, 2025

Should we also mention downloading all the files in the clusterapi_output-* folder? Sometimes there are useful logs in there to check.

Member Author

Hm, I don't see anything in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn/1983669156272148480/artifacts/e2e-azure-ovn/ipi-install-install/artifacts/clusterapi_output-1761783362/ that would be helpful. The installer log already contains the CAPI failures if they happen.

I'd rather just let a CAPI expert on the installer team add more to these skills after it merges, if they think it's useful.

Those manifests are also included in the gather log bundle, so I think we are covered here.
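For reference, the two artifact names quoted from the skill in this thread are glob patterns, and a small sketch shows how a list of artifact paths could be sorted by them. The `classify` helper and the category labels are hypothetical. The clusterapi_output-* folder is deliberately not matched, following the conclusion above that its manifests are already in the log bundle.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Optional

# File-name patterns quoted in the skill's steps 4 and 5.
PATTERNS = {
    "installer_log": ".openshift_install*.log",
    "log_bundle": "log-bundle-*.tar",
}


def classify(path: str) -> Optional[str]:
    """Return the artifact kind for a path, or None if it matches no pattern."""
    name = PurePosixPath(path).name  # match on the file name only
    for kind, pattern in PATTERNS.items():
        if fnmatchcase(name, pattern):
            return kind
    return None
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive on every platform.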

@patrickdillon

/lgtm

@openshift-ci

openshift-ci bot commented Oct 31, 2025

@patrickdillon: changing LGTM is restricted to collaborators

In response to this:

/lgtm


@openshift-ci

openshift-ci bot commented Oct 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon, stbenjam

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam stbenjam added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 6928b14 into openshift-eng:main Oct 31, 2025
3 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

4 participants