Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions PLUGINS.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,7 @@ See [plugins/openshift/README.md](plugins/openshift/README.md) for detailed docu
A plugin to analyze and inspect Prow CI job results

**Commands:**
- **`/prow-job:analyze-install-failure` `<prowjob-url>`** - Analyze OpenShift installation failures in Prow CI jobs
- **`/prow-job:analyze-resource` `prowjob-url resource-name`** - Analyze Kubernetes resource lifecycle in Prow job artifacts
- **`/prow-job:analyze-test-failure` `prowjob-url test-name`** - Analyzes test errors from console logs and Prow CI job artifacts
- **`/prow-job:extract-must-gather` `prowjob-url`** - Extract and decompress must-gather archives from Prow job artifacts
Expand Down
145 changes: 145 additions & 0 deletions plugins/prow-job/commands/analyze-install-failure.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
---
description: Analyze OpenShift installation failures in Prow CI jobs
argument-hint: <prowjob-url>
---

## Name
prow-job:analyze-install-failure

## Synopsis
```
/prow-job:analyze-install-failure <prowjob-url>
```

## Description

The `prow-job:analyze-install-failure` command analyzes OpenShift installation failures in Prow CI jobs by downloading and examining installer logs, log bundles, and sosreports (for metal jobs). This command is specifically designed to debug failures in the **"install should succeed: overall"** test, which indicates that the installation process failed at some stage.

**Important**: All "install should succeed" tests have a specific suffix indicating the failure stage (configuration, infrastructure, cluster bootstrap, cluster creation, cluster operator stability, or other). The JUnit XML contains both the specific failure reason test (which fails) and the overall test (which also fails when any stage fails). This command analyzes the specific failure stage to provide targeted diagnostics.

The command accepts:
- Prow job URL (required): URL to the failed CI job from prow.ci.openshift.org

It downloads relevant artifacts from Google Cloud Storage, analyzes them, and generates a comprehensive report with findings and recommended next steps.

### Recognized Failure Modes

The command identifies the failure mode from `junit_install.xml` and tailors its analysis:

- **"install should succeed: configuration"** - Extremely rare failure where install-config.yaml validation failed. Focus on installer log only.
- **"install should succeed: infrastructure"** - Failed to create cloud resources. Usually due to cloud quota, rate limiting, or outages. Log bundle may not exist.
- **"install should succeed: cluster bootstrap"** - Bootstrap node failed to start temporary control plane. Check bootkube logs in the bundle.
- **"install should succeed: cluster creation"** - One or more operators unable to deploy. Check operator logs in gather-must-gather.
- **"install should succeed: cluster operator stability"** - Operators never stabilized (stuck progressing/degraded). Check operator status and logs.
- **"install should succeed: other"** - Unknown failure requiring comprehensive analysis of all available logs.

## Implementation

The command performs the following steps by invoking the "Prow Job Analyze Install Failure" skill:

1. **Parse Job URL**: Extract build ID and job details from the Prow URL

2. **Download prowjob.json**: Identify the ci-operator target

3. **Download JUnit XML**: Identify the specific failure mode (configuration, infrastructure, cluster bootstrap, etc.)

4. **Download Installer Logs**: Get `.openshift_install*.log` files that contain the installation timeline

5. **Download Log Bundle**: Get `log-bundle-*.tar` containing:
Copy link

@jianlinliu jianlinliu Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also mention to download all the files in cloudapi_output-* folder ? Sometimes, there is also some useful log to check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't see anything in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn/1983669156272148480/artifacts/e2e-azure-ovn/ipi-install-install/artifacts/clusterapi_output-1761783362/ that would be helpful - the installer log has CAPI failures in them if they happen

I'd rather just let a CAPI expert on the installer team add more to these skills if they think its useful after it merges

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those manifests are also included in the gather log bundle, so I think we are covered here.

- Bootstrap node journals (bootkube, kubelet, crio, etc.)
- Serial console logs from all nodes
- Cluster API resources (etcd, kube-apiserver logs)
- Failed systemd units list

6. **Invoke Metal Skill** (metal jobs only): Use the specialized `prow-job-analyze-metal-install-failure` skill to analyze:
- Dev-scripts setup logs (installation framework)
- libvirt console logs (VM/node boot sequence)
- sosreport (hypervisor diagnostics)
- squid logs (proxy logs for disconnected environments)

7. **Analyze Logs**: Extract key failure indicators based on failure mode:
- **configuration**: install-config.yaml validation errors
- **infrastructure**: Cloud quota/rate limit/API errors
- **cluster bootstrap**: bootkube/etcd/API server failures
- **cluster creation**: Operator deployment failures
- **cluster operator stability**: Operators stuck in unstable state
- **other**: Comprehensive analysis of all logs

8. **Generate Report**: Create comprehensive analysis with:
- Failure mode and summary
- Timeline of events
- Key error messages with context
- Failure mode-specific recommended steps
- Artifact locations

The skill handles all the implementation details including URL parsing, artifact downloading, archive extraction, log analysis, and report generation.

## Return Value
- **Success**: Comprehensive analysis report saved to `.work/prow-job-analyze-install-failure/{build_id}/analysis/report.txt`
- **Error**: Error message explaining the issue (invalid URL, gcloud not installed, artifacts not found, etc.)

**Important for Claude**:
1. Parse the Prow job URL to extract the build ID and job name
2. Invoke the "prow-job:analyze-install-failure" skill with the job details
3. The skill will download all relevant artifacts and analyze them
4. For metal jobs, the skill automatically invokes the specialized metal install failure skill
5. Present the analysis report to the user with clear findings
6. Provide actionable next steps based on the failure mode
7. Highlight critical errors and their context

## Examples

1. **Analyze an AWS installation failure**:
```
/prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview/1983307151598161920
```
Expected output:
- Downloads installer logs and log bundle
- Identifies failure mode from junit_install.xml
- Analyzes bootstrap and installation logs
- Reports: "Bootstrap failed due to etcd cluster formation timeout"
- Provides etcd logs and recommendations

2. **Analyze a metal installation failure**:
```
/prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6/1983304069657137152
```
Expected output:
- Invokes specialized metal install failure skill
- Downloads dev-scripts logs, libvirt console logs, sosreport
- Analyzes dev-scripts setup and VM console logs
- Reports: "Bootstrap VM failed to boot - Ignition config fetch failed"
- Provides console logs and dev-scripts analysis

3. **Analyze an infrastructure failure**:
```
/prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift/installer/12345/pull-ci-openshift-installer-master-e2e-aws/7890
```
Expected output:
- Identifies "install should succeed: infrastructure" failure
- Focuses on installer log (no log bundle expected)
- Reports: "Cloud quota exceeded for instance type m5.xlarge"
- Recommends checking quota limits

4. **Analyze an operator stability failure**:
```
/prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-gcp/1234567890123456789
```
Expected output:
- Identifies "install should succeed: cluster operator stability" failure
- Checks gather-must-gather for operator logs
- Reports: "kube-apiserver operator stuck progressing"
- Provides operator logs and status conditions

## Notes

- **Failure Modes**: The installer has multiple failure modes detected from junit_install.xml. Each mode requires different analysis approaches.
- **Log Bundle**: Contains detailed node-level diagnostics including journals, serial consoles, and cluster API resources
- **Metal Jobs**: Identified by "metal" in the job name. These jobs automatically invoke the specialized `prow-job-analyze-metal-install-failure` skill.
- **Metal Artifacts**: Metal jobs analyze dev-scripts logs, libvirt console logs, sosreport, and squid logs
- **Artifacts Location**: All downloaded artifacts are cached in `.work/prow-job-analyze-install-failure/{build_id}/` for faster re-analysis
- **gcloud Requirement**: Requires gcloud CLI to be installed to access GCS buckets

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? The buckets are public so you can curl.

Copy link
Member Author

@stbenjam stbenjam Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude seems to do better with the CLI tools (it can search all subdirectories for files easier) and its what others are using for gcs buckets in ai-helpers

- **Public Access**: The test-platform-results bucket is publicly accessible - no authentication needed

## Arguments
- **$1** (prowjob-url): The Prow job URL from prow.ci.openshift.org (required)
Loading