Commit 92862c1

feat(prow-job): add analyze-install-failure command with metal support
Add command to analyze installation failures in Prow CI jobs with dedicated metal skill for bare metal debugging. Main skill handles installer logs and log bundles, while metal skill analyzes dev-scripts logs, libvirt console logs, sosreport, and squid proxy logs.

Squashed from 2 commits:
- 7c45e35 feat(prow-job): add analyze-install-failure command
- 00f828c docs(prow-job): create dedicated metal install failure skill
1 parent a5d4a9b commit 92862c1

3 files changed: +1104 -0 lines changed

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
---
description: Analyze OpenShift installation failures in Prow CI jobs
argument-hint: <prowjob-url>
---

## Name
prow-job:analyze-install-failure

## Synopsis
```
/prow-job:analyze-install-failure <prowjob-url>
```

## Description

The `prow-job:analyze-install-failure` command analyzes OpenShift installation failures in Prow CI jobs by downloading and examining installer logs, log bundles, and, for metal jobs, sosreports. It is designed to debug failures in the **"install should succeed: overall"** test, which indicates that the installation failed at some stage.

**Important**: Every "install should succeed" test carries a suffix naming the failure stage (configuration, infrastructure, cluster bootstrap, cluster creation, cluster operator stability, or other). When a stage fails, the JUnit XML contains two failing tests: the stage-specific test and the overall test, which fails whenever any stage fails. This command analyzes the specific failure stage to provide targeted diagnostics.

The command accepts:
- Prow job URL (required): URL to the failed CI job from prow.ci.openshift.org

It downloads relevant artifacts from Google Cloud Storage, analyzes them, and generates a comprehensive report with findings and recommended next steps.

### Recognized Failure Modes

The command identifies the failure mode from `junit_install.xml` and tailors its analysis:

- **"install should succeed: configuration"** - Extremely rare failure where `install-config.yaml` validation failed. Focus on installer log only.
- **"install should succeed: infrastructure"** - Failed to create cloud resources. Usually due to cloud quota, rate limiting, or outages. Log bundle may not exist.
- **"install should succeed: cluster bootstrap"** - Bootstrap node failed to start temporary control plane. Check bootkube logs in the bundle.
- **"install should succeed: cluster creation"** - One or more operators unable to deploy. Check operator logs in gather-must-gather.
- **"install should succeed: cluster operator stability"** - Operators never stabilized (stuck progressing/degraded). Check operator status and logs.
- **"install should succeed: other"** - Unknown failure requiring comprehensive analysis of all available logs.
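
As a rough illustration of how the failure mode might be read from `junit_install.xml`, the sketch below looks for a failed stage-specific test case and maps it to the analysis focus listed above. It assumes a conventional JUnit layout (`<testcase name="...">` elements with a `<failure>` child when the test failed). The names `detect_failure_mode` and `analysis_focus` and the sample XML are hypothetical and are not taken from the skill itself.

```python
import xml.etree.ElementTree as ET

PREFIX = "install should succeed: "

# Analysis focus per failure mode, paraphrased from the list above.
FOCUS = {
    "configuration": "installer log only (install-config.yaml validation)",
    "infrastructure": "installer log; log bundle may not exist",
    "cluster bootstrap": "bootkube logs in the log bundle",
    "cluster creation": "operator logs in gather-must-gather",
    "cluster operator stability": "operator status and logs",
    "other": "all available logs",
}


def detect_failure_mode(junit_xml):
    """Return the failed stage suffix, or None if no stage-specific test failed."""
    root = ET.fromstring(junit_xml)
    for case in root.iter("testcase"):
        name = case.get("name", "")
        if not name.startswith(PREFIX):
            continue
        mode = name[len(PREFIX):]
        # "overall" is not a key in FOCUS, so it is skipped here: it fails
        # whenever any stage fails and carries no stage information.
        if mode in FOCUS and case.find("failure") is not None:
            return mode
    return None


def analysis_focus(mode):
    """Fall back to the comprehensive 'other' analysis for unknown modes."""
    return FOCUS.get(mode, FOCUS["other"])


# Hypothetical sample input, for illustration only.
SAMPLE = """<testsuite name="cluster install">
  <testcase name="install should succeed: cluster bootstrap">
    <failure>example failure text</failure>
  </testcase>
  <testcase name="install should succeed: overall">
    <failure>example failure text</failure>
  </testcase>
</testsuite>"""

mode = detect_failure_mode(SAMPLE)
print(mode, "->", analysis_focus(mode))
```

Keeping the mode-to-focus mapping in a single table means an unrecognized suffix degrades to the comprehensive "other" analysis rather than failing outright.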

## Implementation

The command performs the following steps by invoking the "Prow Job Analyze Install Failure" skill:

1. **Parse Job URL**: Extract build ID and job details from the Prow URL

2. **Download prowjob.json**: Identify the ci-operator target

3. **Download JUnit XML**: Identify the specific failure mode (configuration, infrastructure, cluster bootstrap, etc.)

4. **Download Installer Logs**: Get `.openshift_install*.log` files that contain the installation timeline

5. **Download Log Bundle**: Get `log-bundle-*.tar` containing:
   - Bootstrap node journals (bootkube, kubelet, crio, etc.)
   - Serial console logs from all nodes
   - Cluster API resources (etcd, kube-apiserver logs)
   - Failed systemd units list

6. **Invoke Metal Skill** (metal jobs only): Use the specialized `prow-job-analyze-metal-install-failure` skill to analyze:
   - Dev-scripts setup logs (installation framework)
   - libvirt console logs (VM/node boot sequence)
   - sosreport (hypervisor diagnostics)
   - squid logs (proxy logs for disconnected environments)

7. **Analyze Logs**: Extract key failure indicators based on failure mode:
   - **configuration**: install-config.yaml validation errors
   - **infrastructure**: Cloud quota/rate limit/API errors
   - **cluster bootstrap**: bootkube/etcd/API server failures
   - **cluster creation**: Operator deployment failures
   - **cluster operator stability**: Operators stuck in unstable state
   - **other**: Comprehensive analysis of all logs

8. **Generate Report**: Create comprehensive analysis with:
   - Failure mode and summary
   - Timeline of events
   - Key error messages with context
   - Failure mode-specific recommended steps
   - Artifact locations

The skill handles all the implementation details including URL parsing, artifact downloading, archive extraction, log analysis, and report generation.
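
The archive-extraction step could look roughly like the following. This is a minimal sketch using Python's standard `tarfile` module, not the skill's actual code. The helper name `extract_log_bundle` is hypothetical, and nothing is assumed about the bundle's internal layout beyond it being a tar archive.

```python
import tarfile
from pathlib import Path


def extract_log_bundle(bundle_path, dest_dir):
    """Extract a log-bundle tar into dest_dir and return the files found there."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(bundle_path) as tar:
        for member in tar.getmembers():
            # Refuse entries that would land outside dest
            # (for example "../x" or an absolute path).
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    # Sorted so that callers get a stable order for reporting.
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

Checking member paths before extracting is a precaution for archives pulled from CI storage. The returned list can then be filtered for the journals and console logs relevant to the detected failure mode.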

## Return Value
- **Success**: Comprehensive analysis report saved to `.work/prow-job-analyze-install-failure/{build_id}/analysis/report.txt`
- **Error**: Error message explaining the issue (invalid URL, gcloud not installed, artifacts not found, etc.)

**Important for Claude**:
1. Parse the Prow job URL to extract the build ID and job name
2. Invoke the "Prow Job Analyze Install Failure" skill with the job details
3. The skill will download all relevant artifacts and analyze them
4. For metal jobs, the skill automatically invokes the specialized metal install failure skill
5. Present the analysis report to the user with clear findings
6. Provide actionable next steps based on the failure mode
7. Highlight critical errors and their context
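
For the URL-parsing step, a minimal sketch is shown here. It assumes URLs of the shape used in the Examples section, `https://prow.ci.openshift.org/view/gs/<bucket>/.../<job-name>/<build-id>`. Deriving the `gs://` URI by dropping the `/view/gs/` prefix is likewise an assumption, and `parse_prow_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse

VIEW_PREFIX = "/view/gs/"


def parse_prow_url(url):
    """Split a Prow job URL into bucket, job name, build ID, and a gs:// URI."""
    path = urlparse(url).path
    if not path.startswith(VIEW_PREFIX):
        raise ValueError(f"not a Prow job view URL: {url}")
    gcs_path = path[len(VIEW_PREFIX):].strip("/")
    parts = gcs_path.split("/")
    # Expect at least <bucket>/<job-name>/<build-id>, with a numeric build ID.
    if len(parts) < 3 or not parts[-1].isdigit():
        raise ValueError(f"cannot find job name and build ID in: {url}")
    return {
        "bucket": parts[0],
        "job_name": parts[-2],
        "build_id": parts[-1],
        "gcs_uri": f"gs://{gcs_path}",
    }


info = parse_prow_url(
    "https://prow.ci.openshift.org/view/gs/test-platform-results/logs/"
    "periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview/"
    "1983307151598161920"
)
print(info["job_name"], info["build_id"])
```

Rejecting malformed URLs early corresponds to the "invalid URL" error case listed under Return Value.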

## Examples

1. **Analyze an AWS installation failure**:
   ```
   /prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview/1983307151598161920
   ```
   Expected output:
   - Downloads installer logs and log bundle
   - Identifies failure mode from junit_install.xml
   - Analyzes bootstrap and installation logs
   - Reports: "Bootstrap failed due to etcd cluster formation timeout"
   - Provides etcd logs and recommendations

2. **Analyze a metal installation failure**:
   ```
   /prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6/1983304069657137152
   ```
   Expected output:
   - Invokes specialized metal install failure skill
   - Downloads dev-scripts logs, libvirt console logs, sosreport
   - Analyzes dev-scripts setup and VM console logs
   - Reports: "Bootstrap VM failed to boot - Ignition config fetch failed"
   - Provides console logs and dev-scripts analysis

3. **Analyze an infrastructure failure**:
   ```
   /prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift/installer/12345/pull-ci-openshift-installer-master-e2e-aws/7890
   ```
   Expected output:
   - Identifies "install should succeed: infrastructure" failure
   - Focuses on installer log (no log bundle expected)
   - Reports: "Cloud quota exceeded for instance type m5.xlarge"
   - Recommends checking quota limits

4. **Analyze an operator stability failure**:
   ```
   /prow-job:analyze-install-failure https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-gcp/1234567890123456789
   ```
   Expected output:
   - Identifies "install should succeed: cluster operator stability" failure
   - Checks gather-must-gather for operator logs
   - Reports: "kube-apiserver operator stuck progressing"
   - Provides operator logs and status conditions

## Notes

- **Failure Modes**: Installation failures fall into several modes, identified from `junit_install.xml`. Each mode requires a different analysis approach.
- **Log Bundle**: Contains detailed node-level diagnostics including journals, serial consoles, and cluster API resources
- **Metal Jobs**: Identified by "metal" in the job name. These jobs automatically invoke the specialized `prow-job-analyze-metal-install-failure` skill.
- **Metal Artifacts**: For metal jobs, the analysis covers dev-scripts logs, libvirt console logs, sosreport, and squid logs
- **Artifacts Location**: All downloaded artifacts are cached in `.work/prow-job-analyze-install-failure/{build_id}/` for faster re-analysis
- **gcloud Requirement**: The gcloud CLI must be installed to access GCS buckets
- **Public Access**: The test-platform-results bucket is publicly accessible; no authentication is needed
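
The metal-job check and the cache layout described in these notes can be sketched as below. The helper names (`is_metal_job`, `work_dir`, `report_path`) are hypothetical, and matching "metal" case-insensitively is an assumption; the notes only say the job name contains "metal".

```python
from pathlib import Path

# Cache root taken from the Artifacts Location note.
WORK_ROOT = Path(".work/prow-job-analyze-install-failure")


def is_metal_job(job_name):
    """Metal jobs are identified by 'metal' appearing in the job name."""
    return "metal" in job_name.lower()


def work_dir(build_id):
    """Per-build cache directory for downloaded artifacts."""
    return WORK_ROOT / str(build_id)


def report_path(build_id):
    """Location of the final report, as given under Return Value."""
    return work_dir(build_id) / "analysis" / "report.txt"


job = "periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6"
print(is_metal_job(job), report_path("1983304069657137152").as_posix())
```

Because the cache directory is keyed by build ID, a re-run for the same job can check for existing files before downloading again.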

## Arguments
- **$1** (prowjob-url): The Prow job URL from prow.ci.openshift.org (required)
