Add a plugin for analyzing health of components regression and triage response #114

dgoodwin · 2025-10-31T14:49:26Z

Analyzes regression data for the GA window of a given release, optionally for a specific set of components. Attempts to give a grade based on some parameters we'll probably have to adjust over time. Looks at things like the number of regressions, how many got triaged to a Jira, how long it took for them to get triaged, and how long it took for them to be closed.
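As a rough illustration of the metrics described above, here is a minimal sketch of how they could be computed from a list of regression records. The `Regression` record shape, its field names, and the `summarize` helper are assumptions made for this example, not the plugin's actual data model or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Regression:
    # Hypothetical record shape; the real API fields may differ.
    component: str
    opened: datetime
    triaged: Optional[datetime] = None  # when a Jira was linked, if ever
    closed: Optional[datetime] = None   # None while the regression is still open


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def summarize(regressions: list[Regression]) -> dict:
    """Count regressions and average the time to triage and time to close."""
    total = len(regressions)
    triage_hours = [hours_between(r.opened, r.triaged) for r in regressions if r.triaged]
    close_hours = [hours_between(r.opened, r.closed) for r in regressions if r.closed]
    return {
        "total": total,
        "triaged": len(triage_hours),
        "open": sum(1 for r in regressions if r.closed is None),
        "triage_pct": round(100 * len(triage_hours) / total, 1) if total else None,
        "avg_hours_to_triage": round(sum(triage_hours) / len(triage_hours)) if triage_hours else None,
        "avg_hours_to_close": round(sum(close_hours) / len(close_hours)) if close_hours else None,
    }


# Made-up sample data: one triaged and closed, one closed untriaged, one still open.
sample = [
    Regression("Etcd", datetime(2025, 6, 1), datetime(2025, 6, 3), datetime(2025, 6, 6)),
    Regression("Etcd", datetime(2025, 6, 10), None, datetime(2025, 6, 11)),
    Regression("Etcd", datetime(2025, 7, 1)),
]
summary = summarize(sample)
print(summary)
```

Averages are taken only over regressions that actually have a triage or close timestamp, which is presumably why some scorecard cells show `-` rather than a number.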

The hope is for this plugin to expand to consider Jira data and other factors.

Concerns raised here are not a direct critique of teams; this is a tool we want to use to see whether we're improving as we scale up efforts to improve product stability.

It now offers nice conversational options, such as listing open regressions for a component or comparing how a component did across two releases.

Sample command output for 4.20 overall:

Component Health Report: Release 4.20

Release Window: 2025-05-02 to 2025-10-21 (GA'd)
Total Regressions: 321 (76 suspected infra regressions filtered out)

Overall Health Grade: ❌ Poor

Metric	Score	Grade	Details
Triage Coverage	31.8%	❌ Poor	Only 102 of 321 regressions triaged
Triage Timeliness	87 hrs avg	⚠️ Needs Improvement	~3.6 days to triage (Max: 1657 hrs)
Resolution Speed	142 hrs avg	✅ Good	~5.9 days to close (Max: 2644 hrs)

Regression Breakdown

Status	Total	Triaged	Triage %
Open	5	0	0.0%
Closed	316	102	32.3%

Critical: 5 open regressions remain untriaged and need immediate attention.

Component Health Scorecard

Components ranked from best to worst health:

Component	Total	Triage Coverage	Triage Time (hrs)	Resolution Time (hrs)	Open	Health Grade
Cloud Compute / Cloud Controller Manager	1	100.0%	138	168	0	✅ Excellent
OLM	1	100.0%	-	-	0	✅ Excellent
Pod Autoscaler	1	100.0%	193	252	0	⚠️ Good
Build	5	80.0%	138	160	0	⚠️ Good
Etcd	3	66.7%	1657	1217	0	❌ Poor (slow triage)
Unknown	3	66.7%	28	72	0	⚠️ Good
kube-apiserver	6	66.7%	8	182	0	⚠️ Good
Node / Kubelet	8	62.5%	26	86	0	⚠️ Good
HyperShift	74	58.1%	48	244	0	⚠️ Needs Improvement
Networking / cluster-network-operator	25	56.0%	46	103	2	⚠️ Needs Improvement
Cluster Version Operator	2	50.0%	46	50	0	⚠️ Needs Improvement
Test Framework	7	42.9%	58	289	0	⚠️ Needs Improvement
kube-controller-manager	3	33.3%	-	56	0	❌ Poor
Monitoring	28	25.0%	41	81	0	❌ Poor
Installer / openshift-installer	49	22.4%	80	147	0	❌ Poor
Storage	18	5.6%	39	52	0	❌ Poor
Networking / router	21	4.8%	93	61	1	❌ Poor
Cloud Compute / Unknown	12	0.0%	-	10	0	❌ Poor
Cloud Credential Operator	7	0.0%	-	10	0	❌ Poor
Cluster Autoscaler	6	0.0%	-	11	0	❌ Poor
Image Registry	8	0.0%	-	21	0	❌ Poor
Management Console	6	0.0%	-	10	0	❌ Poor
Networking / ovn-kubernetes	2	0.0%	-	-	2	❌ Poor
oauth-apiserver	23	0.0%	-	70	0	❌ Poor
openshift-controller-manager / apps	2	0.0%	-	-	0	❌ Poor

Components Needing Immediate Attention

Open Untriaged Regressions (Action Required):

Networking / cluster-network-operator: 2 open untriaged regressions
Networking / ovn-kubernetes: 2 open untriaged regressions
Networking / router: 1 open untriaged regression

Components with Zero Triage Coverage:
10 components had regressions but 0% were triaged:

oauth-apiserver (23 regressions)
Cloud Compute / Unknown (12 regressions)
Image Registry (8 regressions)
Cloud Credential Operator (7 regressions)
Cluster Autoscaler (6 regressions)
Management Console (6 regressions)
Networking / ovn-kubernetes (2 regressions)
openshift-controller-manager / apps (2 regressions)

Components with Poor Triage Coverage (<50%):

Installer / openshift-installer: 22.4% (49 total regressions)
Monitoring: 25.0% (28 total regressions)
Storage: 5.6% (18 total regressions)
Networking / router: 4.8% (21 total regressions)

Components with Very Slow Triage Times:

Etcd: 1657 hours average (~69 days) - extremely slow
Pod Autoscaler: 193 hours average (~8 days)
Build: 138 hours average (~5.75 days)

Key Findings

Strengths:

Good overall resolution speed (~6 days average)
Release has shipped with only 5 open regressions remaining
Several small components achieved 100% triage coverage

Critical Issues:

Very poor triage coverage: Only 31.8% of regressions were triaged
- 219 regressions (68.2%) were never linked to JIRA bugs
- 10 components had 0% triage coverage
Open regressions remain untriaged: 5 open regressions need immediate attention
Inconsistent triage practices: Wide variance between components (0% to 100%)
Slow triage for some components: Etcd took 69 days average to triage

Impact:

Difficult to track accountability for fixes
Cannot measure team performance effectively
Limited ability to learn from historical data
Risk management compromised without proper tracking

Recommendations

Immediate Actions:

Triage the 5 open regressions across networking components
Establish mandatory triage SLOs for all components

Process Improvements:

Set organization-wide triage coverage target: >90%
Implement triage timeliness SLO: <48 hours
Consider requiring triage before allowing regression closure
Provide training/tooling to components with 0% triage coverage
Review Etcd's triage process to understand extreme delays

High-Priority Components for Follow-up:

Networking (multiple sub-components with issues)
Installer / openshift-installer (49 regressions, only 22.4% triaged)
oauth-apiserver (23 regressions, 0% triaged)
Monitoring (28 regressions, only 25% triaged)
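The coverage and timeliness targets recommended in the report (>90% triage coverage, <48 hours to triage) could be checked per component with something like the sketch below. The `Row` type and `slo_misses` helper are hypothetical names introduced for illustration; the sample rows are copied from the scorecard above, with `None` standing in for a `-` cell.

```python
from typing import NamedTuple, Optional

COVERAGE_TARGET_PCT = 90.0   # from the report's recommendation: ">90%"
TRIAGE_TARGET_HOURS = 48.0   # from the report's recommendation: "<48 hours"


class Row(NamedTuple):
    # Hypothetical shape for one scorecard row.
    component: str
    total: int
    triage_pct: float
    triage_hours: Optional[float]  # None where the scorecard shows "-"


def slo_misses(row: Row) -> list[str]:
    """Return the names of the targets this component misses."""
    misses = []
    if row.triage_pct <= COVERAGE_TARGET_PCT:
        misses.append("coverage")
    # With no triage time recorded there is nothing to measure, so this
    # sketch skips the timeliness check instead of counting it as a miss.
    if row.triage_hours is not None and row.triage_hours >= TRIAGE_TARGET_HOURS:
        misses.append("timeliness")
    return misses


rows = [
    Row("kube-apiserver", 6, 66.7, 8),
    Row("Etcd", 3, 66.7, 1657),
    Row("OLM", 1, 100.0, None),
    Row("oauth-apiserver", 23, 0.0, None),
]
report = {r.component: slo_misses(r) for r in rows}
for component, misses in report.items():
    print(component, misses or "meets targets")
```

Whether a missing triage time should count as a pass, a miss, or "not applicable" is a design choice; the sketch treats it as not applicable so that untriaged components are flagged only once, under coverage.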

dgoodwin · 2025-10-31T14:54:43Z

This is all vibe coded. I still need to manually review.


dgoodwin · 2025-10-31T17:46:05Z

Did a cursory pass through the markdown; I think it's good enough to start with and iterate on.

stbenjam · 2025-10-31T18:00:04Z

/lgtm

openshift-ci · 2025-10-31T18:00:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dgoodwin,stbenjam]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot requested review from enxebre and zaneb October 31, 2025 14:49

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2025

dgoodwin changed the title ~~component health 1~~ Add a plugin for analzing health of components regression and triage response Oct 31, 2025

dgoodwin added 25 commits October 31, 2025 14:45

Add a component-health plugin for reporting on regressions and jira data

6a970fa

Component filtering

e524188

Simplify closed and last_failure timestamps from the API

3c62fec

Try to fix counting open regressions halucination

9de1273

Rename to analyze-regressions command

0bdf98f

Improve script response to sort by component and include summary for each

d827b11

DRY for the summary calculation

9425e52

Count triaged regressions

bb274ed

More efficient looping

79fb660

Calculate time to triage as best we can

b6bd31a

Time to closed hrs avg

0940aa0

Track max times as well, and regression open avg and max

492cb7d

Pickup missed file

9b61135

Track time from triaged to closed

a48f7e1

Grade components based on regression data

49323ae

Trim response size a little

8c6f7ad

Slight tweaks to grading criteria

a02b176

Add skill to lookup release dates

099691a

Use release date filtering in the regression health analyzer

0762ce2

Short output option for listing regressions so the analyze command can read it all

7e0d926

Offer to generate an html report after analyzing an entire release

75d79c0

Add html report template

410b39e

Crude attempt to filter our infra regressions (short lived, mass closured on date)

b4f08c4
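The infra-regression filter commit (b4f08c4) describes its heuristic only loosely: regressions that were short lived and closed en masse on the same date. The sketch below shows one way such a filter might work. The thresholds, the `Regression` tuple, and the `suspected_infra` function are all assumptions for illustration and are not taken from the plugin's code.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import NamedTuple, Optional

MAX_LIFETIME = timedelta(hours=24)  # assumed cutoff for "short lived"
MASS_CLOSE_COUNT = 3                # assumed minimum closes on one date to count as "mass"


class Regression(NamedTuple):
    # Hypothetical record shape for this example.
    id: int
    opened: datetime
    closed: Optional[datetime]


def suspected_infra(regressions: list[Regression]) -> set[int]:
    """Flag short-lived regressions that were closed alongside many others on the same date."""
    short_lived = [
        r for r in regressions
        if r.closed is not None and r.closed - r.opened <= MAX_LIFETIME
    ]
    closes_per_day = Counter(r.closed.date() for r in short_lived)
    return {r.id for r in short_lived if closes_per_day[r.closed.date()] >= MASS_CLOSE_COUNT}


# Made-up sample data.
regressions = [
    Regression(1, datetime(2025, 6, 1, 8), datetime(2025, 6, 1, 20)),
    Regression(2, datetime(2025, 6, 1, 9), datetime(2025, 6, 1, 21)),
    Regression(3, datetime(2025, 6, 1, 10), datetime(2025, 6, 1, 22)),
    Regression(4, datetime(2025, 6, 1, 8), datetime(2025, 6, 20, 8)),   # long lived
    Regression(5, datetime(2025, 6, 10, 8), datetime(2025, 6, 10, 12)), # short lived, closed alone
    Regression(6, datetime(2025, 6, 12, 8), None),                      # still open
]
flagged = suspected_infra(regressions)
print(sorted(flagged))
```

A filter like this is necessarily crude: a genuine product regression that is fixed quickly on a busy day would also be flagged, so the thresholds would need tuning against real data.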

Manual review

bbdc655

Update top level plugins doc

6678c95

dgoodwin added 2 commits October 31, 2025 14:45

Add missing skill.md

eafc101

Another make update

171d089

dgoodwin force-pushed the component-health-1 branch from 131aa8b to 171d089 Compare October 31, 2025 17:45

dgoodwin changed the title ~~Add a plugin for analzing health of components regression and triage response~~ Add a plugin for analyzing health of components regression and triage response Oct 31, 2025

openshift-ci bot assigned stbenjam Oct 31, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2025

openshift-merge-bot bot merged commit 68bc257 into openshift-eng:main Oct 31, 2025
3 checks passed
