@dgoodwin
Contributor

@dgoodwin dgoodwin commented Oct 31, 2025

This plugin analyzes regression data for the GA window of a given release, optionally for a specific set of components. It attempts to assign a grade using parameters we'll probably have to adjust over time. It looks at the number of regressions, how many got triaged to a Jira, how long it took for them to get triaged, and how long it took for them to be closed.

The hope is for this plugin to expand to consider Jira data and other factors.

Concerns raised here are not a direct critique of teams. This is a tool we want to use to see whether we're improving as we scale up efforts to improve product stability.

It offers nice conversational options now such as listing open regressions for a component, comparing how a component did across two releases, etc.

Sample command output for 4.20 overall:

Component Health Report: Release 4.20

Release Window: 2025-05-02 to 2025-10-21 (GA'd)
Total Regressions: 321 (76 suspected infra regressions filtered out)


Overall Health Grade: ❌ Poor

| Metric | Score | Grade | Details |
| --- | --- | --- | --- |
| Triage Coverage | 31.8% | ❌ Poor | Only 102 of 321 regressions triaged |
| Triage Timeliness | 87 hrs avg | ⚠️ Needs Improvement | ~3.6 days to triage (Max: 1657 hrs) |
| Resolution Speed | 142 hrs avg | ✅ Good | ~5.9 days to close (Max: 2644 hrs) |
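
For readers who want to see how the three metrics above relate to the raw counts, here is a minimal sketch. The `Regression` record and its field names are hypothetical and chosen for illustration; they are not the plugin's actual schema, and the grade thresholds are deliberately left out because the PR does not state them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Regression:
    # Hypothetical record shape for illustration; not the plugin's real schema.
    component: str
    hours_to_triage: Optional[float] = None  # None means never triaged
    hours_to_close: Optional[float] = None   # None means still open


def triage_coverage(regressions: list) -> float:
    """Percentage of regressions that were triaged at all."""
    if not regressions:
        return 0.0
    triaged = sum(1 for r in regressions if r.hours_to_triage is not None)
    return 100.0 * triaged / len(regressions)


def mean_hours(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average of the known durations, or None when there are none."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None


def hours_to_days(hours: float) -> float:
    """Convert an hour count to days for the '~N days' detail column."""
    return hours / 24.0


if __name__ == "__main__":
    # Synthetic data shaped like the 4.20 totals: 102 of 321 regressions triaged.
    sample = [Regression("example", hours_to_triage=87.0, hours_to_close=142.0)] * 102
    sample += [Regression("example")] * 219
    print(f"Triage coverage: {triage_coverage(sample):.1f}%")
    avg_triage = mean_hours(r.hours_to_triage for r in sample)
    print(f"Triage timeliness: {avg_triage:.0f} hrs avg (~{hours_to_days(avg_triage):.1f} days)")
```

Averages are taken only over regressions that have a value, which would match the scorecard showing "-" where no triage or resolution time exists.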

Regression Breakdown

| Status | Total | Triaged | Triage % |
| --- | --- | --- | --- |
| Open | 5 | 0 | 0.0% |
| Closed | 316 | 102 | 32.3% |

Critical: 5 open regressions remain untriaged and need immediate attention.


Component Health Scorecard

Components ranked from best to worst health:

| Component | Total | Triage Coverage | Triage Time (hrs) | Resolution Time (hrs) | Open | Health Grade |
| --- | --- | --- | --- | --- | --- | --- |
| Cloud Compute / Cloud Controller Manager | 1 | 100.0% | 138 | 168 | 0 | ✅ Excellent |
| OLM | 1 | 100.0% | - | - | 0 | ✅ Excellent |
| Pod Autoscaler | 1 | 100.0% | 193 | 252 | 0 | ⚠️ Good |
| Build | 5 | 80.0% | 138 | 160 | 0 | ⚠️ Good |
| Etcd | 3 | 66.7% | 1657 | 1217 | 0 | ❌ Poor (slow triage) |
| Unknown | 3 | 66.7% | 28 | 72 | 0 | ⚠️ Good |
| kube-apiserver | 6 | 66.7% | 8 | 182 | 0 | ⚠️ Good |
| Node / Kubelet | 8 | 62.5% | 26 | 86 | 0 | ⚠️ Good |
| HyperShift | 74 | 58.1% | 48 | 244 | 0 | ⚠️ Needs Improvement |
| Networking / cluster-network-operator | 25 | 56.0% | 46 | 103 | 2 | ⚠️ Needs Improvement |
| Cluster Version Operator | 2 | 50.0% | 46 | 50 | 0 | ⚠️ Needs Improvement |
| Test Framework | 7 | 42.9% | 58 | 289 | 0 | ⚠️ Needs Improvement |
| kube-controller-manager | 3 | 33.3% | - | 56 | 0 | ❌ Poor |
| Monitoring | 28 | 25.0% | 41 | 81 | 0 | ❌ Poor |
| Installer / openshift-installer | 49 | 22.4% | 80 | 147 | 0 | ❌ Poor |
| Storage | 18 | 5.6% | 39 | 52 | 0 | ❌ Poor |
| Networking / router | 21 | 4.8% | 93 | 61 | 1 | ❌ Poor |
| Cloud Compute / Unknown | 12 | 0.0% | - | 10 | 0 | ❌ Poor |
| Cloud Credential Operator | 7 | 0.0% | - | 10 | 0 | ❌ Poor |
| Cluster Autoscaler | 6 | 0.0% | - | 11 | 0 | ❌ Poor |
| Image Registry | 8 | 0.0% | - | 21 | 0 | ❌ Poor |
| Management Console | 6 | 0.0% | - | 10 | 0 | ❌ Poor |
| Networking / ovn-kubernetes | 2 | 0.0% | - | - | 2 | ❌ Poor |
| oauth-apiserver | 23 | 0.0% | - | 70 | 0 | ❌ Poor |
| openshift-controller-manager / apps | 2 | 0.0% | - | - | 0 | ❌ Poor |
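
A scorecard like the one above could be produced by grouping regressions per component and sorting by triage coverage. The sketch below assumes a simple `(component, was_triaged)` pair as input, which is a made-up shape for illustration. The ordering rule (coverage descending, then name) is also an assumption, since the PR does not say how ties or the health grade factor into the ranking.

```python
from collections import defaultdict


def scorecard(regressions):
    """Group (component, was_triaged) pairs and rank by triage coverage."""
    totals = defaultdict(int)
    triaged = defaultdict(int)
    for component, was_triaged in regressions:
        totals[component] += 1
        if was_triaged:
            triaged[component] += 1
    rows = [
        (component, totals[component], 100.0 * triaged[component] / totals[component])
        for component in totals
    ]
    # Highest coverage first; ties broken by component name (an arbitrary choice).
    return sorted(rows, key=lambda row: (-row[2], row[0]))


if __name__ == "__main__":
    # Triaged counts here are back-computed from the table's percentages
    # (e.g. 80.0% of 5 for Build); they are not separate data from the PR.
    data = (
        [("Build", True)] * 4 + [("Build", False)] * 1
        + [("Storage", True)] * 1 + [("Storage", False)] * 17
        + [("OLM", True)] * 1
    )
    for component, total, coverage in scorecard(data):
        print(f"{component:10} {total:3d} {coverage:5.1f}%")
```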

Components Needing Immediate Attention

Open Untriaged Regressions (Action Required):

  • Networking / cluster-network-operator: 2 open untriaged regressions
  • Networking / ovn-kubernetes: 2 open untriaged regressions
  • Networking / router: 1 open untriaged regression

Components with Zero Triage Coverage:
8 components had regressions but 0% were triaged:

  • oauth-apiserver (23 regressions)
  • Cloud Compute / Unknown (12 regressions)
  • Image Registry (8 regressions)
  • Cloud Credential Operator (7 regressions)
  • Cluster Autoscaler (6 regressions)
  • Management Console (6 regressions)
  • Networking / ovn-kubernetes (2 regressions)
  • openshift-controller-manager / apps (2 regressions)

Components with Poor Triage Coverage (<50%):

  • Installer / openshift-installer: 22.4% (49 total regressions)
  • Monitoring: 25.0% (28 total regressions)
  • Storage: 5.6% (18 total regressions)
  • Networking / router: 4.8% (21 total regressions)

Components with Very Slow Triage Times:

  • Etcd: 1657 hours average (~69 days) - extremely slow
  • Pod Autoscaler: 193 hours average (~8 days)
  • Build: 138 hours average (~5.75 days)

Key Findings

Strengths:

  • Good overall resolution speed (~6 days average)
  • Release has shipped with only 5 open regressions remaining
  • Several small components achieved 100% triage coverage

Critical Issues:

  1. Very poor triage coverage: Only 31.8% of regressions were triaged
    • 219 regressions (68.2%) were never linked to JIRA bugs
    • 8 components had 0% triage coverage
  2. Open regressions remain untriaged: 5 open regressions need immediate attention
  3. Inconsistent triage practices: Wide variance between components (0% to 100%)
  4. Slow triage for some components: Etcd took 69 days average to triage

Impact:

  • Difficult to track accountability for fixes
  • Cannot measure team performance effectively
  • Limited ability to learn from historical data
  • Risk management compromised without proper tracking

Recommendations

Immediate Actions:

  1. Triage the 5 open regressions across networking components
  2. Establish mandatory triage SLOs for all components

Process Improvements:

  1. Set organization-wide triage coverage target: >90%
  2. Implement triage timeliness SLO: <48 hours
  3. Consider requiring triage before allowing regression closure
  4. Provide training/tooling to components with 0% triage coverage
  5. Review Etcd's triage process to understand extreme delays
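
To make the two proposed targets concrete, here is a small sketch that checks a scorecard row against ">90% coverage" and "<48 hours to triage". The `Row` type and `slo_gaps` function are hypothetical names introduced for this example. Treating a missing triage time as "not evaluated" is also an assumption, since the report shows "-" for components with no triaged regressions.

```python
from typing import NamedTuple, Optional

COVERAGE_TARGET_PCT = 90.0  # from "triage coverage target: >90%"
TRIAGE_SLO_HOURS = 48.0     # from "triage timeliness SLO: <48 hours"


class Row(NamedTuple):
    component: str
    coverage_pct: float
    triage_hours: Optional[float]  # None where the scorecard shows "-"


def slo_gaps(row: Row) -> list:
    """Return the names of the targets this component misses."""
    gaps = []
    if row.coverage_pct <= COVERAGE_TARGET_PCT:
        gaps.append("coverage")
    # A missing triage time is skipped rather than counted as a miss (assumption).
    if row.triage_hours is not None and row.triage_hours >= TRIAGE_SLO_HOURS:
        gaps.append("timeliness")
    return gaps


if __name__ == "__main__":
    # Values copied from the 4.20 scorecard above.
    rows = [
        Row("Cloud Compute / Cloud Controller Manager", 100.0, 138.0),
        Row("kube-apiserver", 66.7, 8.0),
        Row("oauth-apiserver", 0.0, None),
    ]
    for row in rows:
        print(row.component, slo_gaps(row) or "meets both targets")
```

Note that a component can hit 100% coverage and still miss the timeliness target, which is why the two checks are kept separate.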

High-Priority Components for Follow-up:

  • Networking (multiple sub-components with issues)
  • Installer / openshift-installer (49 regressions, only 22.4% triaged)
  • oauth-apiserver (23 regressions, 0% triaged)
  • Monitoring (28 regressions, only 25% triaged)

@openshift-ci openshift-ci bot requested review from enxebre and zaneb October 31, 2025 14:49
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2025
@dgoodwin dgoodwin changed the title component health 1 Add a plugin for analzing health of components regression and triage response Oct 31, 2025
@dgoodwin
Contributor Author

This is all vibe coded. I still need to manually review.

@dgoodwin
Contributor Author

Did a cursory pass through the markdown. I think it's good enough to start with and iterate on.

@dgoodwin dgoodwin changed the title Add a plugin for analzing health of components regression and triage response Add a plugin for analyzing health of components regression and triage response Oct 31, 2025
@stbenjam
Member

/lgtm

@openshift-ci

openshift-ci bot commented Oct 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, stbenjam


@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 68bc257 into openshift-eng:main Oct 31, 2025
3 checks passed