Add cluster-health-check command to openshift plugin #108

ropatil010 · 2025-10-31T07:05:13Z

Hi Team,

Could you please take a look (PTAL)?
This PR adds a command that checks cluster health and writes any errors it finds to a text file.

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

openshift-ci · 2025-10-31T07:05:25Z

Hi @ropatil010. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ropatil010 · 2025-10-31T07:06:01Z

Report.txt

===============================================
OpenShift Cluster Health Check Report

Cluster Type: OpenShift
Cluster Version: 4.21.0-0.nightly-2025-10-30-125804
Check Time: 2025-10-31 12:15:00
Region: us-east-2 (AWS)
Cluster Age: ~2 hours old

===============================================
SUMMARY

✅ Critical Issues: 0
⚠️ Warnings: 4
ℹ️ Informational: Multiple startup events (expected for new cluster)

Overall Status: HEALTHY with minor warnings

===============================================
DETAILED FINDINGS

CLUSTER OPERATORS
Status: ✅ ALL HEALTHY
- All cluster operators are Available=True
- No operators in Degraded state
- No operators in Unavailable state
- No operators currently Progressing
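
For readers who want to reproduce the operator check by hand, here is a minimal sketch. It is not taken from this PR: it assumes the input is the parsed JSON from `oc get clusteroperators -o json`, and the function name `unhealthy_operators` is hypothetical.

```python
from typing import Dict, List


def unhealthy_operators(clusteroperators: Dict) -> List[str]:
    """Return 'name: problem' strings for operators that look unhealthy.

    Assumes the standard ClusterOperator conditions
    (Available, Degraded, Progressing) with string statuses.
    """
    problems = []
    for item in clusteroperators.get("items", []):
        name = item["metadata"]["name"]
        # Map condition type -> status ("True", "False" or "Unknown").
        conds = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        if conds.get("Available") != "True":
            problems.append(f"{name}: not Available")
        if conds.get("Degraded") == "True":
            problems.append(f"{name}: Degraded")
        if conds.get("Progressing") == "True":
            problems.append(f"{name}: Progressing")
    return problems
```

An empty result corresponds to the "ALL HEALTHY" status reported above.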

NODE HEALTH
Status: ✅ ALL HEALTHY

Nodes Summary:
- Total Nodes: 6 (3 control-plane, 3 worker)
- All nodes in Ready state
- No nodes with scheduling disabled
- No resource pressure conditions (Memory/Disk/PID)
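
The node conditions listed above can be checked in a similar way. This is only a sketch under assumptions: the input is the parsed JSON from `oc get nodes -o json`, and `node_problems` is a hypothetical helper, not code from the PR.

```python
from typing import Dict, List

# Pressure conditions named in the report (Memory/Disk/PID).
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def node_problems(nodes: Dict) -> List[str]:
    """Return 'node: problem' strings for nodes that need attention."""
    problems = []
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        conds = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        if conds.get("Ready") != "True":
            problems.append(f"{name}: NotReady")
        # spec.unschedulable is set when a node has been cordoned.
        if item.get("spec", {}).get("unschedulable", False):
            problems.append(f"{name}: SchedulingDisabled")
        for pressure in PRESSURE_TYPES:
            if conds.get(pressure) == "True":
                problems.append(f"{name}: {pressure}")
    return problems
```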
Node Resource Utilization:
Control Plane Nodes:
- ip-10-0-29-203: CPU 15%, Memory 61% (8976Mi)
- ip-10-0-40-225: CPU 10%, Memory 38% (5598Mi)
- ip-10-0-70-193: CPU 11%, Memory 42% (6207Mi)
Worker Nodes:
- ip-10-0-11-12: CPU 6%, Memory 24% (3592Mi)
- ip-10-0-37-183: CPU 2%, Memory 11% (1728Mi)
- ip-10-0-88-211: CPU 5%, Memory 26% (3863Mi)

POD HEALTH
Status: ⚠️ MINOR WARNINGS

Issues Found:
⚠️ WARNING: Pods with high restart count (>5):
- openshift-machine-config-operator/kube-rbac-proxy-crio-ip-10-0-11-12 [Restarts: 6]
- openshift-machine-config-operator/kube-rbac-proxy-crio-ip-10-0-37-183 [Restarts: 6]
- openshift-machine-config-operator/kube-rbac-proxy-crio-ip-10-0-88-211 [Restarts: 6]
Note: These are minor restarts (6 times) in the kube-rbac-proxy-crio pods,
which are common during initial cluster setup. No pods in CrashLoopBackOff
or failed state currently.
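
The restart warning above uses a threshold of more than 5 restarts. A minimal sketch of that rule, assuming the input is the parsed JSON from `oc get pods -A -o json`; summing restarts across all containers in a pod is an assumption here, and `high_restart_pods` is a hypothetical name rather than the PR's implementation.

```python
from typing import Dict, List

RESTART_THRESHOLD = 5  # the report flags pods with restarts > 5


def high_restart_pods(pods: Dict, threshold: int = RESTART_THRESHOLD) -> List[str]:
    """Return 'namespace/name [Restarts: N]' for pods above the threshold."""
    flagged = []
    for item in pods.get("items", []):
        statuses = item.get("status", {}).get("containerStatuses", [])
        # Assumption: count restarts across every container in the pod.
        restarts = sum(cs.get("restartCount", 0) for cs in statuses)
        if restarts > threshold:
            meta = item["metadata"]
            flagged.append(
                f"{meta['namespace']}/{meta['name']} [Restarts: {restarts}]"
            )
    return flagged
```

A pod with exactly 5 restarts is not flagged, because the comparison is strictly greater than the threshold.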

WORKLOAD CONTROLLERS
Status: ✅ ALL HEALTHY
- Deployments: All healthy, all replicas ready
- StatefulSets: All healthy, all replicas ready
- DaemonSets: All healthy, all desired pods running

STORAGE (PVCs)
Status: ✅ ALL HEALTHY
- All Persistent Volume Claims are in Bound state
- No pending or failed PVCs
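
The PVC check can be sketched the same way. This assumes the parsed JSON from `oc get pvc -A -o json`; `unbound_pvcs` is a hypothetical helper name, not code from the PR.

```python
from typing import Dict, List


def unbound_pvcs(pvcs: Dict) -> List[str]:
    """Return 'namespace/name: phase' for claims that are not Bound."""
    problems = []
    for item in pvcs.get("items", []):
        # A missing phase is reported as "Unknown" rather than skipped.
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            meta = item["metadata"]
            problems.append(f"{meta['namespace']}/{meta['name']}: {phase}")
    return problems
```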

CRITICAL NAMESPACES
Status: ✅ ALL HEALTHY

All critical OpenShift namespaces exist and are healthy:
- openshift-kube-apiserver: ✅
- openshift-etcd: ✅
- openshift-authentication: ✅
- openshift-console: ✅
- openshift-monitoring: ✅
All pods in critical namespaces are Running/Succeeded.

RECENT EVENTS (Last 30 minutes)
Status: ℹ️ INFORMATIONAL

Recent warning events detected are mostly related to cluster startup:

⚠️ Marketplace Image Pull Issues (99m ago):
- openshift-marketplace/qe-app-registry pod failed to pull image from
  quay.io/openshift-qe-optional-operators/aosqe-index:v4.21
- Error: unauthorized access to the requested resource
- Impact: QE optional operators catalog unavailable
- Recommendation: Verify credentials for the optional operators registry
ℹ️ Startup Probes (110-111m ago):
- Normal startup probe failures during cluster initialization
- etcd and kube-apiserver pods showing temporary probe errors
- These are expected during cluster bootstrap
- All probes are now passing
ℹ️ Network Connectivity Checks (100-104m ago):
- Temporary connectivity timeouts detected during cluster setup
- All connectivity issues have resolved
ℹ️ Other startup warnings:
- CustomResourceDefinition already exists warnings (expected)
- Temporary readiness probe failures (resolved)

===============================================
RECOMMENDATIONS

✅ Cluster is generally healthy for a 2-hour-old installation
⚠️ Monitor kube-rbac-proxy-crio pods:
- Currently showing 6 restarts on worker nodes
- Watch for additional restarts with:
  oc get pods -n openshift-machine-config-operator -w
- If restarts continue, check logs:
  oc logs -n openshift-machine-config-operator
⚠️ Resolve QE App Registry authentication:
- The qe-app-registry pod cannot pull images due to auth issues
- If this catalog is needed, verify registry credentials:
  oc get secret -n openshift-marketplace
- Update pull secret if necessary:
  oc set data secret/pull-secret -n openshift-config --from-file=.dockerconfigjson=
ℹ️ Continue Monitoring:
- Cluster is in early stages (2 hours old)
- Continue to monitor for next 24-48 hours
- Check cluster operator status periodically:
  oc get clusteroperators
- Monitor resource utilization trends
✅ No immediate action required:
- All core components are healthy
- No critical failures detected
- Cluster is stable and operational

===============================================
NEXT STEPS

If deploying workloads:
- Cluster is ready for application deployment
- All critical services are operational
If troubleshooting:
- Most warning events are historical (from startup)
- No current critical issues require attention
- Focus on the marketplace authentication if QE catalog is needed
For ongoing monitoring:
- Run this health check periodically
- Set up alerts for cluster operator degradation
- Monitor node resource utilization trends

===============================================
HEALTH CHECK EXIT CODE: 0 (Success)
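
The report exits with code 0 even though it lists four warnings, which suggests that warnings alone do not fail the check. The sketch below illustrates that reading only. The exact mapping used by the command is not shown in this thread, so the nonzero value `1` and the function name `exit_code` are assumptions.

```python
def exit_code(critical: int, warnings: int) -> int:
    """Map issue counts to a process exit code.

    Assumption: only critical issues make the check fail;
    warnings are reported but still exit with 0.
    """
    if critical < 0 or warnings < 0:
        raise ValueError("issue counts must be non-negative")
    return 1 if critical > 0 else 0
```

With the counts from the summary above (0 critical, 4 warnings), `exit_code(0, 4)` returns 0.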

stbenjam · 2025-10-31T15:12:50Z

/ok-to-test

stbenjam · 2025-11-03T11:14:26Z

/lgtm

openshift-ci · 2025-11-03T11:14:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ropatil010, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [stbenjam]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Rohit Patil and others added 2 commits October 31, 2025 11:47

Add cluster-health-check command to openshift plugin

882b794

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

add deps

bbb3fcd

openshift-ci bot requested review from bryan-cox and stleerh October 31, 2025 07:05

openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2025

ropatil010 changed the title ~~Healthcheck~~ Cluster-Healthcheck Oct 31, 2025

ropatil010 changed the title ~~Cluster-Healthcheck~~ Add cluster-health-check command to openshift plugin Oct 31, 2025

openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 31, 2025

openshift-ci bot assigned stbenjam Nov 3, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2025

openshift-merge-bot bot merged commit 666856d into openshift-eng:main Nov 3, 2025
3 checks passed
