@ropatil010
Hi Team,

Could you please take a look?
This PR adds a command that checks cluster health and writes any errors it finds to a text file.

Rohit Patil and others added 2 commits October 31, 2025 11:47
@openshift-ci openshift-ci bot requested review from bryan-cox and stleerh October 31, 2025 07:05
@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2025
@openshift-ci

openshift-ci bot commented Oct 31, 2025

Hi @ropatil010. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ropatil010 ropatil010 changed the title Healthcheck Cluster-Healthcheck Oct 31, 2025
@ropatil010
Author

Report.txt

===============================================
OpenShift Cluster Health Check Report

Cluster Type: OpenShift
Cluster Version: 4.21.0-0.nightly-2025-10-30-125804
Check Time: 2025-10-31 12:15:00
Region: us-east-2 (AWS)
Cluster Age: ~2 hours old

===============================================
SUMMARY

✅ Critical Issues: 0
⚠️ Warnings: 4
ℹ️ Informational: Multiple startup events (expected for new cluster)

Overall Status: HEALTHY with minor warnings

===============================================
DETAILED FINDINGS

  1. CLUSTER OPERATORS
    Status: ✅ ALL HEALTHY

    • All cluster operators are Available=True
    • No operators in Degraded state
    • No operators in Unavailable state
    • No operators currently Progressing
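For reviewers, the operator check above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the input is the parsed JSON from `oc get clusteroperators -o json`, and the function name and sample data are hypothetical.

```python
def unhealthy_operators(clusteroperators: dict) -> dict:
    """Return {operator_name: [unexpected conditions]} for operators that are not ideal."""
    # Condition status expected on a healthy, settled operator.
    expected = {"Available": "True", "Degraded": "False", "Progressing": "False"}
    problems = {}
    for item in clusteroperators.get("items", []):
        name = item["metadata"]["name"]
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        bad = [
            f"{ctype}={conditions.get(ctype, 'Unknown')}"
            for ctype, want in expected.items()
            if conditions.get(ctype) != want
        ]
        if bad:
            problems[name] = bad
    return problems


# Hypothetical sample shaped like `oc get clusteroperators -o json` output.
sample_operators = {
    "items": [
        {
            "metadata": {"name": "etcd"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Degraded", "status": "False"},
                {"type": "Progressing", "status": "False"},
            ]},
        },
        {
            "metadata": {"name": "example-operator"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Degraded", "status": "True"},
                {"type": "Progressing", "status": "False"},
            ]},
        },
    ]
}

print(unhealthy_operators(sample_operators))
```

An empty result corresponds to the "ALL HEALTHY" status reported for this section.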
  2. NODE HEALTH
    Status: ✅ ALL HEALTHY

    Nodes Summary:

    • Total Nodes: 6 (3 control-plane, 3 worker)
    • All nodes in Ready state
    • No nodes with scheduling disabled
    • No resource pressure conditions (Memory/Disk/PID)
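The node checks listed above (Ready state, scheduling disabled, pressure conditions) might look something like this sketch. It assumes the parsed output of `oc get nodes -o json`; the function name and sample node names are hypothetical and not taken from this PR.

```python
def node_findings(nodes: dict) -> list:
    """Return human-readable findings for nodes that fail any basic check."""
    pressure_types = ("MemoryPressure", "DiskPressure", "PIDPressure")
    findings = []
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        # A node is considered healthy only if its Ready condition is "True".
        if conditions.get("Ready") != "True":
            findings.append(f"{name}: not Ready")
        # Cordoned nodes carry spec.unschedulable = true.
        if item.get("spec", {}).get("unschedulable", False):
            findings.append(f"{name}: scheduling disabled")
        for pressure in pressure_types:
            if conditions.get(pressure) == "True":
                findings.append(f"{name}: {pressure}")
    return findings


# Hypothetical sample shaped like `oc get nodes -o json` output.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "worker-a"},
            "spec": {},
            "status": {"conditions": [
                {"type": "Ready", "status": "True"},
                {"type": "MemoryPressure", "status": "False"},
            ]},
        },
        {
            "metadata": {"name": "worker-b"},
            "spec": {"unschedulable": True},
            "status": {"conditions": [
                {"type": "Ready", "status": "False"},
                {"type": "DiskPressure", "status": "True"},
            ]},
        },
    ]
}

for finding in node_findings(sample_nodes):
    print(finding)
```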

    Node Resource Utilization:
    Control Plane Nodes:

    • ip-10-0-29-203: CPU 15%, Memory 61% (8976Mi)
    • ip-10-0-40-225: CPU 10%, Memory 38% (5598Mi)
    • ip-10-0-70-193: CPU 11%, Memory 42% (6207Mi)

    Worker Nodes:

    • ip-10-0-11-12: CPU 6%, Memory 24% (3592Mi)
    • ip-10-0-37-183: CPU 2%, Memory 11% (1728Mi)
    • ip-10-0-88-211: CPU 5%, Memory 26% (3863Mi)

  3. POD HEALTH
    Status: ⚠️ MINOR WARNINGS

    Issues Found:
    ⚠️ WARNING: Pods with high restart count (>5):

    • openshift-machine-config-operator/kube-rbac-proxy-crio-ip-10-0-11-12 [Restarts: 6]
    • openshift-machine-config-operator/kube-rbac-proxy-crio-ip-10-0-37-183 [Restarts: 6]
    • openshift-machine-config-operator/kube-rbac-proxy-crio-ip-10-0-88-211 [Restarts: 6]

    Note: Each of these kube-rbac-proxy-crio pods has restarted 6 times, which
    is common during initial cluster setup. No pods are currently in
    CrashLoopBackOff or a failed state.
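One way the ">5 restarts" rule could be expressed is shown below. This is only a sketch under assumptions: it takes the parsed output of `oc get pods -A -o json`, sums `restartCount` over a pod's containers (the report does not say how restarts are counted), and uses hypothetical names throughout.

```python
RESTART_THRESHOLD = 5  # the report flags pods with more than 5 restarts


def high_restart_pods(pods: dict, threshold: int = RESTART_THRESHOLD) -> list:
    """Return (namespace/name, restarts) for pods restarted more than `threshold` times."""
    flagged = []
    for item in pods.get("items", []):
        statuses = item.get("status", {}).get("containerStatuses", [])
        # Assumption: a pod's restart count is the sum over its containers.
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        if restarts > threshold:
            meta = item["metadata"]
            flagged.append((f'{meta["namespace"]}/{meta["name"]}', restarts))
    return flagged


# Hypothetical sample shaped like `oc get pods -A -o json` output.
sample_pods = {
    "items": [
        {
            "metadata": {"namespace": "ns-a", "name": "pod-a"},
            "status": {"containerStatuses": [
                {"restartCount": 4},
                {"restartCount": 2},
            ]},
        },
        {
            "metadata": {"namespace": "ns-b", "name": "pod-b"},
            "status": {"containerStatuses": [{"restartCount": 5}]},
        },
    ]
}

for pod, restarts in high_restart_pods(sample_pods):
    print(f"{pod} [Restarts: {restarts}]")
```

Exactly 5 restarts is not flagged here because the comparison is strict, matching the ">5" wording in the report.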

  4. WORKLOAD CONTROLLERS
    Status: ✅ ALL HEALTHY

    • Deployments: All healthy, all replicas ready
    • StatefulSets: All healthy, all replicas ready
    • DaemonSets: All healthy, all desired pods running
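The "all replicas ready" checks could be approximated as in the sketch below. It assumes the parsed output of `oc get deployments,statefulsets,daemonsets -A -o json`, where each item carries its own `kind`; the function name and sample data are hypothetical, and the PR may compare different fields.

```python
def not_fully_ready(workloads: dict) -> list:
    """Return a finding for every controller with fewer ready pods than desired."""
    findings = []
    for item in workloads.get("items", []):
        kind = item.get("kind", "Unknown")
        meta = item["metadata"]
        status = item.get("status", {})
        if kind == "DaemonSet":
            desired = status.get("desiredNumberScheduled", 0)
            ready = status.get("numberReady", 0)
        else:
            # Deployments and StatefulSets: spec.replicas defaults to 1 in the API;
            # status.readyReplicas is omitted from the JSON when it is zero.
            desired = item.get("spec", {}).get("replicas", 1)
            ready = status.get("readyReplicas", 0)
        if ready < desired:
            findings.append(
                f'{kind} {meta["namespace"]}/{meta["name"]}: {ready}/{desired} ready'
            )
    return findings


# Hypothetical sample with one healthy and two unhealthy controllers.
sample_workloads = {
    "items": [
        {"kind": "Deployment",
         "metadata": {"namespace": "ns-a", "name": "web"},
         "spec": {"replicas": 3},
         "status": {"readyReplicas": 3}},
        {"kind": "DaemonSet",
         "metadata": {"namespace": "ns-b", "name": "agent"},
         "status": {"desiredNumberScheduled": 6, "numberReady": 5}},
        {"kind": "StatefulSet",
         "metadata": {"namespace": "ns-c", "name": "db"},
         "spec": {"replicas": 2},
         "status": {}},
    ]
}

for finding in not_fully_ready(sample_workloads):
    print(finding)
```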
  5. STORAGE (PVCs)
    Status: ✅ ALL HEALTHY

    • All Persistent Volume Claims are in Bound state
    • No pending or failed PVCs
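The PVC check reduces to looking at each claim's phase. A minimal sketch, assuming the parsed output of `oc get pvc -A -o json` and hypothetical names:

```python
def unbound_pvcs(pvcs: dict) -> list:
    """Return (namespace/name, phase) for every PVC whose phase is not Bound."""
    findings = []
    for item in pvcs.get("items", []):
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            meta = item["metadata"]
            findings.append((f'{meta["namespace"]}/{meta["name"]}', phase))
    return findings


# Hypothetical sample shaped like `oc get pvc -A -o json` output.
sample_pvcs = {
    "items": [
        {"metadata": {"namespace": "ns-a", "name": "data-0"},
         "status": {"phase": "Bound"}},
        {"metadata": {"namespace": "ns-a", "name": "data-1"},
         "status": {"phase": "Pending"}},
    ]
}

print(unbound_pvcs(sample_pvcs))
```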
  6. CRITICAL NAMESPACES
    Status: ✅ ALL HEALTHY

    All critical OpenShift namespaces exist and are healthy:

    • openshift-kube-apiserver: ✅
    • openshift-etcd: ✅
    • openshift-authentication: ✅
    • openshift-console: ✅
    • openshift-monitoring: ✅

    All pods in critical namespaces are Running/Succeeded.

  7. RECENT EVENTS (Last 30 minutes)
    Status: ℹ️ INFORMATIONAL

    Recent warning events detected are mostly related to cluster startup:

    ⚠️ Marketplace Image Pull Issues (99m ago):

    • openshift-marketplace/qe-app-registry pod failed to pull image from
      quay.io/openshift-qe-optional-operators/aosqe-index:v4.21
    • Error: unauthorized access to the requested resource
    • Impact: QE optional operators catalog unavailable
    • Recommendation: Verify credentials for the optional operators registry

    ℹ️ Startup Probes (110-111m ago):

    • Normal startup probe failures during cluster initialization
    • etcd and kube-apiserver pods showing temporary probe errors
    • These are expected during cluster bootstrap
    • All probes are now passing

    ℹ️ Network Connectivity Checks (100-104m ago):

    • Temporary connectivity timeouts detected during cluster setup
    • All connectivity issues have resolved

    ℹ️ Other startup warnings:

    • CustomResourceDefinition already exists warnings (expected)
    • Temporary readiness probe failures (resolved)

===============================================
RECOMMENDATIONS

  1. ✅ Cluster is generally healthy for a 2-hour-old installation

  2. ⚠️ Monitor kube-rbac-proxy-crio pods:

    • Currently showing 6 restarts on worker nodes
    • Watch for additional restarts with:
      oc get pods -n openshift-machine-config-operator -w
    • If restarts continue, check logs:
      oc logs -n openshift-machine-config-operator

  3. ⚠️ Resolve QE App Registry authentication:

    • The qe-app-registry pod cannot pull images due to auth issues
    • If this catalog is needed, verify registry credentials:
      oc get secret -n openshift-marketplace
    • Update pull secret if necessary:
      oc set data secret/pull-secret -n openshift-config --from-file=.dockerconfigjson=

  4. ℹ️ Continue Monitoring:

    • Cluster is in early stages (2 hours old)
    • Continue to monitor for next 24-48 hours
    • Check cluster operator status periodically:
      oc get clusteroperators
    • Monitor resource utilization trends

  5. ✅ No immediate action required:

    • All core components are healthy
    • No critical failures detected
    • Cluster is stable and operational

===============================================
NEXT STEPS

  1. If deploying workloads:

    • Cluster is ready for application deployment
    • All critical services are operational

  2. If troubleshooting:

    • Most warning events are historical (from startup)
    • No current critical issues require attention
    • Focus on the marketplace authentication issue if the QE catalog is needed

  3. For ongoing monitoring:

    • Run this health check periodically
    • Set up alerts for cluster operator degradation
    • Monitor node resource utilization trends

===============================================
HEALTH CHECK EXIT CODE: 0 (Success)

@ropatil010 ropatil010 changed the title Cluster-Healthcheck Add cluster-health-check command to openshift plugin Oct 31, 2025
@stbenjam
Member

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 31, 2025
@stbenjam
Member

stbenjam commented Nov 3, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2025
@openshift-ci

openshift-ci bot commented Nov 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ropatil010, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 666856d into openshift-eng:main Nov 3, 2025
3 checks passed
2 participants