@OdedViner OdedViner commented Aug 19, 2025

  • Register /healthz endpoint in main.go
  • Add liveness, readiness, and startup probes to manager Deployment

https://issues.redhat.com/browse/RHSTOR-7587
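Based on the probe values restored during the verification steps below (and the `timeoutSeconds: 5` visible in the review diff), the probes added to the manager Deployment likely look roughly like this. This is a sketch reconstructed from those values, not the actual manifest:

```yaml
# Sketch only: reconstructed from the patch values used in this test run.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: healthz
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```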

Tested with a private image:
Test passed on IBM Cloud OCP 4.20.0-0.nightly-2025-10-30-114955

1. Create a private image

export REGISTRY_NAMESPACE=oviner
export IMAGE_TAG=liveness-nov3
make ocs-operator
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator:$IMAGE_TAG
make ocs-metrics-exporter
podman push quay.io/$REGISTRY_NAMESPACE/ocs-metrics-exporter:$IMAGE_TAG
make operator-bundle
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-bundle:$IMAGE_TAG
make operator-catalog
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-catalog:$IMAGE_TAG
oc label nodes oviner5-ocs-jt5gb-worker-1-hbd44 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes oviner5-ocs-jt5gb-worker-2-6mv5s cluster.ocs.openshift.io/openshift-storage=''
oc label nodes oviner5-ocs-jt5gb-worker-3-lxg9w cluster.ocs.openshift.io/openshift-storage=''
make install
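The four build-and-push pairs above follow one pattern. Under the same environment variables, they could be sketched as a loop; the make-target-to-image-name mapping below is taken directly from the commands above (illustrative only, nothing is actually built or pushed here):

```shell
# Print the build/push pair for each component.
# Mapping: make target -> image name, as used in the commands above.
REGISTRY_NAMESPACE=oviner
IMAGE_TAG=liveness-nov3
for pair in \
    "ocs-operator:ocs-operator" \
    "ocs-metrics-exporter:ocs-metrics-exporter" \
    "operator-bundle:ocs-operator-bundle" \
    "operator-catalog:ocs-operator-catalog"; do
  target=${pair%%:*}   # part before the colon: the make target
  image=${pair##*:}    # part after the colon: the image name
  echo "make $target && podman push quay.io/$REGISTRY_NAMESPACE/$image:$IMAGE_TAG"
done
```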

2. Check the operator:

$ oc get csv -n openshift-storage 
NAME                                        DISPLAY                            VERSION   REPLACES   PHASE
cephcsi-operator.v4.21.0                    CephCSI operator                   4.21.0               Succeeded
csi-addons.v0.12.0                          CSI Addons                         0.12.0               Succeeded
noobaa-operator.v5.20.0                     NooBaa Operator                    5.20.0               Succeeded
ocs-client-operator.v4.21.0                 OpenShift Data Foundation Client   4.21.0               Succeeded
ocs-operator.v4.21.0                        OpenShift Container Storage        4.21.0               Succeeded
odf-external-snapshotter-operator.v4.20.0   Snapshot Controller                4.20.0               Succeeded
recipe.v0.0.1                               Recipe                             0.0.1                Succeeded
rook-ceph-operator.v4.20.0                  Rook-Ceph                          4.20.0               Succeeded

$ oc get pod -n openshift-storage ocs-operator-7d6d9497df-hdnns -o yaml | grep oviner
    containerImage: quay.io/oviner/ocs-operator:liveness-nov3
      value: quay.io/oviner/ocs-metrics-exporter:liveness-nov3
      value: quay.io/oviner/ocs-operator:liveness-nov3
      value: quay.io/oviner/ocs-operator:liveness-nov3
    image: quay.io/oviner/ocs-operator:liveness-nov3
  nodeName: oviner2-ocs-622qc-worker-3-x2smh
3. Inspect the current livenessProbe path in the CSV
$ oc -n openshift-storage get csv ocs-operator.v4.21.0   -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].livenessProbe.httpGet.path}{"\n"}'
/healthz
4. Patch the CSV to force a liveness failure
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/httpGet/path","value":"/bad"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":5},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":5},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":1}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched
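Each entry in the patch above is an RFC 6902 `replace` operation whose `path` is a JSON Pointer into the CSV object. As a minimal illustration of how such a pointer resolves against the nested structure (a pure-Python sketch with a toy CSV fragment, not the actual `oc` implementation; the real JSON Pointer spec also unescapes `~0`/`~1`, which is omitted here):

```python
def resolve_parent(doc, pointer):
    """Walk a JSON Pointer, returning the parent container and the final key.

    Simplified: no ~0/~1 unescaping; numeric segments index into lists.
    """
    parts = pointer.lstrip("/").split("/")
    node = doc
    for part in parts[:-1]:
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node, parts[-1]

def apply_replace(doc, pointer, value):
    """Apply one RFC 6902 'replace' operation in place."""
    parent, key = resolve_parent(doc, pointer)
    if isinstance(parent, list):
        parent[int(key)] = value
    else:
        parent[key] = value

# Toy fragment mirroring the structure the patch above targets.
csv = {"spec": {"install": {"spec": {"deployments": [
    {"spec": {"template": {"spec": {"containers": [
        {"livenessProbe": {"httpGet": {"path": "/healthz"},
                           "failureThreshold": 3}}
    ]}}}}
]}}}}

apply_replace(
    csv,
    "/spec/install/spec/deployments/0/spec/template/spec/containers/0"
    "/livenessProbe/httpGet/path",
    "/bad",
)
```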
5. Verify the Deployment picked up the change and watch the restarts
$  oc -n openshift-storage get pod -l name=ocs-operator -w
NAME                           READY   STATUS    RESTARTS     AGE
ocs-operator-7f8fcf75c-s7nj5   0/1     Running   1 (9s ago)   19s
ocs-operator-7f8fcf75c-s7nj5   1/1     Running   1 (10s ago)   20s
ocs-operator-7f8fcf75c-s7nj5   0/1     Running   2 (2s ago)    22s
ocs-operator-7f8fcf75c-s7nj5   1/1     Running   2 (10s ago)   30s
ocs-operator-7f8fcf75c-s7nj5   0/1     Running   3 (2s ago)    32s
ocs-operator-7f8fcf75c-s7nj5   1/1     Running   3 (10s ago)   40s
ocs-operator-7f8fcf75c-s7nj5   0/1     CrashLoopBackOff   3 (1s ago)    41s
6. Restore the CSV to the good settings
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/httpGet/path","value":"/healthz"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/httpGet/port","value":"healthz"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":15},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":10},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":3}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched

7. Check the ocs-operator pod status

$ oc -n openshift-storage get pod -l name=ocs-operator 
NAME                            READY   STATUS    RESTARTS   AGE
ocs-operator-7d6d9497df-wqd5x   1/1     Running   0          101s

8. Inspect the current readinessProbe in the CSV

$ oc -n openshift-storage get csv ocs-operator.v4.21.0 \
  -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].readinessProbe.httpGet.path}{"\n"}'
/readyz


$ oc -n openshift-storage get csv ocs-operator.v4.21.0 \
  -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].readinessProbe.httpGet.port}{"\n"}'
healthz
9. Patch the CSV to force a readiness failure:
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/bad-ready"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":5},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/periodSeconds","value":5},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/failureThreshold","value":1}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched

10. Verify the pod becomes NotReady but does not restart
$ oc -n openshift-storage get pod -l name=ocs-operator 
NAME                            READY   STATUS    RESTARTS   AGE
ocs-operator-86d4c9f999-fpb62   0/1     Running   0          2m24s

11. Restore the readinessProbe to the good settings
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/readyz"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/httpGet/port","value":"healthz"},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":5},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/periodSeconds","value":10},
  {"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/failureThreshold","value":3}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched

12. Confirm the pod is Ready again (takes about 3 minutes):

$ oc -n openshift-storage get pod -l name=ocs-operator 
NAME                            READY   STATUS    RESTARTS   AGE
ocs-operator-7d6d9497df-qn5l6   1/1     Running   0          17s

@OdedViner OdedViner force-pushed the add_liveness_probe branch 2 times, most recently from b374b75 to 93cadc4 Compare August 19, 2025 15:08
@iamniting iamniting requested a review from nb-ohad August 21, 2025 06:22
@OdedViner OdedViner force-pushed the add_liveness_probe branch 2 times, most recently from b52ed49 to 866666e Compare August 21, 2025 09:07
@OdedViner OdedViner requested a review from iamniting August 21, 2025 13:44
nb-ohad commented Aug 28, 2025

/hold pending a discussion of the requirements

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 28, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 6, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2025
@iamniting
/test ocs-operator-bundle-e2e-aws

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 28, 2025
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startupProbe:

@iamniting iamniting Oct 28, 2025


Why do we need this? It is the same as the livenessProbe, and you already set initialDelaySeconds in the livenessProbe. Moreover, startup probes are for slow-starting containers, and ocs-operator is not a slow-starting container.



Do not introduce a new commit; please make the changes in the first commit itself. It does not make sense to delete something in a later commit that was introduced in a previous commit of the same PR.


@iamniting I agree with you. I tested the first commit on an OCP cluster and verified that it works as expected with the private image. I just wanted to keep the working version from the first commit for now; once I retest, I'll combine the two commits into one.

@OdedViner OdedViner Nov 3, 2025


Test passed on IBM Cloud OCP 4.20.0-0.nightly-2025-10-30-114955, following the same verification steps and with the same results as in the PR description above.

@iamniting iamniting self-requested a review October 30, 2025 13:27
- Register /healthz endpoint in main.go
- Add liveness, readiness, and startup probes to manager Deployment

Signed-off-by: Oded Viner <[email protected]>

drop startupprobe from manager deployment

Signed-off-by: Oded Viner <[email protected]>
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2025

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, OdedViner


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit c0b9f21 into red-hat-storage:main Nov 4, 2025
11 checks passed