Add liveness probe to OCS Operator #3449
Conversation
Force-pushed b374b75 to 93cadc4
Force-pushed b52ed49 to 866666e
Force-pushed 866666e to 7b12b63
/hold as requirements are pending discussion
Force-pushed 7b12b63 to d6114e4
Force-pushed d6114e4 to 5b51e7a
Force-pushed 5b51e7a to 4230e4c
Force-pushed 4230e4c to 68ac93e
/test ocs-operator-bundle-e2e-aws
config/manager/manager.yaml (outdated diff context)

    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
    startupProbe:
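For context, the probe configuration under review plausibly looks like the sketch below, reconstructed from the values inspected and restored later in this test log. The container port number is an assumption (controller-runtime's default health-probe bind address is :8081), so treat this as illustrative rather than the exact manifest.

    # Sketch of the manager container's probe block; probe values are taken
    # from the CSV inspect/restore steps below, port number is assumed.
    ports:
    - containerPort: 8081
      name: healthz
    livenessProbe:
      httpGet:
        path: /healthz
        port: healthz
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz
        port: healthz
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3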
Why do we need this? It is the same as the livenessProbe, and you already set initialDelaySeconds on the livenessProbe. Moreover, startup probes are for slow-starting containers, and ocs-operator is not a slow-starting container.
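For reference, the kind of startupProbe being questioned would look roughly like this sketch (hypothetical values mirroring the livenessProbe). While a startupProbe is failing, the kubelet holds off the liveness and readiness probes, which only helps genuinely slow-starting containers:

    # Hypothetical startupProbe sketch (the PR later dropped it). Until it
    # succeeds, liveness/readiness checks are suspended, giving a slow
    # starter up to failureThreshold * periodSeconds to come up.
    startupProbe:
      httpGet:
        path: /healthz
        port: healthz
      periodSeconds: 10
      failureThreshold: 3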
@iamniting done 4d62462
Do not introduce a new commit; please make the changes in the first commit itself. It does not make sense to delete something in a later commit that was introduced in the previous commit of the same PR.
@iamniting I agree with you. I tested the first commit on an OCP cluster and verified that it works as expected with the private image. I just wanted to keep the working version from the first commit for now, and once I retest, I’ll combine the two commits into one.
Test passed on IBM Cloud OCP 4.20.0-0.nightly-2025-10-30-114955.
1. Build and push the private images:
export REGISTRY_NAMESPACE=oviner
export IMAGE_TAG=liveness-nov3
make ocs-operator
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator:$IMAGE_TAG
make ocs-metrics-exporter
podman push quay.io/$REGISTRY_NAMESPACE/ocs-metrics-exporter:$IMAGE_TAG
make operator-bundle
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-bundle:$IMAGE_TAG
make operator-catalog
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-catalog:$IMAGE_TAG
oc label nodes oviner5-ocs-jt5gb-worker-1-hbd44 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes oviner5-ocs-jt5gb-worker-2-6mv5s cluster.ocs.openshift.io/openshift-storage=''
oc label nodes oviner5-ocs-jt5gb-worker-3-lxg9w cluster.ocs.openshift.io/openshift-storage=''
make install
2. Check the operator:
$ oc get csv -n openshift-storage
NAME DISPLAY VERSION REPLACES PHASE
cephcsi-operator.v4.21.0 CephCSI operator 4.21.0 Succeeded
csi-addons.v0.12.0 CSI Addons 0.12.0 Succeeded
noobaa-operator.v5.20.0 NooBaa Operator 5.20.0 Succeeded
ocs-client-operator.v4.21.0 OpenShift Data Foundation Client 4.21.0 Succeeded
ocs-operator.v4.21.0 OpenShift Container Storage 4.21.0 Succeeded
odf-external-snapshotter-operator.v4.20.0 Snapshot Controller 4.20.0 Succeeded
recipe.v0.0.1 Recipe 0.0.1 Succeeded
rook-ceph-operator.v4.20.0 Rook-Ceph 4.20.0 Succeeded
$ oc get pod -n openshift-storage ocs-operator-7d6d9497df-hdnns -o yaml| grep oviner
containerImage: quay.io/oviner/ocs-operator:liveness-nov3
value: quay.io/oviner/ocs-metrics-exporter:liveness-nov3
value: quay.io/oviner/ocs-operator:liveness-nov3
value: quay.io/oviner/ocs-operator:liveness-nov3
image: quay.io/oviner/ocs-operator:liveness-nov3
nodeName: oviner2-ocs-622qc-worker-3-x2smh
3. Inspect the current livenessProbe value in the CSV:
$ oc -n openshift-storage get csv ocs-operator.v4.21.0 -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].livenessProbe.httpGet.path}{"\n"}'
/healthz
4. Patch the CSV to force a liveness failure:
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/httpGet/path","value":"/bad"},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":5},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":5},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":1}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched
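With the patched values in effect (sketched below), a single failed GET on /bad is enough to restart the container roughly every probe period, which matches the restart cadence in the watch output of the next step:

    # Effective livenessProbe after the patch above (sketch): one failed
    # probe (failureThreshold: 1) every 5s (periodSeconds) is enough for
    # the kubelet to kill and restart the container.
    livenessProbe:
      httpGet:
        path: /bad
        port: healthz
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 1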
5. Verify the Deployment picked up the change and watch the restarts:
$ oc -n openshift-storage get pod -l name=ocs-operator -w
NAME READY STATUS RESTARTS AGE
ocs-operator-7f8fcf75c-s7nj5 0/1 Running 1 (9s ago) 19s
ocs-operator-7f8fcf75c-s7nj5 1/1 Running 1 (10s ago) 20s
ocs-operator-7f8fcf75c-s7nj5 0/1 Running 2 (2s ago) 22s
ocs-operator-7f8fcf75c-s7nj5 1/1 Running 2 (10s ago) 30s
ocs-operator-7f8fcf75c-s7nj5 0/1 Running 3 (2s ago) 32s
ocs-operator-7f8fcf75c-s7nj5 1/1 Running 3 (10s ago) 40s
ocs-operator-7f8fcf75c-s7nj5 0/1 CrashLoopBackOff 3 (1s ago) 41s
6. Restore the CSV to the good settings:
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/httpGet/path","value":"/healthz"},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/httpGet/port","value":"healthz"},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":15},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":10},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":3}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched
7. Check the ocs-operator pod status:
$ oc -n openshift-storage get pod -l name=ocs-operator
NAME READY STATUS RESTARTS AGE
ocs-operator-7d6d9497df-wqd5x 1/1 Running 0 101s
8. Inspect the current readinessProbe in the CSV:
$ oc -n openshift-storage get csv ocs-operator.v4.21.0 \
-o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].readinessProbe.httpGet.path}{"\n"}'
/readyz
$ oc -n openshift-storage get csv ocs-operator.v4.21.0 \
-o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].readinessProbe.httpGet.port}{"\n"}'
healthz
9. Patch the CSV to force a readiness failure:
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/bad-ready"},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":5},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/periodSeconds","value":5},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/failureThreshold","value":1}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched
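The effective probe now looks like the sketch below. Unlike a liveness failure, a readiness failure only marks the pod NotReady and removes it from Service endpoints, so no restart is expected:

    # Effective readinessProbe after the patch above (sketch): failures make
    # the pod NotReady (0/1) but never restart the container.
    readinessProbe:
      httpGet:
        path: /bad-ready
        port: healthz
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 1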
10. Verify the pod becomes NotReady but does not restart:
$ oc -n openshift-storage get pod -l name=ocs-operator
NAME READY STATUS RESTARTS AGE
ocs-operator-86d4c9f999-fpb62 0/1 Running 0 2m24s
11. Restore the readinessProbe to the good settings:
$ oc -n openshift-storage patch csv ocs-operator.v4.21.0 --type=json -p='[
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/readyz"},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/httpGet/port","value":"healthz"},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":5},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/periodSeconds","value":10},
{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/readinessProbe/failureThreshold","value":3}
]'
clusterserviceversion.operators.coreos.com/ocs-operator.v4.21.0 patched
12. Confirm the pod is Ready again (takes about 3 minutes):
$ oc -n openshift-storage get pod -l name=ocs-operator
NAME READY STATUS RESTARTS AGE
ocs-operator-7d6d9497df-qn5l6 1/1 Running 0 17s
Force-pushed 4d62462 to 6320b65
Commits:
- Register /healthz endpoint in main.go; add liveness, readiness, and startup probes to the manager Deployment (Signed-off-by: Oded Viner <[email protected]>)
- Drop startupProbe from the manager Deployment (Signed-off-by: Oded Viner <[email protected]>)
Force-pushed 6320b65 to ab79f7f
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: iamniting, OdedViner. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Merged commit c0b9f21 into red-hat-storage:main
https://issues.redhat.com/browse/RHSTOR-7587
Tested private image; test passed on IBM Cloud OCP 4.20.0-0.nightly-2025-10-30-114955 (full test steps in the conversation above).