The operator often breaks the vmstorage cluster while updating its container image #1227
Comments
Hello, thanks for reporting! It'd be great to know the current version of the operator, and the operator logs should indeed help to investigate it further. I think the bug could be related to the rollingUpdate implementation: https://github.com/VictoriaMetrics/operator/blob/master/internal/controller/operator/factory/reconcile/statefulset.go#L182 . A possible workaround for now is to set …
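For illustration, here is a minimal sketch of the kind of manual rolling update the linked statefulset.go performs: pods whose revision label is stale are deleted one at a time, waiting for each replacement to become Ready before moving on. This is not the operator's actual code; it assumes client-go and the upstream Kubernetes API types, and the function names are illustrative.

```go
// rollout.go — illustrative sketch only, not the operator's implementation.
package rollout

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rollingUpdatePods deletes stale pods of the StatefulSet one by one.
func rollingUpdatePods(ctx context.Context, c kubernetes.Interface, sts *appsv1.StatefulSet) error {
	pods, err := c.CoreV1().Pods(sts.Namespace).List(ctx, metav1.ListOptions{
		LabelSelector: metav1.FormatLabelSelector(sts.Spec.Selector),
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// Pods already on the desired revision don't need a restart.
		if pod.Labels[appsv1.ControllerRevisionHashLabelKey] == sts.Status.UpdateRevision {
			continue
		}
		if err := c.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
		// Wait for the controller-manager to re-create the pod and for the
		// replacement to report Ready before deleting the next one.
		if err := waitPodReady(ctx, c, pod.Namespace, pod.Name); err != nil {
			return err
		}
	}
	return nil
}

// waitPodReady polls until the named pod exists and has the Ready condition.
func waitPodReady(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	for {
		pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
					return nil
				}
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for pod %s/%s: %w", ns, name, ctx.Err())
		case <-time.After(2 * time.Second):
		}
	}
}
```

Note that a loop like this trusts the revision label of the replacement pod: if the pod comes back carrying the old revision, the loop would still treat it as updated, which matches the failure mode discussed below.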
Oops, sorry.
20250124-vm-operator.log.txt
Additional information found just now: when I used operator v0.42.2 (the version I ran before v0.49.1) and updated VictoriaMetrics from v1.100.1 to v1.102.0, the problem above did not occur, i.e. only one vmstorage Pod was down at a time.
Thanks for the logs and configuration examples. Also, I forgot to ask: which version does your Kubernetes cluster run? The operator uses the StatefulSet's update revision to verify that re-created pods are running the new spec. According to the given log line:
The operator expects the re-created pod to carry the updated revision hash, but the pod comes back with the old one, and that causes the issue. Maybe it's related to some delay inside the Kubernetes controller-manager. For now, I think we should add a check: if a pod was re-created with an undesired revision, fail fast and return an error, then retry it on the next cycle.
Previously, the operator didn't check the pod revision-hash value after Pod re-creation during a StatefulSet rolling upgrade. It's the responsibility of the Kubernetes controller-manager to set the proper revision hash based on the Revision of the StatefulSet. But sometimes it may fail, due to network delays or performance issues at the controller-manager. This commit adds a revision check and fails fast in this case. It also removes the StatefulSet update of the CurrentRevision field; it must be handled by the controller-manager, and the operator shouldn't change it.
Related issue: #1227
Signed-off-by: f41gh7 <[email protected]>
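A minimal sketch of the fail-fast check the commit describes, under the same client-go assumptions as the sketch above: after a pod is re-created, its revision label is compared against the StatefulSet's update revision, and a mismatch is returned as an error so the reconcile loop can retry on the next cycle. The function name is illustrative, not the operator's.

```go
// Illustrative sketch of the fail-fast revision check, not the actual fix.
package rollout

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkPodRevision fails if a re-created pod carries a stale revision hash.
func checkPodRevision(ctx context.Context, c kubernetes.Interface, sts *appsv1.StatefulSet, podName string) error {
	pod, err := c.CoreV1().Pods(sts.Namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	got := pod.Labels[appsv1.ControllerRevisionHashLabelKey]
	if got != sts.Status.UpdateRevision {
		// The controller-manager re-created the pod with a stale revision
		// (e.g. because of delays); stop the rollout instead of proceeding
		// and counting this pod as updated.
		return fmt.Errorf("pod %s has revision %q, want %q; will retry on next reconcile",
			pod.Name, got, sts.Status.UpdateRevision)
	}
	return nil
}
```

A caller in the rolling-update loop would invoke this right after waiting for the replacement pod to become Ready and abort the rollout on error, leaving the retry to the next reconcile cycle.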
I'm using k8s v1.29.7.
Bugfix is included in the v0.53.0 release.
After I restart the operator to update VictoriaMetrics components by giving it new environment variables such as VM_VMCLUSTERDEFAULT_VMSTORAGEDEFAULT_VERSION, the vmstorage cluster often breaks, i.e. too many Pods are down during the update. During the update, vmstorage containers often restart and may be re-created several times. It seems that the operator restarts vmstorages quickly (at about 7 or 8 second intervals) but the vmstorages terminate (or crash) soon after.
The vmstorage clusters in my production k8s clusters often break as described above, but the one in my development k8s cluster does not. They differ in the number of replicas (10 vs 5) and the size of the PVs they use (5 TiB vs 4 TiB). The size of the clusters (the number of nodes) also differs greatly.
Should I upload logs of the operator and vmstorages?