
The operator often breaks the vmstorage cluster while updating its container image #1227

Open
umezawatakeshi opened this issue Jan 24, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@umezawatakeshi
Contributor

After I restart the operator to update the VictoriaMetrics components by setting new environment variables such as VM_VMCLUSTERDEFAULT_VMSTORAGEDEFAULT_VERSION, the vmstorage cluster often breaks, i.e. too many Pods are down during the update. During the update, vmstorage containers often restart and may be recreated several times. It seems that the operator restarts vmstorage Pods quickly (at roughly 7-8 second intervals), but the vmstorage processes terminate (or crash) soon after.
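For reference, a minimal sketch of how such a default version reaches the operator, as a fragment of the operator's Deployment manifest (the container name and value here are illustrative placeholders, not copied from my actual setup):

```yaml
# Fragment of the operator Deployment (names and values are illustrative).
spec:
  template:
    spec:
      containers:
        - name: vm-operator
          env:
            # Default image tag the operator applies to vmstorage in managed VMClusters.
            - name: VM_VMCLUSTERDEFAULT_VMSTORAGEDEFAULT_VERSION
              value: v1.109.1
```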

(screenshot attached)

The vmstorage clusters in my production k8s clusters often break as described above, but the one in my development k8s cluster does not. They differ in the number of replicas (10 vs. 5) and the size of the PVs they use (5 TiB vs. 4 TiB). The sizes of the clusters (the number of nodes) also differ greatly.

Should I upload logs of the operator and vmstorages?

@f41gh7 f41gh7 added the bug Something isn't working label Jan 24, 2025
@f41gh7
Collaborator

f41gh7 commented Jan 24, 2025

Hello, thanks for reporting! It'd be great to know the current version of the operator. And indeed, operator logs should help to investigate this further. I think the bug could be related to the rollingUpdate implementation: https://github.com/VictoriaMetrics/operator/blob/master/internal/controller/operator/factory/reconcile/statefulset.go#L182

A possible workaround for now: set spec.vmstorage.rollingUpdateStrategy: RollingUpdate. It disables the operator's implementation of StatefulSet updates and uses the Kubernetes controller-manager for that.
It has some downsides; for example, manual intervention may be needed if the StatefulSet rollout gets stuck due to a misconfiguration.
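For illustration, a minimal sketch of where that field sits in a VMCluster manifest (the metadata and replica count are placeholders, not a complete spec):

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: vmcluster-largeset   # placeholder name
spec:
  vmstorage:
    replicaCount: 10
    # Delegate StatefulSet rollouts to the Kubernetes controller-manager
    # instead of the operator's own pod-by-pod update logic.
    rollingUpdateStrategy: RollingUpdate
```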

@umezawatakeshi
Contributor Author

It'd be great to know the current version of the operator

Oops, sorry.

  • operator: 0.49.1
  • VictoriaMetrics components: 1.109.0 -> 1.109.1 (this time)

And indeed, operator logs should help to investigate this further

20250124-vm-operator.log.txt
20250124-vmstorages.log.txt
20250124-vm-operator.yaml.txt
20250124-vmcluster.yaml.txt

@umezawatakeshi
Contributor Author

umezawatakeshi commented Jan 27, 2025

Additional information I just found: when I used operator v0.42.2, which I ran before v0.49.1, and updated VictoriaMetrics from v1.100.1 to v1.102.0, the problem above did not occur, i.e. only one vmstorage Pod was down at a time.

@f41gh7
Collaborator

f41gh7 commented Jan 30, 2025

Thanks for the logs and configuration examples. Also, I forgot to ask: which version does your Kubernetes cluster run on?

The operator uses the revision field of StatefulSet.status to determine whether a Pod should be updated or not. The Kubernetes controller-manager adds the StatefulSet's revision to pod.labels[controller-revision-hash]; it serves as the link between those two objects.

If the StatefulSet revision and the Pod revision mismatch, the operator deletes the Pod and waits for its re-creation.
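A minimal sketch of that comparison in Go, using the upstream API types (this is illustrative, not the operator's actual code; the function name is made up):

```go
package reconcile

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// podMatchesSTSRevision reports whether the Pod was created from the
// StatefulSet's desired update revision. The controller-manager stamps
// each Pod with the revision it was built from via the
// "controller-revision-hash" label (appsv1.StatefulSetRevisionLabel).
func podMatchesSTSRevision(sts *appsv1.StatefulSet, pod *corev1.Pod) bool {
	return pod.Labels[appsv1.StatefulSetRevisionLabel] == sts.Status.UpdateRevision
}
```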

According to the given log line:

check with updated but not ready pods - "desiredVersion":"vmstorage-vmcluster-largeset-55d88f8f"

The operator expects the Pod to have this revision. But in fact the Pod has a different revision after re-creation, added by the controller-manager:

pod update finished with revision vmstorage-vmcluster-largeset-7c6fdc9c46

And this causes the issue. Maybe it's related to some delay inside the Kubernetes controller-manager: it updates the state of the StatefulSet with a delay relative to the api-server.

For now, I think we should add a check: if the Pod was re-created with an undesired revision, fail fast and return an error, then retry on the next cycle.
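Sketching that fail-fast idea on top of podMatchesSTSRevision from the sketch above (again illustrative, not the actual patch; add fmt to the imports, and the function name and error wording are made up):

```go
// checkRecreatedPod fails fast when a re-created Pod carries a stale
// revision, instead of waiting on it indefinitely; the operator's next
// reconcile cycle retries the update.
func checkRecreatedPod(sts *appsv1.StatefulSet, pod *corev1.Pod) error {
	if !podMatchesSTSRevision(sts, pod) {
		return fmt.Errorf("pod %s was re-created with revision %q, want %q",
			pod.Name, pod.Labels[appsv1.StatefulSetRevisionLabel], sts.Status.UpdateRevision)
	}
	return nil
}
```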

f41gh7 added a commit that referenced this issue Jan 30, 2025
Previously, the operator didn't check the pod revision-hash value after Pod re-creation during a StatefulSet rolling upgrade.
It's the responsibility of the Kubernetes controller-manager to set the proper revision-hash based on the Revision of the StatefulSet.
But sometimes it may fail, due to network delays or performance issues at the controller-manager.

This commit adds a revision check and fails fast in this case.
It also removes the StatefulSet update of the CurrentRevision field. It must be handled by the controller-manager, and the operator shouldn't change it.

Related issue:
#1227

Signed-off-by: f41gh7 <[email protected]>
@f41gh7 f41gh7 added the waiting for release The change was merged to upstream, but wasn't released yet. label Jan 30, 2025
@umezawatakeshi
Contributor Author

Also, I forgot to ask: which version does your Kubernetes cluster run on?

I'm using k8s v1.29.7.

@f41gh7
Collaborator

f41gh7 commented Feb 6, 2025

The bugfix is included in the v0.53.0 release.

@f41gh7 f41gh7 removed the waiting for release The change was merged to upstream, but wasn't released yet. label Feb 6, 2025