Conversation

Contributor

@vrutkovs vrutkovs commented Oct 21, 2025

Add a new CR - VMDistributedCluster - so that multiple VMClusters can be upgraded in an orchestrated fashion, ensuring the read VMAuth is disabled before the upgrade and that VMAgent (if available) doesn't have pending bytes to send.

See #1515 (comment) for agreed limitations for v1alpha1 version.

Fixes #1515

TODO:

  • Add changelog entry
  • The description-less CRD should be applied for development only. Rephrase the descriptions in existing parts to make them fit for production
  • Fix flaking tests
  • Squash commits
    Keeping original commits for review as it's useful to show how the feature was developed
  • Allow objects from other namespaces?
    During development I realized VMAuth / VMClusters may be in different namespaces. The initial version requires all objects to be in the same namespace as VMDistributedCluster. Do we want to keep it that way for API simplicity, or is it worth having the initial version support cross-namespace references?
  • Other improvements:
    • Multiple VMAgents per VMCluster - use label selector instead of name?
    • Label selector for VMClusters? Not sure how to select VMAgents then

@f41gh7 f41gh7 self-assigned this Oct 21, 2025
@AndrewChubatiuk
Contributor

I initially thought the distributed CR was needed for full distributed setup management, but it looks like it only performs version upgrades. In this case just curious why we need different CRs for VM, VT and VL?

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from eaeacd4 to c02b24c on October 21, 2025 09:01
@vrutkovs
Contributor Author

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility, IMO - and we haven't received requests for other actions so far.

In this case just curious why we need different CRs for VM, VT and VL?

VL and VT don't have agents (yet), so their specs would be different. However, we can reuse the same approach and probably even some helper functions.

Contributor

@Haleygo Haleygo left a comment

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility, IMO - and we haven't received requests for other actions so far.

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.
And since vmclusterSpec.ClusterVersion is optional, users could specify component versions inside vmclusterSpec, which would override vmclusterSpec.ClusterVersion.

And currently, it seems VMDistributedCluster only covers a limited scenario where resources like vmcluster, vmuser, and vmauth are defined and configured as needed.
Could you please provide an example of how to configure them to achieve a topology similar to the one described in the victoria-metrics-distributed chart? I expect VMDistributedCluster to be supported there when released.

if vmClusterAgentPair.VMAgent != nil {
	vmAgent = &vmAgentAdapter{VMAgent: vmClusterAgentPair.VMAgent}
}
waitForVMClusterVMAgentMetrics(ctx, httpClient, vmAgent, deadline)
Contributor

We do not need to check the vmagent persistent queue here; it should occur after updating the vmcluster and before resuming vmuser reading from the vmcluster, which ensures data integrity on the upgraded vmcluster.

Contributor Author

Oh, so I should move it after waitForVMClusterReady, not before, right?

if err != nil {
	return false, err
}
return metricValue == 0, nil
Contributor

The value of vmagent_remotewrite_pending_data_bytes may sometimes exceed 0 because it also includes in-memory data not yet flushed to remote storage. It's better to set a small threshold, such as 1e6; if the value is below this, the queue can be considered drained.

Collaborator

It's better to use the metric vm_persistentqueue_bytes_pending instead; it doesn't take into account the in-memory part of the queue.

@vrutkovs
Contributor Author

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.

Yup, setting generic overrideParams would be more flexible and, along with upgrades, would cover other maintenance tasks, e.g., adding replicas or setting flags.

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from 8945925 to 1441ebe on October 29, 2025 16:39

Development

Successfully merging this pull request may close this issue: Add support for a distributed deployment.