Conversation

Contributor

@vrutkovs vrutkovs commented Oct 21, 2025

Add a new CR - VMDistributedCluster - so that multiple VMClusters can be upgraded in an orchestrated fashion, ensuring the read VMAuth is disabled before the upgrade and that VMAgent (if available) doesn't have pending bytes to send.

See #1515 (comment) for agreed limitations for v1alpha1 version.

Fixes #1515

TODO:

  • Add changelog entry
  • The description-less CRD should be applied for development only. Rephrase the descriptions in existing parts to make them fit for production
  • Fix flaking tests
  • Squash commits
    Keeping original commits for review as it's useful to show how the feature was developed
  • Allow objects from other namespaces?
    During development I realized VMAuth / VMClusters may be in different namespaces. The initial version requires all objects to be in the same namespace as VMDistributedCluster. Do we want to keep it that way for API simplicity, or is it worth having the initial version support cross-namespace references?
  • Other improvements:
    • Multiple VMAgents per VMCluster - use label selector instead of name?
    • Label selector for VMClusters? Not sure how to select VMAgents then

@f41gh7 f41gh7 self-assigned this Oct 21, 2025
@AndrewChubatiuk
Contributor

I initially thought the distributed CR was needed for full distributed setup management, but it looks like it only performs version upgrades. In this case just curious why we need different CRs for VM, VT and VL?

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from eaeacd4 to c02b24c on October 21, 2025 09:01
@vrutkovs
Contributor Author

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility, IMO - and we haven't received requests for other actions so far.

In this case just curious why we need different CRs for VM, VT and VL?

VL and VT don't have agents (yet), so their specs would be different. However, we can reuse the same approach and probably even some helper functions.

Contributor

@Haleygo Haleygo left a comment

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility, IMO - and we haven't received requests for other actions so far.

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.
And since vmclusterSpec.ClusterVersion is optional, users could specify component versions inside vmclusterSpec, which would override vmclusterSpec.ClusterVersion.

And currently, it seems VMDistributedCluster only covers a limited scenario where resources like vmcluster, vmuser, and vmauth are defined and configured as needed.
Could you please provide an example of how to configure them to achieve a topology similar to the one described in the victoria-metrics-distributed chart? I expect VMDistributedCluster to be supported there when released.

if vmClusterAgentPair.VMAgent != nil {
	vmAgent = &vmAgentAdapter{VMAgent: vmClusterAgentPair.VMAgent}
}
waitForVMClusterVMAgentMetrics(ctx, httpClient, vmAgent, deadline)
Contributor

We do not need to check the vmagent persistent queue here; it should occur after updating the vmcluster and before resuming vmuser reading from the vmcluster, which ensures data integrity on the upgraded vmcluster.

Contributor Author

Oh, so I should move it after waitForVMClusterReady, not before, right?

if err != nil {
	return false, err
}
return metricValue == 0, nil
Contributor

The value of vmagent_remotewrite_pending_data_bytes may sometimes exceed 0 because it also includes in-memory data not yet flushed to remote storage. It's better to set a small threshold, such as 1e6; if the value is below this, the queue can be considered drained.

Collaborator

It's better to use the metric vm_persistentqueue_bytes_pending instead; it doesn't take into account the in-memory part of the queue.

@vrutkovs
Contributor Author

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.

Yup, setting generic overrideParams would be more flexible and, along with upgrades, would cover other maintenance tasks, e.g., adding replicas or setting flags.

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from 8945925 to 1441ebe on October 29, 2025 16:39

Development

Successfully merging this pull request may close this issue: Add support for a distributed deployment.