
vmagent scraped samples missing when scaling shards #1224

Open
xiaozongyang opened this issue Jan 20, 2025 · 7 comments
Labels
bug (Something isn't working), waiting for release (The change was merged to upstream, but wasn't released yet)

Comments


xiaozongyang commented Jan 20, 2025

Describe the bug

We deploy a 4-shard vmagent cluster (2 replicas per shard) managed by vm-operator. When I scale it to 5 shards, some samples go missing during the scaling operation.

I noticed that when I change the shardCount option of the VMAgent resource from 4 to 5:

  1. the vm-operator pod restarts
  2. vmagent-0-0, vmagent-0-1, ... vmagent-3-0, vmagent-3-1 restart, and all of these pods get the flag -promscrape.cluster.membersCount=5
  3. vmagent-4-0 and vmagent-4-1 are created and start running

In step 2, membersCount is set to 5 while only 4 shards are running, which means the samples of roughly 1/5 of the targets are missing. Here is a screenshot of vm_promscrape_scraped_samples_sum during this operation:

Image
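To see why step 2 loses roughly 1/5 of the targets, here is a minimal sketch of target-to-shard assignment, assuming plain modulo hashing over the target address (a simplification: vmagent's actual -promscrape.cluster.* assignment uses its own hashing, and the target names below are made up). Any target that hashes to memberNum 4 has no running pod to scrape it until vmagent-4-* starts:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor returns the shard (memberNum) that would scrape a target,
// assuming plain modulo hashing over the target address. Illustration
// only; vmagent's real assignment uses its own hashing scheme.
func shardFor(target string, membersCount int) int {
	h := fnv.New32a()
	h.Write([]byte(target))
	return int(h.Sum32() % uint32(membersCount))
}

func main() {
	// Hypothetical scrape targets.
	targets := []string{"app-0:8080", "app-1:8080", "app-2:8080", "app-3:8080", "app-4:8080"}
	runningShards := 4 // shards 0..3 are up; shard 4 does not exist yet

	for _, t := range targets {
		shard := shardFor(t, 5) // membersCount already bumped to 5
		if shard >= runningShards {
			fmt.Printf("%s -> shard %d: unscraped (no running pod for this shard)\n", t, shard)
		} else {
			fmt.Printf("%s -> shard %d: scraped\n", t, shard)
		}
	}
}
```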

I think the right steps should be:

  1. start vmagent-4-0 with membersCount=5 and memberNum=4
  2. restart the pods from vmagent-0-0 to vmagent-3-1 with membersCount=5

To Reproduce

  1. run a vmagent cluster managed by vm-operator
  2. change shardCount
  3. watch the changes of vm_promscrape_scraped_samples_sum

Version

  • vm-operator version v0.42.4
  • vmagent version vmagent-20240301-013655-tags-v1.99.0-0-g9cd4b0537

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

xiaozongyang added the bug label on Jan 20, 2025
xiaozongyang (Author) commented:

ping @hagen1778 @Amper


jiekun commented Jan 20, 2025

Hello there! Thank you for the report. hagen1778 and Amper are not responsible for the operator; I believe our maintainers will look into this bug report soon!


jiekun commented Jan 20, 2025

I'm new to the operator codebase, but @f41gh7, do you think changing the sequence here could work?

for shardNum := 0; shardNum < shardsCount; shardNum++ {
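For reference, here is a minimal sketch of the sequence change being discussed, using a hypothetical reconcileShard helper rather than the actual operator code: on upscaling, iterate from the highest shard number down, so new shards are running before the existing ones are restarted with the new membersCount.

```go
// Hypothetical sketch only; the real operator reconcile loop differs.
// reconcileShard is an assumed helper that rolls out the vmagent
// deployment/statefulset for the given shard number.
func rolloutShards(oldCount, newCount int, reconcileShard func(shardNum int) error) error {
	if newCount > oldCount {
		// Upscaling: start from the highest shard so the new shards are
		// running before existing shards pick up the new membersCount.
		for shardNum := newCount - 1; shardNum >= 0; shardNum-- {
			if err := reconcileShard(shardNum); err != nil {
				return err
			}
		}
		return nil
	}
	// Downscaling or no change: keep the original ascending order.
	for shardNum := 0; shardNum < newCount; shardNum++ {
		if err := reconcileShard(shardNum); err != nil {
			return err
		}
	}
	return nil
}
```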

zekker6 transferred this issue from VictoriaMetrics/VictoriaMetrics on Jan 20, 2025

zekker6 commented Jan 20, 2025

Transferred this issue to the operator repository as it is not related to VictoriaMetrics itself.


f41gh7 commented Jan 20, 2025

Hello,

Changing the pod update sequence on upscaling and downscaling should reduce the amount of lost metrics, but it cannot prevent it entirely. A possible solution could be deploying a fresh set of statefulsets/deployments, but that may lead to persistent-queue data loss if the persistent queues weren't empty during the upgrade.

Overall, I think we should check for shard upscaling and use an alternative update sequence for it.

xiaozongyang (Author) commented:

@f41gh7 Maybe we could lose less data with the following changes:

  1. For upscaling, start the new shard first, then restart the existing shards with the changed membersCount.
  2. For downscaling, restart the remaining shards with the changed membersCount, then delete the redundant shards.
  3. For termination, set a long enough graceful-shutdown period so vmagent can flush its persistent-queue data to vminsert.

> Overall, I think we should check for shard upscaling and use an alternative update sequence for it.

Will the operator do this work in the future?


f41gh7 commented Jan 31, 2025

> @f41gh7 Maybe we could lose less data with the following changes:
>
> 1. For upscaling, start the new shard first, then restart the existing shards with the changed membersCount.
> 2. For downscaling, restart the remaining shards with the changed membersCount, then delete the redundant shards.
> 3. For termination, set a long enough graceful-shutdown period so vmagent can flush its persistent-queue data to vminsert.
>
> Overall, I think we should check for shard upscaling and use an alternative update sequence for it.
> Will the operator do this work in the future?

We'll add it to our roadmap. Most probably it'll be part of the next release.

f41gh7 added a commit that referenced this issue Jan 31, 2025
Previously, the operator performed the new configuration rollout starting with shard 0.
This could lead to small data loss, since the number of shards changed and some targets were scheduled for
higher shard nums.

This commit changes the order of shard processing. In case of upscaling, the rollout now starts from the max shard num.
For instance, if shardCount changed from 3 to 6, the operator will roll out the change to 6,5,4,3,2,1 (previously it was 1,2,3,4,5,6).

Related issue:
#1224

Signed-off-by: f41gh7 <[email protected]>
f41gh7 added the waiting for release label on Jan 31, 2025