
vmagent scraped samples missing when scaling shards #1224

Open
xiaozongyang opened this issue Jan 20, 2025 · 7 comments
Labels
bug (Something isn't working), waiting for release (The change was merged to upstream, but wasn't released yet)

Comments


xiaozongyang commented Jan 20, 2025

Describe the bug

We deploy a 4-shard vmagent cluster (2 replicas per shard) managed by vm-operator. When I scale it to 5 shards, some samples go missing during the scaling operation.

I noticed that when I change the shardCount option of the VMAgent resource from 4 to 5:

  1. the vm-operator pod restarts
  2. vmagent-0-0, vmagent-0-1, ... vmagent-3-0, vmagent-3-1 restart, and all of these pods get the flag -promscrape.cluster.membersCount=5
  3. vmagent-4-0 and vmagent-4-1 are created and start running

In step 2, membersCount is set to 5 while only 4 shards are running, which means the samples of roughly 1/5 of the targets are missing. Here is a screenshot of vm_promscrape_scraped_samples_sum during this operation:

Image
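To see why step 2 loses roughly 1/5 of the targets, here is a minimal sketch of target-to-shard assignment, assuming plain modulo hashing over the target address (a simplification: vmagent's actual -promscrape.cluster.* assignment uses its own hashing, and the target names below are made up). Any target that hashes to memberNum 4 has no running pod to scrape it until vmagent-4-* starts:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor returns the shard (memberNum) that would scrape a target,
// assuming plain modulo hashing over the target address. Illustration
// only; vmagent's real assignment uses its own hashing scheme.
func shardFor(target string, membersCount int) int {
	h := fnv.New32a()
	h.Write([]byte(target))
	return int(h.Sum32() % uint32(membersCount))
}

func main() {
	// Hypothetical scrape targets.
	targets := []string{"app-0:8080", "app-1:8080", "app-2:8080", "app-3:8080", "app-4:8080"}
	runningShards := 4 // shards 0..3 are up; shard 4 does not exist yet

	for _, t := range targets {
		shard := shardFor(t, 5) // membersCount already bumped to 5
		if shard >= runningShards {
			fmt.Printf("%s -> shard %d: unscraped (no running pod for this shard)\n", t, shard)
		} else {
			fmt.Printf("%s -> shard %d: scraped\n", t, shard)
		}
	}
}
```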

I think the right steps should be:

  1. start vmagent-4-0 with membersCount=5 and memberNum=4
  2. restart the pods from vmagent-0-0 to vmagent-3-1 with membersCount=5

To Reproduce

  1. run a vmagent cluster managed by vm-operator
  2. change shardCount
  3. watch the changes of vm_promscrape_scraped_samples_sum

Version

  • vm-operator version v0.42.4
  • vmagent version vmagent-20240301-013655-tags-v1.99.0-0-g9cd4b0537

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

xiaozongyang added the bug label on Jan 20, 2025
xiaozongyang (Author) commented:

ping @hagen1778 @Amper


jiekun commented Jan 20, 2025

Hello there! Thank you for the report. hagen1778 and Amper are not responsible for the operator; I believe our maintainers will look into this bug report soon!


jiekun commented Jan 20, 2025

I'm new to the operator codebase, but @f41gh7, do you think changing the sequence here could work?

for shardNum := 0; shardNum < shardsCount; shardNum++ {
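For reference, here is a minimal sketch of the sequence change being discussed, using a hypothetical reconcileShard helper rather than the actual operator code: on upscaling, iterate from the highest shard number down, so new shards are running before the existing ones are restarted with the new membersCount.

```go
// Hypothetical sketch only; the real operator reconcile loop differs.
// reconcileShard is an assumed helper that rolls out the vmagent
// deployment/statefulset for the given shard number.
func rolloutShards(oldCount, newCount int, reconcileShard func(shardNum int) error) error {
	if newCount > oldCount {
		// Upscaling: start from the highest shard so the new shards are
		// running before existing shards pick up the new membersCount.
		for shardNum := newCount - 1; shardNum >= 0; shardNum-- {
			if err := reconcileShard(shardNum); err != nil {
				return err
			}
		}
		return nil
	}
	// Downscaling or no change: keep the original ascending order.
	for shardNum := 0; shardNum < newCount; shardNum++ {
		if err := reconcileShard(shardNum); err != nil {
			return err
		}
	}
	return nil
}
```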

zekker6 transferred this issue from VictoriaMetrics/VictoriaMetrics on Jan 20, 2025

zekker6 commented Jan 20, 2025

Transferred this issue to the operator repository as it is not related to VictoriaMetrics itself.


f41gh7 commented Jan 20, 2025

Hello,

Changing the pod update sequence on upscaling and downscaling should reduce the amount of lost metrics, but it cannot prevent it entirely. A possible solution could be deploying a fresh set of statefulsets/deployments, but that may lead to persistent-queue data loss if the persistent queues weren't empty during the upgrade.

Overall, I think we should check for shard upscaling and use an alternative update sequence for it.

xiaozongyang (Author) commented:

@f41gh7 Maybe we could lose less data with the following changes:

  1. For upscaling, start the new shard first, then restart the existing shards with the changed membersCount.
  2. For downscaling, restart the remaining shards with the changed membersCount, then delete the redundant shards.
  3. For termination, set a long enough graceful-shutdown period so vmagent can flush its persistent-queue data to vminsert.

> Overall, I think we should check for shard upscaling and use an alternative update sequence for it.

Will the operator do this work in the future?


f41gh7 commented Jan 31, 2025

> @f41gh7 Maybe we could lose less data with the following changes:
>
> 1. For upscaling, start the new shard first, then restart the existing shards with the changed membersCount.
> 2. For downscaling, restart the remaining shards with the changed membersCount, then delete the redundant shards.
> 3. For termination, set a long enough graceful-shutdown period so vmagent can flush its persistent-queue data to vminsert.
>
> Overall, I think we should check for shard upscaling and use an alternative update sequence for it.
> Will the operator do this work in the future?

We'll add it to our roadmap. Most probably it'll be part of the next release.

f41gh7 added a commit that referenced this issue Jan 31, 2025
Previously, the operator performed the new configuration rollout starting with shard 0.
This could lead to small data loss, since the number of shards changed and some targets were scheduled for
higher shard nums.

This commit changes the order of shard processing. In case of upscaling, the rollout now starts from the max shard num.
For instance, if shardCount changed from 3 to 6, the operator will roll out the change to 6,5,4,3,2,1 (previously it was 1,2,3,4,5,6).

Related issue:
#1224

Signed-off-by: f41gh7 <[email protected]>
f41gh7 added the waiting for release label on Jan 31, 2025