vmagent scraped samples missing when scaling shards #1224
Comments
ping @hagen1778 @Amper |
Hello there! Thank you for the report. hagen1778 and Amper are not responsible for the operator and I believe our maintainers will look into this bug report soon! |
I'm new to the operator code base but @f41gh7 do you think changing the sequence here could work?
|
Transferred this issue to the operator repository as it is not related to VictoriaMetrics itself. |
Hello. Changing the pod update sequence on upscaling and downscaling should reduce the amount of lost metrics, but it cannot prevent the loss entirely. A possible solution could be deploying a fresh set of [...]. Overall, I think we should check for shard upscaling and use an alternative update sequence for it. |
@f41gh7 Maybe we could lose less data with the following changes
|
We'll add it to our road map. And most probably it'll be a part of the next release |
Previously, the operator rolled out a new configuration starting with shard 0. This could lead to a small data loss, since the number of shards had changed and some targets were scheduled for higher shard numbers. This commit changes the order of shard processing. Now, in case of upscaling, the rollout starts from the max shard number. For instance, if shardCount changed from 3 to 6, the operator will roll out the change to 6,5,4,3,2,1 (previously it was 1,2,3,4,5,6). Related issue: #1224 Signed-off-by: f41gh7 <[email protected]>
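The ordering rule in the commit message above can be sketched in a few lines of Go. This is a minimal illustration, not the operator's actual code: `rolloutOrder` is a hypothetical helper, shards are assumed to be numbered from 0 (matching pod names such as vmagent-0-0, whereas the commit message counts from 1), and keeping the ascending order for anything other than upscaling is an assumption.

```go
package main

import "fmt"

// rolloutOrder returns the shard indexes in the order they would be updated.
// Hypothetical helper: shards are assumed to be numbered 0..newCount-1.
func rolloutOrder(oldCount, newCount int) []int {
	order := make([]int, 0, newCount)
	if newCount > oldCount {
		// Upscaling: start from the highest shard, so the new shards are
		// running before the existing ones are reconfigured.
		for i := newCount - 1; i >= 0; i-- {
			order = append(order, i)
		}
		return order
	}
	// Assumption: in every other case the ascending order is kept.
	for i := 0; i < newCount; i++ {
		order = append(order, i)
	}
	return order
}

func main() {
	fmt.Println(rolloutOrder(3, 6)) // [5 4 3 2 1 0]
	fmt.Println(rolloutOrder(3, 3)) // [0 1 2]
}
```

With this order, the shards that did not exist before (3, 4 and 5 in the example) come up first, so the targets assigned to them already have an owner when the older shards switch to the new shard count.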
Describe the bug
We deploy a 4-shard vmagent cluster (2 replicas per shard) managed by vm-operator. When I scale it to 5 shards, some samples go missing during the scaling operation.
I noticed that when I change the `shardCount` option of the `VMAgent` resource from `4` to `5`:

1. The `vm-operator` pod restarted.
2. `vmagent-0-0` restarted, then `vmagent-0-1`, ... `vmagent-3-0`, `vmagent-3-1`, and on all of these pods the flag `-promscrape.cluster.membersCount` was set to `5`.
3. `vmagent-4-0` and `vmagent-4-1` were created and started running.

In step 2, `membersCount` was set to 5 but only 4 shards were running, which means that samples from 1/5 of the targets are missing. Here is the screenshot of `vm_promscrape_scraped_samples_sum` during this operation:

I think the right steps should be:

1. `vmagent-4-0` with `membersCount=5` and `membersShard=4`
2. `vmagent-0-0` to `vmagent-3-1` with `membersCount=5`
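The gap described above, and the effect of the proposed order, can be illustrated with a small simulation. This is a sketch under stated assumptions: `shardOf` uses an FNV hash modulo `membersCount` as a simplified stand-in for vmagent's real target distribution, `unscraped` is a hypothetical helper, the target names are made up, and replicas are ignored.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardOf assigns a target to a shard by hashing its name modulo membersCount.
// Assumption: a simplified stand-in for vmagent's real target distribution.
func shardOf(target string, membersCount int) int {
	h := fnv.New32a()
	h.Write([]byte(target))
	return int(h.Sum32() % uint32(membersCount))
}

// unscraped counts the targets that no running shard claims.
// running maps a shard number to the membersCount that shard currently runs with.
func unscraped(targets []string, running map[int]int) int {
	missing := 0
	for _, t := range targets {
		claimed := false
		for shard, count := range running {
			if shardOf(t, count) == shard {
				claimed = true
				break
			}
		}
		if !claimed {
			missing++
		}
	}
	return missing
}

func main() {
	// Made-up target names, for illustration only.
	targets := make([]string, 1000)
	for i := range targets {
		targets[i] = fmt.Sprintf("target-%d", i)
	}

	// Observed order: shards 0..3 already run with membersCount=5,
	// while shard 4 has not been created yet.
	observed := map[int]int{0: 5, 1: 5, 2: 5, 3: 5}

	// Proposed order: shard 4 starts first with membersCount=5,
	// while shards 0..3 still run with membersCount=4.
	proposed := map[int]int{0: 4, 1: 4, 2: 4, 3: 4, 4: 5}

	fmt.Println("unscraped, observed order:", unscraped(targets, observed))
	fmt.Println("unscraped, proposed order:", unscraped(targets, proposed))
}
```

In the observed order, every target that hashes to shard 4 has no owner until that shard starts, which should be roughly one fifth of the targets. In the proposed order, every target is still claimed by one of the old shards, so nothing is left unscraped in this model; the cost is that targets assigned to shard 4 are temporarily claimed by two shards.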
To Reproduce
shardCount
vm_promscrape_scraped_samples_sum
Version
v0.42.4
vmagent-20240301-013655-tags-v1.99.0-0-g9cd4b0537
Logs
No response
Screenshots
No response
Used command-line flags
No response
Additional information
No response