Bug: Store-gateway is stuck in starting phase #10279
Comments
@narqo can you help here?
From the store-gateway's logs in the screenshot, can you tell whether the system has managed to load all the discovered blocks, or whether it is stuck loading some of the remaining ones (e.g. compare the IDs of the blocks in the …)? You may want to collect a goroutine profile and a Go runtime trace to explore where exactly it is stuck.
Also, could you share more details about the system? Which version of Mimir is that? Can you share the configuration options you're running with (please make sure to redact any sensitive information from the config)? Can you attach the whole log file from the point of the store-gateway's start (as text, not a screenshot)?
Config File
Log
It is continuously loading new blocks. I've been unable to query anything beyond the past 24 hours of timestamps. Previously I was able to query 90 days of data, but after pushing the last 80 GB of TSDB data it is stuck in the starting phase.
@pracucci can you help here?
If the store-gateway is stuck in the starting phase and the local disk utilization is also growing, then it means the store-gateway is loading the new blocks (maybe very slowly, but loading). On the contrary, if the disk utilization is not growing, then it does look stuck, as you say. Which one of the two is it?
The disk space is growing. In the current scenario we saw something interesting: the blocks at /media/storage/tsdb-sync/anonymous are at …
We tested the same thing on a K8s cluster with a default config (we just added S3 credentials; a sketch of such a config is shown after this comment), and the store-gateway is still loading new blocks.
I calculated the TSDB blocks, so there are a total of …
Based on the points above, it seems that one single instance of …
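For reference, a minimal sketch of what a "default config plus S3 credentials" setup might look like; all names and values below are hypothetical placeholders, not taken from this report:

```yaml
# Hypothetical minimal Mimir config: defaults everywhere, only the S3
# object-storage backend and credentials filled in (placeholder values).
common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucket_name: example-mimir-blocks
      access_key_id: "<access-key-id>"
      secret_access_key: "<secret-access-key>"
```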
Is there any tool that can validate all the blocks in S3? We had around 1 TB of blocks, so we pushed them directly to S3. (We tested the ingester backfill approach, but a single instance was unable to handle this volume.) I have tested with mimirtool's bucket validation, and I don't see any errors there.
Also, if you need any metrics from Mimir, let me know; we have Prometheus scraping Mimir.
@narqo @pracucci Now the compactor is able to update the bucket index (mimir/pkg/storage/tsdb/bucketindex/updater.go, line 149 at fc8af05).
But the tsdb-sync performed by the store-gateway to fetch the index-headers is taking a long time (45 minutes per block). Currently the store-gateway has synced 414718 blocks, while the compactor's latest update lists 414728 blocks. Because we're still ingesting data, the compactor keeps compacting and uploading new blocks to S3, and because the tsdb-sync is slow it won't catch up with the blocks in bucket-index.json.gz. Below is the latest config. The average index file in S3 is 50 MB to 160 MB.
Do you still run a single instance of Mimir? How many CPU cores does the VM have? The compactors can be scaled out both horizontally and vertically (see the "scaling" section of its docs). You can check whether it is overloaded and needs more resources in the "Mimir / Compactor" and "Mimir / Compactor resources" dashboards. Could you post screenshots of these dashboards?
As for the store-gateway, you may check if its …
If I follow you right, it seems that the clue to where your store-gateway is busy is in this code path in the …
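For reference, the kinds of settings being discussed here live in the Mimir YAML config. A minimal sketch, assuming recent Mimir option names; the values are illustrative, not recommendations, so check the documentation for the version in use:

```yaml
# Hypothetical excerpt of a Mimir config touching the knobs discussed above.
compactor:
  # How many compactions a single compactor instance runs in parallel
  # (one way to scale the compactor vertically).
  compaction_concurrency: 4

blocks_storage:
  bucket_store:
    # How many blocks the store-gateway downloads/loads concurrently.
    block_sync_concurrency: 20
    # How many block meta.json files are fetched concurrently during a sync.
    meta_sync_concurrency: 20
    # How often the store-gateway re-syncs its view of the bucket.
    sync_interval: 15m
```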
@narqo I made some changes so the compactor doesn't update the bucket index so frequently.
Sharing logs
We also noticed that the store-gateway is loading the same blocks again and again.
We have now disabled lazy loading. The store-gateway is synced with bucket-index.json. Now it is generating …
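"Disabled lazy loading" presumably refers to the index-header lazy-loading option. A hedged sketch of what that setting looks like in the YAML config; the exact key has moved between Mimir versions, so verify it against the docs for the version in use:

```yaml
blocks_storage:
  bucket_store:
    index_header:
      # Load index-headers eagerly at store-gateway startup instead of on
      # first query. Older Mimir releases expose this as
      # blocks_storage.bucket_store.index_header_lazy_loading_enabled instead.
      lazy_loading_enabled: false
```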
In the logs you posted, the chunks …
No, it is not crashing. After changing the config we did restart the Mimir process.
Also, looking at the durations in the provided logs: if loading one block takes ~450 ms, then for 400K blocks and the …
I.e., in theory, the store-gateway needs about 2 hours to load the blocks' metadata from the bucket. Also, from the logs, Mimir is still running in monolithic mode, meaning that all of its components are actively fighting for the VM's resources: CPU, memory AND network bandwidth. Note that your VM only has 8 CPU cores, from your …
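A rough reconstruction of that back-of-the-envelope estimate, assuming the truncated sentence above refers to a sync concurrency of 20 (an assumption here, since the exact value is cut off):

$$
T \approx \frac{N \cdot t}{c} = \frac{4 \times 10^{5} \times 0.45\ \mathrm{s}}{20} = 9000\ \mathrm{s} \approx 2.5\ \mathrm{h}
$$

where $N$ is the number of blocks, $t$ the per-block load time seen in the logs, and $c$ the number of concurrent workers.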
I don't see any network issues regarding S3. Also, now that the store-gateway has loaded all the blocks and got the index-headers and sparse index-headers, why is it still in the starting phase? What are the possible reasons? We tested the same on a 32 CPU / 128 GB RAM VM and it is still in the starting phase.
What is the bug?
I recently uploaded around 90 GB of TSDB data directly to S3; after that, my store-gateway is stuck in the starting phase. I have enabled debug logs but don't see any error (sharing a screenshot). I have used this approach more than 7 times before, but now it is causing this problem. [Context: doing an InfluxDB-to-Mimir migration, using promtool to generate the TSDB blocks.]
How to reproduce it?
Push TSDB blocks to S3 and query the data using Grafana for time ranges older than 24 hours.
What did you think would happen?
I don't know why it is taking so long to load the TSDB blocks. It was working previously.
What was your environment?
Deployment was done using Puppet on a VM. Currently running a single instance on one VM.
Any additional context to share?
No response