Differences in RPS to S3 between using memcached and elasticache from store gateway #10337
Hi team. I'm looking for guidance on how to investigate an odd issue.
We recently switched from ElastiCache Memcached to in-cluster Memcached (using mimir-distributed).
Performance seems fine overall. However, since the switch the RPS to S3 is significantly higher, and after a cold cache restart it takes much longer to drop back to baseline.
Before the switch, an average 2-hour window with little usage (the vast majority of it ruler-related):
After the switch, once the system had populated the cache:
The RPS to Memcached is the same in both cases, and the chunks-cache hit ratio is 100% during that time.
So far I've suspected the chunks-cache, but having posted the above I notice that the hit ratio for postings on the index-cache drops as low as 60% at times, whereas for the earlier (pre-switch) window we have:
There it seems postings weren't being cached at all for most of the time?
Postings on the index-cache aside, part of me thinks the root issue is still with the chunks-cache: after a cold cache restart or a flush_all, it can take up to 3 hours for the RPS to S3 to drop from high back to a stable level, whereas before the switch, on ElastiCache, this happened within minutes.
I think I have aligned the Memcached settings with the ElastiCache defaults (initially it was running the config shipped with the mimir-distributed chart).
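One way to double-check that alignment is to capture the output of the memcached `stats settings` command from both deployments and diff it. The snippet below is a minimal sketch of that comparison. It assumes the raw `STAT <name> <value>` text has already been collected from each side; the helper names and the sample values are hypothetical and are not taken from either environment.

```python
def parse_stats(text):
    """Parse memcached `STAT <name> <value>` lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        # Skip the trailing END marker and anything that is not a STAT line.
        if len(parts) == 3 and parts[0] == "STAT":
            settings[parts[1]] = parts[2]
    return settings


def diff_settings(a, b):
    """Return {name: (value_a, value_b)} for settings that differ
    or that exist on only one side (missing values are None)."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Illustrative input only: made-up values, not real output from either cluster.
elasticache_raw = "STAT maxbytes 67108864\r\nSTAT item_size_max 1048576\r\nEND\r\n"
in_cluster_raw = "STAT maxbytes 67108864\r\nSTAT item_size_max 5242880\r\nSTAT maxconns 1024\r\nEND\r\n"

for name, (old, new) in diff_settings(
    parse_stats(elasticache_raw), parse_stats(in_cluster_raw)
).items():
    print(f"{name}: elasticache={old} in-cluster={new}")
```

Any setting that shows up in the diff (memory limit, maximum item size, connection limit) is a candidate for explaining different eviction or rejection behaviour between the two caches.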
The postings hit rate on the index-cache is the only difference I can see between pre- and post-migration on the Reads dashboard. How can we troubleshoot this further and get to the bottom of the higher RPS to S3 from the store-gateway?
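To compare the two periods outside the dashboard, the per-item-type hit ratio can be recomputed from raw counter snapshots taken at the start and end of each window. This is only a sketch: it assumes the request and hit counters have already been exported per item type, and the function name and the numbers used are hypothetical.

```python
def hit_ratio_by_item_type(start, end):
    """Compute hit ratio per item type over a window.

    `start` and `end` map item type -> (requests, hits), where both
    values are monotonically increasing counters sampled at the
    beginning and end of the window.
    """
    ratios = {}
    for item_type, (req_end, hits_end) in end.items():
        req_start, hits_start = start.get(item_type, (0, 0))
        delta_requests = req_end - req_start
        delta_hits = hits_end - hits_start
        # Item types with no requests in the window have no defined ratio.
        if delta_requests > 0:
            ratios[item_type] = delta_hits / delta_requests
    return ratios


# Made-up counter values, for illustration only.
window_start = {"Postings": (1000, 900)}
window_end = {"Postings": (2000, 1500), "Series": (500, 500)}

for item_type, ratio in hit_ratio_by_item_type(window_start, window_end).items():
    print(f"{item_type}: {ratio:.0%}")
```

Running the same calculation over a pre-switch and a post-switch window of equal length would show whether postings were really uncached before, or simply not requested.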
The only other thing that happened around the same time is that we updated to the latest weekly release, although nothing in the changelog between last week's and this week's release looks related. This is our dev environment with ~607k samples/sec, 60 exemplars/sec and 28.4M in-memory series. Nothing else has changed in terms of recording rules or metrics ingest; the only change is the switch from ElastiCache to in-cluster Memcached.