Differences in RPS to S3 between using memcached and elasticache from store gateway #10337

rknightion · 2025-01-03T09:06:51Z

Hi team. I'm looking for some guidance on how to investigate an odd issue.
We recently switched from ElastiCache Memcached to in-cluster Memcached (using mimir-distributed).
Everything seems to work fine in terms of performance. However, the RPS to S3 is significantly higher since the switch, and after a cold cache reboot it takes longer to come back down to baseline.

Before the switch, over an average 2-hour window with little usage (by far the most of it ruler-related):

After the switch, once the system had populated the cache:

The RPS to memcached is the same in both cases, and we have a 100% cache hit ratio for chunks-cache during that time.

So far I've been assuming that the chunks-cache is at fault, but now that I've posted the above I notice that the hit ratio for postings on the index cache drops as low as 60% at times, whereas for the earlier (pre-switch) time window we have:

There it seems postings weren't even being cached at all for the majority of the time?

Postings on the index-cache aside, part of me thinks the root issue is still with the chunks-cache: when we do a cold cache reboot or flush all, it can take up to 3 hours for the RPS to drop from high back down to a stable number (whereas before the switch, on ElastiCache, this happened within minutes).

I think I have aligned the Memcached settings with the ElastiCache defaults (although initially it ran with the config shipped with the mimir-distributed chart).

The postings hit rate on the index-cache is the only difference I see between pre- and post-migration on the Reads dashboard. How can we troubleshoot this further and get to the bottom of the higher RPS to S3 from the store gateway?
The only other thing that happened around the same time is that we updated to the latest weekly release, although nothing in the changelog between last week's and this week's release stands out. (This is our dev environment with ~607k samples/sec, 60 exemplars/sec and 28.4M in-memory series. Nothing else has changed in terms of recording rules or metrics ingest, only the switch from ElastiCache to memcached.)
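For anyone wanting to cross-check the dashboard numbers against memcached itself, here is a minimal sketch that reads the raw counters from the memcached text protocol `stats` command and computes a hit ratio. The function names are hypothetical, and the counters are per-server totals since process start, so they will not break down by item type (postings vs. chunks) the way the Reads dashboard does.

```python
import socket
from typing import Dict, Optional


def fetch_stats(host: str, port: int = 11211, timeout: float = 2.0) -> str:
    """Send the text-protocol `stats` command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        # The reply is a series of "STAT <name> <value>" lines ending in "END".
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return buf.decode("ascii", errors="replace")


def parse_stats(raw: str) -> Dict[str, str]:
    """Turn "STAT <name> <value>" lines into a dict; other lines are ignored."""
    stats: Dict[str, str] = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_ratio(stats: Dict[str, str]) -> Optional[float]:
    """get_hits / (get_hits + get_misses), or None if there were no gets."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else None
```

Running `hit_ratio(parse_stats(fetch_stats(host)))` against each memcached pod, and looking at the `evictions` counter in the same dict, would show whether a pod is churning items even though the dashboard reports a high hit ratio.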

rknightion · 2025-01-03T11:35:55Z

So I think I was able to solve my own issue (at least enough to make the RPS more manageable). Changing the memcached default config in the distributed chart to the below significantly reduced the time it takes to get back to baseline S3 RPS, and while the baseline is still 10x higher than before the migration, it's back in an affordable range. Below is what we did, in case anyone stumbles across the same issue (using the latest bitnami memcached chart):

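args:
  - /run.sh
  - -m 16384  # max memory for item storage, in MB (16 GiB)
  - --extended=modern  # enable the "modern" extended option set
  - -n 512  # minimum space allocated per item (key + value + flags), in bytes
  - -o hash_algorithm=jenkins,slab_chunk_max=512,hashpower=16,slab_automove=0,track_sizes=1  # hash table, slab and size-tracking tuning
  - -I 16777216  # max item size, in bytes (16 MiB)
  - -c 65000  # max simultaneous connections
  - -u 11211  # note: in memcached, -u is the run-as user (-p is the TCP port flag)
  - -U 11211  # UDP port
  - -f 1.75  # slab chunk size growth factor
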