Is this the supposed behavior of cache? #2967
Replies: 1 comment
-
This is correct because we only cache Parquet footers right now. @electron0zero is currently working on a change for Parquet page-level caching. Once this is added, TraceQL queries will make heavy use of the cache.
This is also likely correct because we cache bloom filters depending on the provided settings. So if you have 10k blocks, a trace-by-ID search will access 10k bloom filters and attempt to cache them.
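For reference, bloom filter caching is controlled from the `storage.trace` block. The values below are an illustrative sketch, not your actual configuration; check the Tempo docs for the defaults in your version:

```yaml
storage:
  trace:
    cache: memcached
    # Only cache blooms for blocks at or above this compaction level.
    cache_min_compaction_level: 0
    # Only cache blooms for blocks younger than this (0 = no age limit).
    cache_max_block_age: 0
```

Raising `cache_min_compaction_level` is one way to keep thousands of small, short-lived blocks from churning the cache.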
We do not cache results, but I would like to. We are seeing more patterns where people will put a TraceQL query on a dashboard that refreshes on a regular interval (e.g. every 30s). We have discussed frontend job caching as a way to help speed these queries up, but have not put work towards it yet.
You would see logs like this if Tempo were timing out against memcached:
This is consistent with bloom filter caching. You should also see fewer hits to your object storage the second time.
We constantly query all of our internal Tempo instances using the vulture to ensure correctness. This definitely inflates our memcached hit rate.
When the ingester writes a block, it should also write the bloom filters of that block to cache. This is occurring because you have this set:
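As a rough sketch of what that wiring usually looks like in the tempo-distributed values (hostnames and timeouts here are placeholders, assumed for illustration, not taken from the original post):

```yaml
storage:
  trace:
    cache: memcached
    memcached:
      host: memcached            # DNS name of the memcached service in your release
      service: memcached-client  # named port on that service
      timeout: 500ms             # raise this if you see timeout logs like the above
```

With `cache: memcached` set, both queriers and ingesters use the cache, which is why block flushes on the ingester also populate it.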
-
We have a new installation of Tempo and have the feeling we are doing memcache wrong.
We are using the tempo-distributed helm chart, relevant sections:
The behavior we see is that a TraceQL query makes the cache grow a little, while displaying a trace makes it grow by a lot. This is consistent with the idea that the cache is storing results, which would make sense: chances are that if someone retrieves a trace, they will share it with a colleague or return to it later.
But there are some things that make me suspicious:
In short, I doubt we have set this correctly.