Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check cached postings TTL before returning from cache + metrics #822

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Jan 17, 2025

In this PR I propose two changes to PostingsForMatchers cache:

  1. Check if the TTL for cached postings is still valid before returning it from cache, to fix a race condition that could happen between a goroutine running expire() and another one skipping the expire() execution because it's already in progress.
  2. Add metrics to PostingsForMatchers cache, to have better visibility over it. This is something I wanted to do since a long time. The design I picked is to allow to pass the metrics struct as DB options, so that we can use 1 single struct for all per-tenant TSDBs in a Mimir ingester.

The following benchmark shows the difference betweeen:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Apple M3 Pro
                                                          │ 01bb37aae.txt │           6c2603082.txt            │              main.txt              │               pr.txt               │
                                                          │    sec/op     │   sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │
PostingsForMatchersCache/no_evictions-11                      594.8n ± 3%   584.2n ± 1%   -1.78% (p=0.015 n=6)   602.9n ± 3%        ~ (p=0.589 n=6)   644.6n ± 1%   +8.38% (p=0.002 n=6)
PostingsForMatchersCache/high_eviction_rate-11                10.52µ ± 5%   10.39µ ± 1%   -1.28% (p=0.002 n=6)   10.85µ ± 1%        ~ (p=0.065 n=6)   10.77µ ± 0%        ~ (p=0.065 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11    1411.5n ± 2%   301.1n ± 1%  -78.67% (p=0.002 n=6)   306.2n ± 1%  -78.31% (p=0.002 n=6)   331.2n ± 2%  -76.54% (p=0.002 n=6)
geomean                                                       2.067µ        1.222µ       -40.86%                 1.260µ       -39.03%                 1.320µ       -36.15%

                                                          │ 01bb37aae.txt │             6c2603082.txt             │               main.txt                │               pr.txt                │
                                                          │     B/op      │     B/op      vs base                 │     B/op      vs base                 │     B/op      vs base               │
PostingsForMatchersCache/no_evictions-11                       958.0 ± 0%     958.0 ± 0%        ~ (p=1.000 n=6) ¹     958.0 ± 0%        ~ (p=1.000 n=6) ¹     974.0 ± 0%   +1.67% (p=0.002 n=6)
PostingsForMatchersCache/high_eviction_rate-11               26.38Ki ± 0%   26.38Ki ± 0%        ~ (p=1.000 n=6) ¹   26.38Ki ± 0%        ~ (p=1.000 n=6) ¹   26.32Ki ± 0%   -0.24% (p=0.002 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11     1441.0 ± 0%    1017.0 ± 0%  -29.42% (p=0.002 n=6)      1014.0 ± 0%  -29.63% (p=0.002 n=6)      1024.5 ± 0%  -28.90% (p=0.002 n=6)
geomean                                                      3.263Ki        2.905Ki       -10.97%                   2.902Ki       -11.05%                   2.926Ki       -10.33%
¹ all samples are equal

                                                          │ 01bb37aae.txt │            6c2603082.txt            │              main.txt               │               pr.txt                │
                                                          │   allocs/op   │ allocs/op   vs base                 │ allocs/op   vs base                 │ allocs/op   vs base                 │
PostingsForMatchersCache/no_evictions-11                       20.00 ± 0%   20.00 ± 0%        ~ (p=1.000 n=6) ¹   20.00 ± 0%        ~ (p=1.000 n=6) ¹   20.00 ± 0%        ~ (p=1.000 n=6) ¹
PostingsForMatchersCache/high_eviction_rate-11                 48.00 ± 0%   48.00 ± 0%        ~ (p=1.000 n=6) ¹   48.00 ± 0%        ~ (p=1.000 n=6) ¹   46.00 ± 0%   -4.17% (p=0.002 n=6)
PostingsForMatchersCache_ConcurrencyOnHighEvictionRate-11      29.00 ± 0%   21.00 ± 0%  -27.59% (p=0.002 n=6)     21.00 ± 0%  -27.59% (p=0.002 n=6)     20.00 ± 5%  -31.03% (p=0.002 n=6)
geomean                                                        30.31        27.22       -10.20%                   27.22       -10.20%                   26.40       -12.89%
¹ all samples are equal

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach LGTM.

It'd be interesting to benchmark this and compare the cost of checking the done channel with the approach prior to #734.

case <-oldPromise.done:
if c.timeNow().Sub(oldPromise.evaluationCompletedAt) >= c.ttl {
// The cached promise already expired, but it has not been evicted.
// TODO trace + metric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not something for this PR) Metrics for the PFMC in general would be handy - would be nice to see the rate of cache hits and misses, as well as the rate of cache evictions and their reason (TTL vs cache growing too large).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as well as the rate of cache evictions and their reason (TTL vs cache growing too large)

Will do in a separate PR.

@pracucci pracucci force-pushed the check-cached-postings-ttl-before-returning-from-cache branch from b826f27 to 69f78e0 Compare January 22, 2025 11:47
…otherGoroutineIsEvictingTheCache

Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci changed the title Check cached postings TTL before returning from cache Check cached postings TTL before returning from cache + metrics Jan 22, 2025
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

select {
case <-firstCallReceived:
case <-time.After(5 * time.Second):
// Do not block forever. The test will anyway in this case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a word missing in this comment - "the test will XXX (fail?) anyway in this case"

@@ -211,13 +254,15 @@ func (c *PostingsForMatchersCache) postingsForMatchersPromise(ctx context.Contex
}
}

c.metrics.hits.Inc()
span.AddEvent("using cached postingsForMatchers promise", trace.WithAttributes(
attribute.String("cache_key", key),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to include evaluationCompletedAt here as well, to replace what was reverted from #820.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants