
Add: exponential backoff for CAS operations on floats #1661

Merged: 5 commits merged into prometheus:main on Nov 11, 2024

Conversation

@imorph (Contributor) commented Oct 26, 2024

What problem this PR is solving

Hi!
Issues like cockroachdb/cockroach#133306 pushed me to investigate whether the client metrics library could be optimized for CPU and overall performance, especially under contention. That is what this PR is mostly about.

Proposed changes

Add a form of exponential backoff to the tight CAS loops, which should reduce contention and, as a result, lead to better latencies and lower CPU consumption. This is a well-known trick used in many projects that deal with concurrency.

In addition, this logic was refactored into a single place in the code, because atomically updating a float64 comes up in several parts of the codebase.
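
For illustration, here is a minimal sketch of the sleep-based backoff loop (the merged code may differ in details; it assumes the math, sync/atomic and time imports):

// Sketch of the backoff-based CAS loop for updating a float64 stored as
// uint64 bits. Illustrative only; see the PR diff for the actual code.
func atomicUpdateFloat(bits *uint64, updateFunc func(float64) float64) {
	const (
		maxBackoff     = 320 * time.Millisecond
		initialBackoff = 10 * time.Millisecond
	)
	backoff := initialBackoff

	for {
		loadedBits := atomic.LoadUint64(bits)
		oldFloat := math.Float64frombits(loadedBits)
		newFloat := updateFunc(oldFloat)
		newBits := math.Float64bits(newFloat)

		if atomic.CompareAndSwapUint64(bits, loadedBits, newBits) {
			return
		}
		// CAS lost the race: sleep before retrying, doubling the wait
		// (capped at maxBackoff) to reduce contention on the shared value.
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}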

Results

All test results are from AWS c7i.2xlarge type of instance.

main vs proposed for histograms:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                           │ ../../bhOLD.txt │           ../../bhNEW.txt           │
                           │     sec/op      │    sec/op     vs base               │
HistogramWithLabelValues-8      87.81n ±  1%   90.36n ±  1%   +2.90% (p=0.002 n=6)
HistogramNoLabels-8             36.15n ±  1%   39.55n ±  0%   +9.42% (p=0.002 n=6)
HistogramObserve1-8             31.23n ±  1%   32.52n ±  1%   +4.15% (p=0.002 n=6)
HistogramObserve2-8            242.45n ± 16%   86.89n ±  0%  -64.16% (p=0.002 n=6)
HistogramObserve4-8             372.8n ± 14%   175.0n ± 19%  -53.05% (p=0.002 n=6)
HistogramObserve8-8             953.0n ±  1%   415.4n ± 32%  -56.41% (p=0.002 n=6)
HistogramWrite1-8               1.456µ ± 11%   1.454µ ± 12%        ~ (p=0.699 n=6)
HistogramWrite2-8               3.103µ ±  3%   3.125µ ±  7%        ~ (p=0.485 n=6)
HistogramWrite4-8               6.087µ ±  3%   6.041µ ±  4%        ~ (p=0.937 n=6)
HistogramWrite8-8               11.71µ ±  2%   11.90µ ±  3%   +1.61% (p=0.004 n=6)
geomean                         554.5n         434.5n        -21.64%

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │      B/op       │    B/op     vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │    allocs/op    │ allocs/op   vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The changes lead to a slight latency increase in the single-threaded case but decrease latency significantly under contention.

main vs proposed for summaries:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                         │ ../../bsOLD.txt │           ../../bsNEW.txt           │
                         │     sec/op      │    sec/op     vs base               │
SummaryWithLabelValues-8      264.0n ±  1%   267.6n ±  3%   +1.34% (p=0.017 n=6)
SummaryNoLabels-8             186.6n ±  1%   186.6n ±  0%        ~ (p=0.920 n=6)
SummaryObserve1-8             22.57n ±  1%   23.66n ±  0%   +4.83% (p=0.002 n=6)
SummaryObserve2-8            192.75n ± 31%   52.13n ±  1%  -72.95% (p=0.002 n=6)
SummaryObserve4-8             297.9n ± 27%   126.7n ± 16%  -57.47% (p=0.002 n=6)
SummaryObserve8-8             937.5n ±  5%   319.1n ± 48%  -65.96% (p=0.002 n=6)
SummaryWrite1-8               135.1n ± 32%   134.1n ±  9%        ~ (p=0.485 n=6)
SummaryWrite2-8               319.9n ±  3%   327.4n ±  4%        ~ (p=0.240 n=6)
SummaryWrite4-8               665.7n ±  5%   664.0n ±  9%        ~ (p=0.699 n=6)
SummaryWrite8-8               1.402µ ±  9%   1.318µ ± 11%        ~ (p=0.180 n=6)
geomean                       274.3n         198.6n        -27.59%

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │      B/op       │    B/op     vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │    allocs/op    │ allocs/op   vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The same holds for summaries.

Some additional illustrations:
The gap between the implementations widens as the number of parallel/contending goroutines grows.
[charts: latency gap between the backoff and no-backoff implementations vs. number of contending goroutines]

A side-by-side comparison of the backoff and no-backoff implementations is in the atomic_update_test.go file:

BenchmarkAtomicUpdateFloat_SingleGoroutine-8   	94284844	       12.59 ns/op
BenchmarkAtomicNoBackoff_SingleGoroutine-8     	97815223	       12.24 ns/op
BenchmarkAtomicUpdateFloat_1Goroutine-8        	91764830	       15.45 ns/op
BenchmarkAtomicNoBackoff_1Goroutine-8          	13282502	       90.47 ns/op
BenchmarkAtomicUpdateFloat_2Goroutines-8       	67590978	       19.70 ns/op
BenchmarkAtomicNoBackoff_2Goroutines-8         	13490074	       89.20 ns/op
BenchmarkAtomicUpdateFloat_4Goroutines-8       	88974640	       14.74 ns/op
BenchmarkAtomicNoBackoff_4Goroutines-8         	13480230	       89.29 ns/op
BenchmarkAtomicUpdateFloat_8Goroutines-8       	85171494	       13.94 ns/op
BenchmarkAtomicNoBackoff_8Goroutines-8         	13474435	       89.39 ns/op
BenchmarkAtomicUpdateFloat_16Goroutines-8      	80687557	       14.96 ns/op
BenchmarkAtomicNoBackoff_16Goroutines-8        	13556298	       90.21 ns/op
BenchmarkAtomicUpdateFloat_32Goroutines-8      	71862175	       14.98 ns/op
BenchmarkAtomicNoBackoff_32Goroutines-8        	13522851	       88.90 ns/op

CPU profiles

for go test -bench='BenchmarkHistogramObserve*' -run=^# -count=6
old top:

      flat  flat%   sum%        cum   cum%
     0.56s  0.34%  0.34%    166.28s 99.93%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.57s  0.34%  0.68%    165.71s 99.59%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
    32.65s 19.62% 20.30%    128.03s 76.94%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
    56.26s 33.81% 54.11%     95.38s 57.32%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
    29.38s 17.66% 71.77%     39.12s 23.51%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     0.25s  0.15% 71.92%     37.11s 22.30%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
    33.38s 20.06% 91.98%     36.86s 22.15%  sort.SearchFloat64s (inline)
     9.10s  5.47% 97.45%      9.10s  5.47%  math.Float64frombits (inline)
     2.05s  1.23% 98.68%      3.46s  2.08%  sort.Search
     1.41s  0.85% 99.53%      1.41s  0.85%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

new top:

      flat  flat%   sum%        cum   cum%
     0.48s  1.94%  1.94%     24.61s 99.31%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.41s  1.65%  3.59%     24.13s 97.38%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
     7.89s 31.84% 35.43%     19.77s 79.78%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
        5s 20.18% 55.61%     11.88s 47.94%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
         0     0% 55.61%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     5.99s 24.17% 79.78%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicUpdateFloat
     0.18s  0.73% 80.51%      3.95s 15.94%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
     0.51s  2.06% 82.57%      3.77s 15.21%  sort.SearchFloat64s (inline)
     2.01s  8.11% 90.68%      3.26s 13.16%  sort.Search
     1.25s  5.04% 95.72%      1.25s  5.04%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

Downsides

Everything comes with a price; in this case:

  • the code is slightly more complex
  • some magic constants are introduced (10ms and 320ms, mostly the result of empirical testing)
  • single-threaded performance is 4-5% weaker
  • time.Sleep introduces an additional syscall (not entirely sure about that)

Further improvements

  • Some random jitter could be added to avoid a thundering-herd problem (see the sketch after this list)
  • I briefly explored a hybrid approach with a staged backoff, where the first stage uses runtime.Gosched() and only later falls back to time.Sleep, but the results did not look impressive
  • better code layout?
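
For the jitter idea, a hypothetical variant (not part of this PR) could randomize the sleep around the current backoff value, e.g. with math/rand:

// Hypothetical jitter on top of the exponential backoff; not part of the
// merged change. backoff is the current backoff duration from the CAS loop.
jitter := time.Duration(rand.Int63n(int64(backoff / 2))) // up to +50% extra wait
time.Sleep(backoff + jitter)
backoff *= 2
if backoff > maxBackoff {
	backoff = maxBackoff
}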

cc: @vesari @ArthurSens @bwplotka @kakkoyun

@imorph (Contributor, Author) commented Oct 26, 2024

The TestSummaryDecay test is failing, but this failure seems unrelated to my changes (maybe I'm missing something, but it looks like a different code path).

Most likely it fails because, in the summary struct, observations added via Observe are buffered in hotBuf. When hotBuf reaches its capacity or expires, an asynchronous flush is triggered by calling asyncFlush. This flush moves the observations to coldBuf and processes them in a separate goroutine. The Write method processes coldBuf synchronously but does not include observations still in hotBuf or those being processed asynchronously in the background goroutine. The test rapidly adds observations and periodically calls sum.Write(). Due to the asynchronous processing, some observations may not have been fully processed and included in the quantile calculations when Write is called, which could lead to random test failures.

But I might be wrong.
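
Purely as an illustration of the shape of that race (hypothetical names, not the actual summary internals):

// Illustrative only: a synchronous Write racing with an asynchronous flush.
type toyBuffer struct {
	mtx    sync.Mutex
	hot    []float64 // recent observations, not yet merged
	merged []float64 // what Write reads from
}

func (t *toyBuffer) Observe(v float64) {
	t.mtx.Lock()
	t.hot = append(t.hot, v)
	full := len(t.hot) >= 500
	t.mtx.Unlock()
	if full {
		go t.flush() // merge happens later, in another goroutine
	}
}

func (t *toyBuffer) flush() {
	t.mtx.Lock()
	t.merged = append(t.merged, t.hot...)
	t.hot = t.hot[:0]
	t.mtx.Unlock()
}

// Write sees only what has already been merged; observations still sitting
// in hot (or being flushed concurrently) are missing from the result.
func (t *toyBuffer) Write() int {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	return len(t.merged)
}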

@imorph mentioned this pull request on Oct 27, 2024
@sean- commented Oct 29, 2024

Prom histograms have historically been slow relative to other histogram implementations.

https://github.com/sean-/bench-go-histograms/tree/main?tab=readme-ov-file#benchmark-for-go-histogram-implementations

@kakkoyun (Member) commented:

I plan to review this PR this weekend. Sorry for the tardiness.

@ArthurSens (Member) left a comment:

Performance PRs are accumulating across different repos. I'm taking this weekend to study and unblock them. For this PR, I had to read more about sync/atomic; I wasn't familiar with it, which is probably why I was procrastinating on this review so much. Apologies!

> Downsides
> Everything comes with a price; in this case:
> • the code is slightly more complex

What you are doing is just an exponential backoff; it doesn't sound too complex :)
The complicated part for me was understanding the already-existing code around atomic additions and updates.

> • some magic constants are introduced (10ms and 320ms, mostly the result of empirical testing)

I'm fine with keeping these numbers; even if other machines might perform slightly better with different values, overall it should be an improvement for all of them. But let's add a comment explaining where they come from. You can even link your PR description in the code comment so people can easily look at your benchmarks.

> • single-threaded performance is 4-5% weaker

That should be fine; single-threaded environments are rare nowadays.

> • time.Sleep introduces an additional syscall (not entirely sure about that)

This is only called if the first CAS fails, which doesn't happen on the happy path.
I couldn't understand what you mean by "additional syscall". Are you concerned that we're doing some kind of syscall that we weren't doing before, or about the number of syscalls potentially increasing?

Comment on lines 25 to 28
const (
maxBackoff = 320 * time.Millisecond
initialBackoff = 10 * time.Millisecond
)

Could we add a comment explaining that these numbers were chosen based on empirical testing in one particular environment (AWS c7i.2xlarge)?

@imorph (Contributor, Author) commented Nov 5, 2024

@ArthurSens

> This is only called if the first CAS fails, which doesn't happen on the happy path.

It is only happy in the single-threaded case. As long as we have contending goroutines trying to CAS the atomic value, we inevitably get a lot of sleeps, which produce (still unsure =) ) syscalls, and that is not free. Some projects are sensitive to an increasing number of syscalls; the latest example from the same folks at cockroachdb: cockroachdb/cockroach#133315

> Or about the number of syscalls potentially increasing?

Yes, exactly that.

@ArthurSens (Member) commented Nov 5, 2024

Go's sleep implementation surely does syscalls; how much of a problem that is, I'm not certain.

One suggestion from the link you provided is to use time.Since, or maybe you could use a time.Ticker if we could find a way to make it exponential. What do you think?

@imorph (Contributor, Author) commented Nov 9, 2024

@ArthurSens

> One suggestion from the link you provided is to use time.Since

I tried this implementation, which avoids sleeps by using a combination of time.Since and runtime.Gosched():

func atomicUpdateFloat(bits *uint64, updateFunc func(float64) float64) {
	const (
		maxBackoff     = 320 * time.Millisecond
		initialBackoff = 10 * time.Millisecond
	)
	backoff := initialBackoff

	for {
		loadedBits := atomic.LoadUint64(bits)
		oldFloat := math.Float64frombits(loadedBits)
		newFloat := updateFunc(oldFloat)
		newBits := math.Float64bits(newFloat)

		if atomic.CompareAndSwapUint64(bits, loadedBits, newBits) {
			break
		} else {
			// Use time.Since to wait without syscalls
			start := time.Now()
			for time.Since(start) < backoff {
				// Yield to the scheduler to prevent hogging the CPU
				runtime.Gosched()
			}
			// Exponential backoff with cap
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
	}
}

It is better from a performance point of view:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: INTEL(R) XEON(R) GOLD 5512U
                                     │     sleep     │                since                │
                                     │    sec/op     │    sec/op     vs base               │
AtomicUpdateFloat_1Goroutine-56         41.49n ± 13%   22.48n ±  2%  -45.84% (p=0.002 n=6)
AtomicUpdateFloat_2Goroutines-56        49.48n ±  6%   22.56n ±  2%  -54.40% (p=0.002 n=6)
AtomicUpdateFloat_4Goroutines-56        58.94n ±  3%   22.82n ±  3%  -61.27% (p=0.002 n=6)
AtomicUpdateFloat_8Goroutines-56        64.47n ±  4%   23.20n ±  1%  -64.02% (p=0.002 n=6)
AtomicUpdateFloat_16Goroutines-56       73.23n ±  8%   23.49n ±  1%  -67.91% (p=0.002 n=6)
AtomicUpdateFloat_32Goroutines-56       86.44n ±  7%   24.49n ±  3%  -71.67% (p=0.002 n=6)
AtomicUpdateFloat_64Goroutines-56       97.09n ±  5%   26.35n ±  1%  -72.86% (p=0.002 n=6)
AtomicUpdateFloat_128Goroutines-56     110.50n ±  4%   30.08n ± 11%  -72.78% (p=0.002 n=6)
AtomicUpdateFloat_256Goroutines-56     120.45n ±  8%   35.05n ± 12%  -70.90% (p=0.002 n=6)
geomean                                 64.44n         24.66n        -61.74%

but a total disaster in terms of CPU consumption (this is a 28-core Xeon):
[CPU utilization charts for the Gosched busy-wait variant]

The Sleep variant consumes almost no CPU (especially sys):
[CPU utilization charts for the time.Sleep variant]

> or maybe you could use a time.Ticker if we could find a way to make it exponential.

I tried to craft an implementation, but it looked so awful that I decided it should not see the light of day =)

Another approach was to use time.NewTimer(0) and reset it with an exponentially growing backoff, but that implementation was two orders of magnitude slower than a simple, honest Sleep =)

> What do you think?

From what I see, just using time.Sleep strikes a nice balance between (good) performance and (almost no) resource consumption.

@kakkoyun added this to the v1.21.0 milestone on Nov 11, 2024
@kakkoyun requested a review from beorn7 on Nov 11, 2024
@bwplotka (Member) left a comment:

Love it: super clean, well benchmarked, and a clear benefit. There is indeed a slight slowdown for the single-threaded case, but it is negligible compared to the benefits with more goroutines.

LGTM, thanks!

Comment on the benchmark code:

output = val
}

func BenchmarkAtomicUpdateFloat_1Goroutine(b *testing.B) {

It feels like we could use b.Run here, as I was proposing in https://www.bwplotka.dev/2024/go-microbenchmarks-benchstat/

Any good reason NOT to do it? (e.g. "goroutines=%v")

Nevertheless, we can do it later; not a blocker for this PR, thanks!
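
For reference, a rough sketch of that suggestion (illustrative only; it assumes the atomicUpdateFloat helper from this PR plus the fmt and testing imports):

// Sketch of folding the per-goroutine-count benchmarks into one b.Run table.
// Note: b.SetParallelism(p) runs p*GOMAXPROCS goroutines in RunParallel, so
// the sub-benchmark name is a scaling factor rather than an exact count.
func BenchmarkAtomicUpdateFloat(b *testing.B) {
	var bits uint64
	for _, p := range []int{1, 2, 4, 8, 16, 32} {
		b.Run(fmt.Sprintf("goroutines=%v", p), func(b *testing.B) {
			b.SetParallelism(p)
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					atomicUpdateFloat(&bits, func(f float64) float64 { return f + 1.0 })
				}
			})
		})
	}
}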

@bwplotka bwplotka merged commit a934c35 into prometheus:main Nov 11, 2024
10 checks passed
@bwplotka (Member) commented:

Epic work @ArthurSens on checking alternatives, and @imorph for trying those out ❤️

There is also #1642, which discusses more complex concurrency algorithms (adaptive locks); we can continue the discussion there.

shivanthzen pushed a commit to shivanthzen/client_golang that referenced this pull request Nov 13, 2024
* add: exponential backoff for CAS operations of floats

Signed-off-by: Ivan Goncharov <[email protected]>

* add: some more benchmark use cases (higher contention)

Signed-off-by: Ivan Goncharov <[email protected]>

* fmt: fumpted some files

Signed-off-by: Ivan Goncharov <[email protected]>

* add: license header

Signed-off-by: Ivan Goncharov <[email protected]>

* add: comment explaining origin of backoff constants

Signed-off-by: Ivan Goncharov <[email protected]>

---------

Signed-off-by: Ivan Goncharov <[email protected]>
@ywwg (Member) commented Nov 19, 2024

I think this change broke histogram_test for me: the test hangs, does not complete after 10 minutes, and gets killed. I raised the issue in the CNCF Slack, and Bryan noted that all the time is spent in the Sleep: https://cloud-native.slack.com/archives/C037NMVUAP3/p1732032254779259

Reverting this change fixes the issue for me.

@ywwg (Member) commented Nov 19, 2024

Yes, it appears that the test is highly contended and the sleeps are overly aggressive in trying to resolve that contention, causing the runtime to be much longer than it needs to be.

@ywwg (Member) commented Nov 19, 2024

I believe adding a sleep to the observe() func in the TestHistogramAtomicObserve test would resolve the issue. @bwplotka, can you take a look?
