
Add: exponential backoff for CAS operations on floats #1661

Merged: 5 commits merged into prometheus:main on Nov 11, 2024

Conversation

@imorph (Contributor) commented Oct 26, 2024

What problem this PR is solving

Hi!
Issues like cockroachdb/cockroach#133306 pushed me to investigate whether the client metrics library could be optimized for CPU and overall performance, especially under contention. That is what this PR is mostly about.

Proposed changes

Add a form of exponential backoff to the tight CAS loops, which should reduce contention and, as a result, lead to better latencies and lower CPU consumption. This is a well-known trick used in many projects that deal with concurrency.

In addition, this logic was refactored into a single place in the code, because atomically updating a float64 comes up in several parts of the codebase.
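
For illustration, here is a minimal sketch of the sleep-based backoff loop (the merged code may differ in details; it assumes the math, sync/atomic and time imports):

// Sketch of the backoff-based CAS loop for updating a float64 stored as
// uint64 bits. Illustrative only; see the PR diff for the actual code.
func atomicUpdateFloat(bits *uint64, updateFunc func(float64) float64) {
	const (
		maxBackoff     = 320 * time.Millisecond
		initialBackoff = 10 * time.Millisecond
	)
	backoff := initialBackoff

	for {
		loadedBits := atomic.LoadUint64(bits)
		oldFloat := math.Float64frombits(loadedBits)
		newFloat := updateFunc(oldFloat)
		newBits := math.Float64bits(newFloat)

		if atomic.CompareAndSwapUint64(bits, loadedBits, newBits) {
			return
		}
		// CAS lost the race: sleep before retrying, doubling the wait
		// (capped at maxBackoff) to reduce contention on the shared value.
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}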

Results

All test results are from AWS c7i.2xlarge type of instance.

main vs proposed for histograms:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                           │ ../../bhOLD.txt │           ../../bhNEW.txt           │
                           │     sec/op      │    sec/op     vs base               │
HistogramWithLabelValues-8      87.81n ±  1%   90.36n ±  1%   +2.90% (p=0.002 n=6)
HistogramNoLabels-8             36.15n ±  1%   39.55n ±  0%   +9.42% (p=0.002 n=6)
HistogramObserve1-8             31.23n ±  1%   32.52n ±  1%   +4.15% (p=0.002 n=6)
HistogramObserve2-8            242.45n ± 16%   86.89n ±  0%  -64.16% (p=0.002 n=6)
HistogramObserve4-8             372.8n ± 14%   175.0n ± 19%  -53.05% (p=0.002 n=6)
HistogramObserve8-8             953.0n ±  1%   415.4n ± 32%  -56.41% (p=0.002 n=6)
HistogramWrite1-8               1.456µ ± 11%   1.454µ ± 12%        ~ (p=0.699 n=6)
HistogramWrite2-8               3.103µ ±  3%   3.125µ ±  7%        ~ (p=0.485 n=6)
HistogramWrite4-8               6.087µ ±  3%   6.041µ ±  4%        ~ (p=0.937 n=6)
HistogramWrite8-8               11.71µ ±  2%   11.90µ ±  3%   +1.61% (p=0.004 n=6)
geomean                         554.5n         434.5n        -21.64%

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │      B/op       │    B/op     vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                           │ ../../bhOLD.txt │          ../../bhNEW.txt           │
                           │    allocs/op    │ allocs/op   vs base                │
HistogramWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HistogramNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                    ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The changes lead to a slight latency increase in the single-threaded case but decrease latency significantly under contention.

main vs proposed for summaries:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                         │ ../../bsOLD.txt │           ../../bsNEW.txt           │
                         │     sec/op      │    sec/op     vs base               │
SummaryWithLabelValues-8      264.0n ±  1%   267.6n ±  3%   +1.34% (p=0.017 n=6)
SummaryNoLabels-8             186.6n ±  1%   186.6n ±  0%        ~ (p=0.920 n=6)
SummaryObserve1-8             22.57n ±  1%   23.66n ±  0%   +4.83% (p=0.002 n=6)
SummaryObserve2-8            192.75n ± 31%   52.13n ±  1%  -72.95% (p=0.002 n=6)
SummaryObserve4-8             297.9n ± 27%   126.7n ± 16%  -57.47% (p=0.002 n=6)
SummaryObserve8-8             937.5n ±  5%   319.1n ± 48%  -65.96% (p=0.002 n=6)
SummaryWrite1-8               135.1n ± 32%   134.1n ±  9%        ~ (p=0.485 n=6)
SummaryWrite2-8               319.9n ±  3%   327.4n ±  4%        ~ (p=0.240 n=6)
SummaryWrite4-8               665.7n ±  5%   664.0n ±  9%        ~ (p=0.699 n=6)
SummaryWrite8-8               1.402µ ±  9%   1.318µ ± 11%        ~ (p=0.180 n=6)
geomean                       274.3n         198.6n        -27.59%

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │      B/op       │    B/op     vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                         │ ../../bsOLD.txt │          ../../bsNEW.txt           │
                         │    allocs/op    │ allocs/op   vs base                │
SummaryWithLabelValues-8      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
SummaryNoLabels-8             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                  ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The same holds for summaries.

Some additional illustrations:
The gap between the implementations widens as the number of parallel/contending goroutines grows.
[charts: latency gap between the backoff and no-backoff implementations vs. number of contending goroutines]

A side-by-side comparison of the backoff and no-backoff implementations is in the atomic_update_test.go file:

BenchmarkAtomicUpdateFloat_SingleGoroutine-8   	94284844	       12.59 ns/op
BenchmarkAtomicNoBackoff_SingleGoroutine-8     	97815223	       12.24 ns/op
BenchmarkAtomicUpdateFloat_1Goroutine-8        	91764830	       15.45 ns/op
BenchmarkAtomicNoBackoff_1Goroutine-8          	13282502	       90.47 ns/op
BenchmarkAtomicUpdateFloat_2Goroutines-8       	67590978	       19.70 ns/op
BenchmarkAtomicNoBackoff_2Goroutines-8         	13490074	       89.20 ns/op
BenchmarkAtomicUpdateFloat_4Goroutines-8       	88974640	       14.74 ns/op
BenchmarkAtomicNoBackoff_4Goroutines-8         	13480230	       89.29 ns/op
BenchmarkAtomicUpdateFloat_8Goroutines-8       	85171494	       13.94 ns/op
BenchmarkAtomicNoBackoff_8Goroutines-8         	13474435	       89.39 ns/op
BenchmarkAtomicUpdateFloat_16Goroutines-8      	80687557	       14.96 ns/op
BenchmarkAtomicNoBackoff_16Goroutines-8        	13556298	       90.21 ns/op
BenchmarkAtomicUpdateFloat_32Goroutines-8      	71862175	       14.98 ns/op
BenchmarkAtomicNoBackoff_32Goroutines-8        	13522851	       88.90 ns/op

CPU profiles

for go test -bench='BenchmarkHistogramObserve*' -run=^# -count=6
old top:

      flat  flat%   sum%        cum   cum%
     0.56s  0.34%  0.34%    166.28s 99.93%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.57s  0.34%  0.68%    165.71s 99.59%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
    32.65s 19.62% 20.30%    128.03s 76.94%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
    56.26s 33.81% 54.11%     95.38s 57.32%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
    29.38s 17.66% 71.77%     39.12s 23.51%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     0.25s  0.15% 71.92%     37.11s 22.30%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
    33.38s 20.06% 91.98%     36.86s 22.15%  sort.SearchFloat64s (inline)
     9.10s  5.47% 97.45%      9.10s  5.47%  math.Float64frombits (inline)
     2.05s  1.23% 98.68%      3.46s  2.08%  sort.Search
     1.41s  0.85% 99.53%      1.41s  0.85%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

new top:

      flat  flat%   sum%        cum   cum%
     0.48s  1.94%  1.94%     24.61s 99.31%  github.com/prometheus/client_golang/prometheus.benchmarkHistogramObserve.func1
     0.41s  1.65%  3.59%     24.13s 97.38%  github.com/prometheus/client_golang/prometheus.(*histogram).Observe
     7.89s 31.84% 35.43%     19.77s 79.78%  github.com/prometheus/client_golang/prometheus.(*histogram).observe
        5s 20.18% 55.61%     11.88s 47.94%  github.com/prometheus/client_golang/prometheus.(*histogramCounts).observe
         0     0% 55.61%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicAddFloat (inline)
     5.99s 24.17% 79.78%      6.88s 27.76%  github.com/prometheus/client_golang/prometheus.atomicUpdateFloat
     0.18s  0.73% 80.51%      3.95s 15.94%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket
     0.51s  2.06% 82.57%      3.77s 15.21%  sort.SearchFloat64s (inline)
     2.01s  8.11% 90.68%      3.26s 13.16%  sort.Search
     1.25s  5.04% 95.72%      1.25s  5.04%  github.com/prometheus/client_golang/prometheus.(*histogram).findBucket.SearchFloat64s.func1

Downsides

Everything comes with a price; in this case:

  • the code is slightly more complex
  • some magic constants are introduced (10ms and 320ms, mostly the result of empirical testing)
  • single-threaded performance is 4-5% weaker
  • time.Sleep introduces an additional syscall (not entirely sure about that)

Further improvements

  • Some random jitter could be added to avoid a thundering-herd problem (see the sketch after this list)
  • I briefly explored a hybrid approach with a staged backoff, where the first stage uses runtime.Gosched() and only later falls back to time.Sleep, but the results did not look impressive
  • better code layout?
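
For the jitter idea, a hypothetical variant (not part of this PR) could randomize the sleep around the current backoff value, e.g. with math/rand:

// Hypothetical jitter on top of the exponential backoff; not part of the
// merged change. backoff is the current backoff duration from the CAS loop.
jitter := time.Duration(rand.Int63n(int64(backoff / 2))) // up to +50% extra wait
time.Sleep(backoff + jitter)
backoff *= 2
if backoff > maxBackoff {
	backoff = maxBackoff
}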

cc: @vesari @ArthurSens @bwplotka @kakkoyun

@imorph (Contributor, Author) commented Oct 26, 2024

The TestSummaryDecay test is failing, but this failure seems unrelated to my changes (maybe I'm missing something, but it looks like a different code path).

Most likely it fails because, in the summary struct, observations added via Observe are buffered in hotBuf. When hotBuf reaches its capacity or expires, an asynchronous flush is triggered by calling asyncFlush. This flush moves the observations to coldBuf and processes them in a separate goroutine. The Write method processes coldBuf synchronously but does not include observations still in hotBuf or those being processed asynchronously in the background goroutine. The test rapidly adds observations and periodically calls sum.Write(). Due to the asynchronous processing, some observations may not have been fully processed and included in the quantile calculations when Write is called, which could lead to random test failures.

But I might be wrong.
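
Purely as an illustration of the shape of that race (hypothetical names, not the actual summary internals):

// Illustrative only: a synchronous Write racing with an asynchronous flush.
type toyBuffer struct {
	mtx    sync.Mutex
	hot    []float64 // recent observations, not yet merged
	merged []float64 // what Write reads from
}

func (t *toyBuffer) Observe(v float64) {
	t.mtx.Lock()
	t.hot = append(t.hot, v)
	full := len(t.hot) >= 500
	t.mtx.Unlock()
	if full {
		go t.flush() // merge happens later, in another goroutine
	}
}

func (t *toyBuffer) flush() {
	t.mtx.Lock()
	t.merged = append(t.merged, t.hot...)
	t.hot = t.hot[:0]
	t.mtx.Unlock()
}

// Write sees only what has already been merged; observations still sitting
// in hot (or being flushed concurrently) are missing from the result.
func (t *toyBuffer) Write() int {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	return len(t.merged)
}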

@imorph mentioned this pull request on Oct 27, 2024
@sean- commented Oct 29, 2024

Prom histograms have historically been slow relative to other histogram implementations.

https://github.com/sean-/bench-go-histograms/tree/main?tab=readme-ov-file#benchmark-for-go-histogram-implementations

@kakkoyun (Member) commented:

I plan to review this PR this weekend. Sorry for the tardiness.

@ArthurSens (Member) left a comment:

Performance PRs are accumulating across different repos. I'm taking this weekend to study and unblock them. For this PR, I had to read more about sync/atomic; I wasn't familiar with it, which is probably why I was procrastinating on this review so much. Apologies!

> Downsides
> Everything comes with a price; in this case:
> • the code is slightly more complex

What you are doing is just an exponential backoff; it doesn't sound too complex :)
The complicated part for me was understanding the already-existing code around atomic additions and updates.

> • some magic constants are introduced (10ms and 320ms, mostly the result of empirical testing)

I'm fine with keeping these numbers; even if other machines might perform slightly better with different values, overall it should be an improvement for all of them. But let's add a comment explaining where they come from. You can even link your PR description in the code comment so people can easily look at your benchmarks.

> • single-threaded performance is 4-5% weaker

That should be fine; single-threaded environments are rare nowadays.

> • time.Sleep introduces an additional syscall (not entirely sure about that)

This is only called if the first CAS fails, which doesn't happen on the happy path.
I couldn't understand what you mean by "additional syscall". Are you concerned that we're doing some kind of syscall that we weren't doing before, or about the number of syscalls potentially increasing?

Comment on lines 25 to 28
const (
maxBackoff = 320 * time.Millisecond
initialBackoff = 10 * time.Millisecond
)

Could we add a comment explaining that these numbers were chosen based on empirical testing in one particular environment (AWS c7i.2xlarge)?

@imorph (Contributor, Author) commented Nov 5, 2024

@ArthurSens

> This is only called if the first CAS fails, which doesn't happen on the happy path.

It is only happy in the single-threaded case. As long as we have contending goroutines trying to CAS the atomic value, we inevitably get a lot of sleeps, which produce (still unsure =) ) syscalls, and that is not free. Some projects are sensitive to an increasing number of syscalls; the latest example from the same folks at cockroachdb: cockroachdb/cockroach#133315

> Or about the number of syscalls potentially increasing?

Yes, exactly that.

@ArthurSens (Member) commented Nov 5, 2024

Go's sleep implementation surely does syscalls; how much of a problem that is, I'm not certain.

One suggestion from the link you provided is to use time.Since, or maybe you could use a time.Ticker if we could find a way to make it exponential. What do you think?

@imorph (Contributor, Author) commented Nov 9, 2024

@ArthurSens

> One suggestion from the link you provided is to use time.Since

I tried this implementation, which avoids sleeps by using a combination of time.Since and runtime.Gosched():

func atomicUpdateFloat(bits *uint64, updateFunc func(float64) float64) {
	const (
		maxBackoff     = 320 * time.Millisecond
		initialBackoff = 10 * time.Millisecond
	)
	backoff := initialBackoff

	for {
		loadedBits := atomic.LoadUint64(bits)
		oldFloat := math.Float64frombits(loadedBits)
		newFloat := updateFunc(oldFloat)
		newBits := math.Float64bits(newFloat)

		if atomic.CompareAndSwapUint64(bits, loadedBits, newBits) {
			break
		} else {
			// Use time.Since to wait without syscalls
			start := time.Now()
			for time.Since(start) < backoff {
				// Yield to the scheduler to prevent hogging the CPU
				runtime.Gosched()
			}
			// Exponential backoff with cap
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
	}
}

It is better from a performance point of view:

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: INTEL(R) XEON(R) GOLD 5512U
                                     │     sleep     │                since                │
                                     │    sec/op     │    sec/op     vs base               │
AtomicUpdateFloat_1Goroutine-56         41.49n ± 13%   22.48n ±  2%  -45.84% (p=0.002 n=6)
AtomicUpdateFloat_2Goroutines-56        49.48n ±  6%   22.56n ±  2%  -54.40% (p=0.002 n=6)
AtomicUpdateFloat_4Goroutines-56        58.94n ±  3%   22.82n ±  3%  -61.27% (p=0.002 n=6)
AtomicUpdateFloat_8Goroutines-56        64.47n ±  4%   23.20n ±  1%  -64.02% (p=0.002 n=6)
AtomicUpdateFloat_16Goroutines-56       73.23n ±  8%   23.49n ±  1%  -67.91% (p=0.002 n=6)
AtomicUpdateFloat_32Goroutines-56       86.44n ±  7%   24.49n ±  3%  -71.67% (p=0.002 n=6)
AtomicUpdateFloat_64Goroutines-56       97.09n ±  5%   26.35n ±  1%  -72.86% (p=0.002 n=6)
AtomicUpdateFloat_128Goroutines-56     110.50n ±  4%   30.08n ± 11%  -72.78% (p=0.002 n=6)
AtomicUpdateFloat_256Goroutines-56     120.45n ±  8%   35.05n ± 12%  -70.90% (p=0.002 n=6)
geomean                                 64.44n         24.66n        -61.74%

but a total disaster in terms of CPU consumption (this is a 28-core Xeon):
[CPU utilization charts for the Gosched busy-wait variant]

The Sleep variant consumes almost no CPU (especially sys):
[CPU utilization charts for the time.Sleep variant]

> or maybe you could use a time.Ticker if we could find a way to make it exponential.

I tried to craft an implementation, but it looked so awful that I decided it should not see the light of day =)

Another approach was to use time.NewTimer(0) and reset it with an exponentially growing backoff, but that implementation was two orders of magnitude slower than a simple, honest Sleep =)

> What do you think?

From what I see, just using time.Sleep strikes a nice balance between (good) performance and (almost no) resource consumption.

@kakkoyun added this to the v1.21.0 milestone on Nov 11, 2024
@kakkoyun requested a review from beorn7 on Nov 11, 2024
@bwplotka (Member) left a comment:

Love it: super clean, well benchmarked, and a clear benefit. There is indeed a slight slowdown for the single-threaded case, but it is negligible compared to the benefits with more goroutines.

LGTM, thanks!

Comment on the benchmark code:

output = val
}

func BenchmarkAtomicUpdateFloat_1Goroutine(b *testing.B) {

It feels like we could use b.Run here, as I was proposing in https://www.bwplotka.dev/2024/go-microbenchmarks-benchstat/

Any good reason NOT to do it? (e.g. "goroutines=%v")

Nevertheless, we can do it later; not a blocker for this PR, thanks!
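
For reference, a rough sketch of that suggestion (illustrative only; it assumes the atomicUpdateFloat helper from this PR plus the fmt and testing imports):

// Sketch of folding the per-goroutine-count benchmarks into one b.Run table.
// Note: b.SetParallelism(p) runs p*GOMAXPROCS goroutines in RunParallel, so
// the sub-benchmark name is a scaling factor rather than an exact count.
func BenchmarkAtomicUpdateFloat(b *testing.B) {
	var bits uint64
	for _, p := range []int{1, 2, 4, 8, 16, 32} {
		b.Run(fmt.Sprintf("goroutines=%v", p), func(b *testing.B) {
			b.SetParallelism(p)
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					atomicUpdateFloat(&bits, func(f float64) float64 { return f + 1.0 })
				}
			})
		})
	}
}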

@bwplotka bwplotka merged commit a934c35 into prometheus:main Nov 11, 2024
10 checks passed
@bwplotka (Member) commented:

Epic work @ArthurSens on checking alternatives, and @imorph for trying those out ❤️

There is also #1642, which discusses more complex concurrency algorithms (adaptive locks); we can continue the discussion there.

shivanthzen pushed a commit to shivanthzen/client_golang that referenced this pull request Nov 13, 2024
* add: exponential backoff for CAS operations of floats

Signed-off-by: Ivan Goncharov <[email protected]>

* add: some more benchmark use cases (higher contention)

Signed-off-by: Ivan Goncharov <[email protected]>

* fmt: fumpted some files

Signed-off-by: Ivan Goncharov <[email protected]>

* add: license header

Signed-off-by: Ivan Goncharov <[email protected]>

* add: comment explaining origin of backoff constants

Signed-off-by: Ivan Goncharov <[email protected]>

---------

Signed-off-by: Ivan Goncharov <[email protected]>
@ywwg (Member) commented Nov 19, 2024

I think this change broke histogram_test for me: the test hangs, does not complete after 10 minutes, and gets killed. I raised the issue in the CNCF Slack, and Bryan noted that all the time is spent in the Sleep: https://cloud-native.slack.com/archives/C037NMVUAP3/p1732032254779259

Reverting this change fixes the issue for me.

@ywwg (Member) commented Nov 19, 2024

Yes, it appears that the test is highly contended and the sleeps are overly aggressive in trying to resolve that contention, causing the runtime to be much longer than it needs to be.

@ywwg (Member) commented Nov 19, 2024

I believe adding a sleep to the observe() func in the TestHistogramAtomicObserve test would resolve the issue. @bwplotka, can you take a look?
