Add: exponential backoff for CAS operations on floats #1661
Conversation
Most likely it fails because in the By… I might be wrong.
Prom histograms have historically been slow relative to other histogram implementations.
I plan to review this PR this weekend. Sorry for the tardiness.
Performance PRs are accumulating across different repos. I'm taking this weekend to study and unblock them. For this PR, I had to read more about sync/atomic. I wasn't familiar with it, which is probably why I was procrastinating the review here so much. Apologies!
Downsides
Everything comes at a price, so in this case the code is slightly more complex
What you are doing is just an exponential backoff; it doesn't sound too complex :)
The complicated part for me was understanding the already existing code around atomic additions and updates.
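(For context, the pre-existing pattern being referred to is, roughly, a tight CAS loop over the float's bit representation. A minimal sketch of that idea, with illustrative names rather than the exact library code:)

```go
package prometheus

import (
	"math"
	"sync/atomic"
)

// addFloat atomically adds v to the float64 stored (as raw bits) in valBits.
// Classic lock-free pattern: load, compute the new value, compare-and-swap,
// and retry from scratch if another goroutine won the race.
func addFloat(valBits *uint64, v float64) {
	for {
		oldBits := atomic.LoadUint64(valBits)
		newBits := math.Float64bits(math.Float64frombits(oldBits) + v)
		if atomic.CompareAndSwapUint64(valBits, oldBits, newBits) {
			return
		}
	}
}
```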
some magic constants introduced (10ms and 320ms -- mostly the result of empirical testing)
I'm fine with keeping these numbers; even if other machines might perform better with slightly different values, overall it should be an improvement for all of them. But let's add a comment explaining where they come from. You can even link your PR description in the code comment so people can easily look at your benchmarks.
single-threaded performance weaker by 4%-5%
That should be fine; single-threaded environments are rare nowadays.
time.Sleep introducing an additional syscall (? not sure about that)
This is only called if the first CAS fails, which doesn't happen on the happy path.
I couldn't understand what you mean by "additional syscall". Are you concerned that we're doing any kind of syscall that we weren't doing before? Or the number of syscalls that are potentially increasing?
prometheus/atomic_update.go
```go
const (
	maxBackoff     = 320 * time.Millisecond
	initialBackoff = 10 * time.Millisecond
)
```
Could we add a comment explaining that those numbers were chosen based on empirical testing in one particular environment (AWS c7i.2xlarge)?
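(A sketch of what such a comment could look like; the wording below is only a suggestion, not the comment that was eventually merged:)

```go
// Backoff constants for retrying a failed CAS on the float value.
// The values come from empirical testing in a single environment
// (AWS c7i.2xlarge); see the benchmarks in the description of PR #1661.
// Other machines may prefer slightly different values, but these were an
// improvement across the contention levels tested there.
const (
	initialBackoff = 10 * time.Millisecond
	maxBackoff     = 320 * time.Millisecond
)
```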
It is only the happy path in the single-threaded case; as soon as we have contending goroutines trying to CAS the atomic value, we inevitably get a lot of sleeps, which produce (still unsure =) ) syscalls, and that is not free. Some projects are sensitive to an increasing number of syscalls; the latest example, from the same folks at cockroachdb: cockroachdb/cockroach#133315
Yes, exactly that.
Golang's sleep implementation surely does syscalls; how much of a problem that is, I'm not sure. One suggestion from the link you provided is to use …
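(One commonly suggested middle ground, which the PR description also notes was tried without impressive results, is to yield the scheduler for the first few failed attempts and only fall back to time.Sleep afterwards. A rough sketch under those assumptions, not the code that was merged:)

```go
package prometheus

import (
	"runtime"
	"time"
)

// casBackoff is called after a failed CAS attempt. For the first few retries
// it only yields the processor (no timer syscall involved); after that it
// sleeps, doubling the duration up to a cap. All thresholds are illustrative.
func casBackoff(attempt int, sleep time.Duration) time.Duration {
	const maxSleep = 320 * time.Millisecond
	if attempt < 3 {
		runtime.Gosched() // cheap cooperative yield
		return sleep
	}
	time.Sleep(sleep)
	sleep *= 2
	if sleep > maxSleep {
		sleep = maxSleep
	}
	return sleep
}
```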
Love it: super clean, well benchmarked, and a clear benefit. There is a slight slowdown for a single thread indeed, but it's negligible vs. the benefits for more goroutines.
LGTM, thanks!
```go
	output = val
}

func BenchmarkAtomicUpdateFloat_1Goroutine(b *testing.B) {
```
It feels like we could use b.Run here (e.g. "goroutines=%v"), as I was proposing in https://www.bwplotka.dev/2024/go-microbenchmarks-benchstat/. Any good reasons NOT to do it?
Nevertheless, we can do it later; not a blocker for this PR, thanks!
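(For reference, the b.Run shape being suggested would look roughly like the sketch below. It assumes the helper added in this PR is named atomicUpdateFloat and takes the bit-representation pointer plus an update function; treat the exact signature as an assumption, not a quote from the PR.)

```go
package prometheus

import (
	"fmt"
	"sync"
	"testing"
)

// One benchmark with a sub-benchmark per contention level, so benchstat can
// compare the "goroutines=N" cases directly. Sketch only; helper name and
// signature are assumed rather than copied from the PR.
func BenchmarkAtomicUpdateFloat(b *testing.B) {
	for _, goroutines := range []int{1, 2, 4, 8, 16, 32} {
		b.Run(fmt.Sprintf("goroutines=%v", goroutines), func(b *testing.B) {
			var valBits uint64
			b.ResetTimer()
			var wg sync.WaitGroup
			for g := 0; g < goroutines; g++ {
				wg.Add(1)
				go func() {
					defer wg.Done()
					for i := 0; i < b.N/goroutines; i++ {
						atomicUpdateFloat(&valBits, func(f float64) float64 { return f + 1.0 })
					}
				}()
			}
			wg.Wait()
		})
	}
}
```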
Epic work @ArthurSens on checking alternatives, and @imorph for trying those out ❤️ There is also #1642, which discusses more complex algorithms for concurrency (adaptive locks); we can continue the discussion there.
* add: exponential backoff for CAS operations of floats
* add: some more benchmark use cases (higher contention)
* fmt: fumpted some files
* add: license header
* add: comment explaining origin of backoff constants

Signed-off-by: Ivan Goncharov <[email protected]>
I think this change broke histogram_test for me -- the test hangs, does not complete after 10 minutes, and gets killed. I raised the issue in the CNCF Slack, and Bryan noted that all the time is spent in the Sleep: https://cloud-native.slack.com/archives/C037NMVUAP3/p1732032254779259 Reverting this change fixes the issue for me.
Yes, it appears that the test is highly contended and the sleeps are overly aggressive in trying to resolve that contention, causing the runtime to be much longer than it needs to be.
I believe adding a sleep to the observe() func in the TestHistogramAtomicObserve test would resolve the issue. @bwplotka, can you take a look?
What problem this PR is solving
Hi!
Issues like cockroachdb/cockroach#133306 pushed me to investigate whether it is possible to optimize the client metrics library in terms of CPU and performance, especially under contention. That is mostly what this PR is about.
Proposed changes
Add a form of exponential backoff to the tight CAS-operation loops, which should decrease contention and, as a result, lead to better latencies and lower CPU consumption. This is a well-known trick used in many projects that deal with concurrency.
In addition, this logic was refactored into a single place, because the atomic-update-of-float64 pattern occurs in different parts of the codebase.
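A minimal sketch of the approach (the constants match the file shown in this PR; the exact merged signature and code may differ slightly):

```go
package prometheus

import (
	"math"
	"sync/atomic"
	"time"
)

const (
	// Values chosen empirically on an AWS c7i.2xlarge instance (see Results).
	initialBackoff = 10 * time.Millisecond
	maxBackoff     = 320 * time.Millisecond
)

// atomicUpdateFloat applies updateFunc to the float64 stored (as raw bits) in
// bits. The first CAS attempt happens immediately, so the uncontended happy
// path pays nothing extra; only after a failed CAS does the goroutine sleep,
// doubling the backoff up to maxBackoff, which drains contention instead of
// burning CPU in a tight retry loop.
func atomicUpdateFloat(bits *uint64, updateFunc func(float64) float64) {
	backoff := initialBackoff
	for {
		loadedBits := atomic.LoadUint64(bits)
		newBits := math.Float64bits(updateFunc(math.Float64frombits(loadedBits)))
		if atomic.CompareAndSwapUint64(bits, loadedBits, newBits) {
			return
		}
		time.Sleep(backoff)
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}
```

Capping the doubling at maxBackoff keeps worst-case latency bounded while still spreading retries apart under heavy contention.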
Results
All test results are from an AWS c7i.2xlarge instance.
main vs proposed for histograms:
The changes lead to a slight increase in latency in the single-threaded use case but significantly decrease it in the contended case.
main vs proposed for summaries:
The same is true for summaries as well.
Some additional illustrations:
The gap between the implementations widens as the number of parallel/contending goroutines grows.
A side-by-side comparison of the backoff/no-backoff implementations is in the atomic_update_test.go file.

CPU profiles
for `go test -bench='BenchmarkHistogramObserve*' -run=^# -count=6`:
old top:
new top:
Downsides
Everything comes at a price, so in this case:
* code is slightly more complex
* some magic constants introduced (10ms and 320ms -- mostly the result of empirical testing)
* single-threaded performance weaker by 4%-5%
* time.Sleep introducing an additional syscall (? not sure about that)

Further improvements
Tried runtime.Gosched() first and only then time.Sleep, but it was not looking impressive from a results point of view.

@vesari
@ArthurSens
@bwplotka
@kakkoyun