perf: Move heavy object deallocation off the critical Execution path #3241
base: master
Conversation
Reuse State Manager's deallocator thread in Execution, moving up to half the work under heavy load onto a background thread.
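For context, here is a minimal sketch of the general pattern the PR relies on — objects are sent over a channel to a background thread that drops them, with a short sleep between drops to spread out the cost. This is not the actual `DeallocatorThread` code; the names, types and the `drops_per_second` parameter are illustrative only.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Duration;

/// Illustrative background deallocator: anything sent into the channel is
/// dropped on the spawned thread instead of on the caller's thread.
struct BackgroundDeallocator {
    sender: Sender<Box<dyn Send>>,
}

impl BackgroundDeallocator {
    /// `drops_per_second` plays the role of the 1000 / 10000 factors discussed
    /// below: the thread sleeps 1 s / drops_per_second between successive drops.
    fn new(name: &str, drops_per_second: u32) -> Self {
        let (sender, receiver) = channel::<Box<dyn Send>>();
        let pause = Duration::from_secs(1) / drops_per_second;
        thread::Builder::new()
            .name(name.to_string())
            .spawn(move || {
                // Every received object is dropped here, off the critical path.
                for object in receiver {
                    drop(object);
                    thread::sleep(pause);
                }
            })
            .expect("failed to spawn background deallocation thread");
        Self { sender }
    }

    /// Hand an object off to be deallocated in the background.
    fn send(&self, object: Box<dyn Send>) {
        // If the thread is gone for some reason, the object is dropped here instead.
        let _ = self.sender.send(object);
    }
}

fn main() {
    let deallocator = BackgroundDeallocator::new("ExampleDeallocator", 10_000);
    deallocator.send(Box::new(vec![0u8; 1 << 20])); // dropped in the background
    thread::sleep(Duration::from_millis(10)); // give the background thread a moment
}
```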
Nice, thanks!
@@ -418,6 +421,7 @@ impl ExecutionEnvironment {
    Arc::clone(&ingress_history_writer),
    fd_factory,
);
let deallocator_thread = DeallocatorThread::new("ExecutionDeallocator", 10000);
How did you decide on 10k?
You hit the nail on the head. (o:
The State Manager deallocator thread would originally sleep for 1 ms between successive drops, so I used a factor of `1000` (as in 1 s / 1000) there, to preserve that behavior.
I initially started with 1000 here too, but while running my (very heavy) benchmarks I noticed that half the deallocations still happened on the scheduler threads. So I just bumped it by 10x, because I wanted to get some results quickly and one benchmark run would take on the order of 15 minutes.
On the one hand, I did not immediately notice any negative effects from sleeping 100 µs between drops. And we are likely to both benefit from and need to deallocate lots of large objects under extremely heavy load. Under normal load, the 10k will neither help nor hurt.
That being said, I can try to fine tune it. It's quite likely that we can reduce the factor (and increase the interval) by at least 2x without negative effects on throughput. (The peak message throughput I saw in my testing was about 2.5k messages/second, so assuming 2 objects to deallocate per message, we could likely do with 5k objects sent to the deallocator thread per second. At 1 object per message, 2.5k would do.)
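A quick back-of-the-envelope check of the numbers above (the figures come from the comment; the little helper is purely illustrative):

```rust
fn main() {
    // The sleep between successive drops is 1 s divided by the constructor's factor.
    let interval_us = |factor: u64| 1_000_000 / factor;
    assert_eq!(interval_us(1_000), 1_000); // State Manager's original 1 ms
    assert_eq!(interval_us(10_000), 100); // the 10k used here: 100 µs

    // Headroom: ~2.5k messages/s at peak, assuming up to 2 large objects per message.
    let required_drops_per_s = 2_500 * 2; // 5_000 objects/s
    assert!(10_000 >= required_drops_per_s); // the 10k factor leaves roughly 2x headroom
}
```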
I've done a couple of runs, with 3k and 5k. In both cases, some `SystemStates` and `ExecutionStates` are being dropped synchronously, because the channel to the deallocator thread becomes backlogged. With 3k (333 µs sleep) it's about 20%; with 5k (200 µs sleep) it's about 13%.
FWIW, I've never noticed any synchronous drops with 10k (100 µs sleep). But then the resulting overhead to execution (12% and 9%, respectively; and only under very heavy load) is not really meaningful with either 3k or 5k.
It needs a comment in the code explaining why 10000, and when to readjust it.
I will add a doc comment, thanks for the suggestion.
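Something along these lines, perhaps — the wording is just a suggestion, mirroring the numbers discussed above:

```rust
// Suggested wording only; the numbers mirror the discussion in this thread.
//
// A factor of 10k means the deallocator thread sleeps 100 µs between drops,
// i.e. it can retire up to 10k objects per second. Peak observed throughput
// was ~2.5k messages/s with up to 2 large objects to drop per message
// (~5k drops/s), so 10k leaves about 2x headroom. If synchronous drops of
// `SystemState` / `ExecutionState` start showing up in profiles, or peak
// message throughput grows significantly, revisit this factor.
let deallocator_thread = DeallocatorThread::new("ExecutionDeallocator", 10000);
```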
What's the benchmark that the flamegraphs are coming from?
The benchmark is simply a run of
@@ -345,6 +347,7 @@ pub struct ExecutionEnvironment {
// parallel and potentially reserving resources. It should be initialized to
// the number of scheduler cores.
resource_saturation_scaling: usize,
deallocator_thread: DeallocatorThread,
As discussed offline, it would be better to flush this when dropping an `ExecutionEnvironment` during tests.
Should I just `impl Drop for ExecutionEnvironment`, or do you have something more specific in mind?
Yes, exactly. You can see the same pattern in the state manager.
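Roughly along these lines, assuming `DeallocatorThread` exposes (or gains) a blocking `flush()` — the method name is a guess here, and the state manager pattern referenced above may differ in detail:

```rust
impl Drop for ExecutionEnvironment {
    fn drop(&mut self) {
        // Make sure everything handed to the background thread is actually
        // dropped before the environment goes away, so tests observe
        // deterministic cleanup. `flush()` is assumed to block until the
        // deallocation channel is drained.
        self.deallocator_thread.flush();
    }
}
```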
The test failures on the PR also seem to suggest that flushing the thread on dropping `ExecutionEnvironment` for tests could be a good idea (note that I didn't look into the failures in detail, so it might well be something else; just saying this sounds like a plausible cause).
/// A thread that deallocates complex objects in the background. It spreads the
/// cost of deallocation over a longer period of time, to avoid long pauses.
pub struct DeallocatorThread {
I really like that this is encapsulated now! One question that crossed my mind: do we also need some tests?
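A basic test could cover the "work handed off to the thread is eventually dropped" property, along the lines of the sketch below. The `send()` / `flush()` calls are assumed APIs; adjust to whatever the real `DeallocatorThread` exposes.

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    /// Flips a flag when dropped, so the test can observe the deallocation.
    struct DropFlag(Arc<AtomicBool>);
    impl Drop for DropFlag {
        fn drop(&mut self) {
            self.0.store(true, Ordering::SeqCst);
        }
    }

    #[test]
    fn object_is_dropped_by_deallocator_thread() {
        let dropped = Arc::new(AtomicBool::new(false));
        let thread = DeallocatorThread::new("TestDeallocator", 10_000);

        // Hand the object off to the deallocator thread (assumed `send()` API).
        thread.send(Box::new(DropFlag(Arc::clone(&dropped))));

        // After flushing (assumed blocking `flush()`), the object must be gone.
        thread.flush();
        assert!(dropped.load(Ordering::SeqCst));
    }
}
```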
})
.expect("failed to spawn background deallocation thread"),
);
let deallocator_thread = DeallocatorThread::new("StateDeallocation", 1000);
Should we pass a duration so that the numbers are less cryptic?
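For example, the constructor could take the pause between drops directly, so call sites read as a time interval rather than a bare factor. A stubbed sketch of that alternative signature (names and shape assumed, not the actual code):

```rust
use std::time::Duration;

// Sketch only: a constructor that takes the pause between drops directly.
pub struct DeallocatorThread {
    // fields elided in this sketch
}

impl DeallocatorThread {
    pub fn new(name: &str, pause_between_drops: Duration) -> Self {
        // The real implementation would spawn the named thread and sleep
        // `pause_between_drops` after each drop; stubbed out here.
        let _ = (name, pause_between_drops);
        DeallocatorThread {}
    }
}

fn main() {
    // Equivalent to the factors 1000 and 10000 used in the current PR:
    let _state = DeallocatorThread::new("StateDeallocation", Duration::from_millis(1));
    let _execution = DeallocatorThread::new("ExecutionDeallocator", Duration::from_micros(100));
}
```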