perf: Move heavy object deallocation off the critical Execution path #3241

Open · wants to merge 4 commits into master
Conversation

alin-at-dfinity (Contributor)

Reuse State Manager's deallocator thread in Execution, moving up to half the work under heavy load onto a background thread.
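
For context, a minimal sketch of the mechanism (not the actual implementation): complex objects are boxed, sent over a channel to a dedicated thread, and dropped there with a short pause between drops. The `DeallocatorThread` name and the `new(name, factor)` shape mirror the PR; everything else below is illustrative.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Duration;

/// Sketch of a background deallocator: objects sent to it are dropped on a
/// dedicated thread, with a short sleep between drops, so deallocation cost
/// is spread out instead of landing on the scheduler threads.
pub struct DeallocatorThread {
    sender: Sender<Box<dyn Send>>,
}

impl DeallocatorThread {
    /// `drops_per_second` is the "factor" discussed below; e.g. 10_000
    /// translates to a ~100 µs pause between successive drops.
    pub fn new(name: &str, drops_per_second: u32) -> Self {
        let (sender, receiver) = channel::<Box<dyn Send>>();
        let pause = Duration::from_secs(1) / drops_per_second;
        thread::Builder::new()
            .name(name.to_string())
            .spawn(move || {
                // The loop ends once all senders have been dropped.
                for object in receiver {
                    drop(object);
                    thread::sleep(pause);
                }
            })
            .expect("failed to spawn background deallocation thread");
        Self { sender }
    }

    /// Hands an object off for background deallocation. If the thread is
    /// gone, the object is simply dropped here, synchronously.
    pub fn send(&self, object: Box<dyn Send>) {
        let _ = self.sender.send(object);
    }
}
```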

@github-actions bot added the perf label on Dec 18, 2024
alin-at-dfinity (Contributor, Author)

Before and after CPU profiles while running a best-effort message benchmark. Note how the big chunk of PageMap drops moves from the scheduler threads to a new ExecutionDeallocator thread.
[Screenshot from 2024-12-13 13-52-06]
[Screenshot from 2024-12-15 19-29-24]

adambratschikaye (Contributor) left a comment

Nice, thanks!

@@ -418,6 +421,7 @@ impl ExecutionEnvironment {
Arc::clone(&ingress_history_writer),
fd_factory,
);
let deallocator_thread = DeallocatorThread::new("ExecutionDeallocator", 10000);
Contributor

How did you decide on 10k?

alin-at-dfinity (Contributor, Author), Dec 19, 2024

You hit the nail on the head. (o:

The State Manager deallocator thread would originally sleep for 1 ms between successive drops. So I used 1000 (as in 1 s / 1000) as the factor there, to preserve the behavior.

I initially started with 1000 here too, but while running my (very heavy) benchmarks I noticed that half the deallocations still happened on the scheduler threads. So I just bumped it by 10x because I wanted to get some results quickly and one benchmark run would take on the order of 15 minutes.

On the one hand, I did not immediately notice any negative effects from sleeping 100 µs between drops. And under extremely heavy load we are likely to both need and benefit from deallocating lots of large objects in the background. Under normal load, the 10k factor neither helps nor hurts.

That being said, I can try to fine-tune it. It's quite likely that we can reduce the factor (and increase the interval) by at least 2x without negative effects on throughput. (The peak message throughput I saw in my testing was about 2.5k messages/second, so assuming 2 objects to deallocate per message, we could likely do with 5k objects sent to the deallocator thread per second. At 1 object per message, 2.5k would do.)
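
For reference, the arithmetic behind the factor, assuming it is interpreted as drops per second (a small illustration, not code from the PR):

```rust
use std::time::Duration;

/// Pause between successive drops for a given factor (drops per second).
fn pause_between_drops(factor: u32) -> Duration {
    Duration::from_secs(1) / factor
}

// factor  1_000 -> 1 ms    (the original State Manager behavior)
// factor  3_000 -> ~333 µs
// factor  5_000 -> 200 µs
// factor 10_000 -> 100 µs
//
// Capacity check: at ~2.5k messages/s and ~2 large objects per message,
// roughly 5k drops/s must fit through the deallocator thread.
```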

alin-at-dfinity (Contributor, Author)

I've done a couple of runs, with 3k and 5k. In both cases, some SystemStates and ExecutionStates are dropped synchronously, because the channel to the deallocator thread becomes backlogged. With 3k (333 µs sleep) it's about 20%; with 5k (200 µs sleep) it's about 13%.
[Screenshot from 2024-12-19 11-52-37]
[Screenshot from 2024-12-19 11-52-50]
FWIW, I've never noticed any synchronous drops with 10k (100 µs sleep). But then the overhead to execution with either 3k or 5k (12% and 9%, respectively; and only under very heavy load) is not really meaningful.
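
A hedged sketch of where those synchronous drops could come from: with a bounded channel, a backlogged deallocator makes `try_send` fail and the object is dropped on the calling thread instead (the PR's actual fallback logic may differ):

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Illustrative only: attempt background deallocation, falling back to a
/// synchronous drop on the calling (scheduler) thread when the channel to
/// the deallocator thread is full or disconnected.
fn send_or_drop(sender: &SyncSender<Box<dyn Send>>, object: Box<dyn Send>) {
    match sender.try_send(object) {
        Ok(()) => {} // will be dropped on the deallocator thread
        Err(TrySendError::Full(object)) | Err(TrySendError::Disconnected(object)) => {
            drop(object); // dropped right here, synchronously
        }
    }
}
```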

Contributor

The code needs a comment explaining why 10000, and when to readjust it.

alin-at-dfinity (Contributor, Author)

I will add a doc comment, thanks for the suggestion.

adambratschikaye (Contributor)

What's the benchmark that the flamegraphs are coming from?

alin-at-dfinity (Contributor, Author)

> What's the benchmark that the flamegraphs are coming from?

The benchmark is simply a run of ict test //rs/tests/message_routing/xnet:xnet_slo_3_subnets_test from a branch combining #2995 (tweaking the replica and test to run with best-effort messages) and #3153 (various performance improvements, including this one). The test will fail, but the point is to produce enough load to test the callback and best-effort message memory limits.

rs/utils/thread/src/deallocator_thread.rs (resolved discussion)
@@ -345,6 +347,7 @@ pub struct ExecutionEnvironment {
// parallel and potentially reserving resources. It should be initialized to
// the number of scheduler cores.
resource_saturation_scaling: usize,
deallocator_thread: DeallocatorThread,
Member

As discussed offline, it would be better to flush this when dropping an ExecutionEnvironment during tests.

alin-at-dfinity (Contributor, Author)

Should I just impl Drop for ExecutionEnvironment or do you have something more specific in mind?

Member

Yes, exactly. You can see the same pattern in the state manager.
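
For illustration, the suggested pattern might look roughly like the following; the `flush()` method name is an assumption here, not necessarily the state manager's actual API:

```rust
impl Drop for ExecutionEnvironment {
    fn drop(&mut self) {
        // Hypothetical flush: block until everything already queued on the
        // deallocator thread has been dropped, so tests tearing down an
        // ExecutionEnvironment don't race with background deallocation.
        self.deallocator_thread.flush();
    }
}
```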

derlerd-dfinity (Contributor), Dec 20, 2024

The test failures on the PR also suggest that flushing the thread when dropping ExecutionEnvironment in tests could be a good idea (I didn't look into the failures in detail, so it might well be something else; it just sounds like a plausible cause).


/// A thread that deallocates complex objects in the background. It spreads the
/// cost of deallocation over a longer period of time, to avoid long pauses.
pub struct DeallocatorThread {
Contributor

I really like that this is encapsulated now! One question that crossed my mind: do we also need some tests?
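
One possible shape for such a test (a sketch only; it assumes the `send` API from the PR plus a hypothetical `flush()` that drains the queue):

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    /// Sets a flag when dropped, so the test can observe the deallocation.
    struct DropFlag(Arc<AtomicBool>);

    impl Drop for DropFlag {
        fn drop(&mut self) {
            self.0.store(true, Ordering::SeqCst);
        }
    }

    #[test]
    fn deallocates_in_background() {
        let dropped = Arc::new(AtomicBool::new(false));
        let deallocator = DeallocatorThread::new("TestDeallocator", 10_000);

        deallocator.send(Box::new(DropFlag(Arc::clone(&dropped))));
        // Hypothetical: wait until the queued objects have been dropped.
        deallocator.flush();

        assert!(dropped.load(Ordering::SeqCst));
    }
}
```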


})
.expect("failed to spawn background deallocation thread"),
);
let deallocator_thread = DeallocatorThread::new("StateDeallocation", 1000);
Contributor

Should we pass a duration so that the numbers are less cryptic?
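
For instance, a hypothetical signature change (illustrative, not what the PR currently does):

```rust
use std::time::Duration;

impl DeallocatorThread {
    /// Hypothetical alternative: take the pause between drops directly, so
    /// call sites read e.g.
    /// `DeallocatorThread::new("StateDeallocation", Duration::from_millis(1))`
    /// instead of a bare factor such as `1000`.
    pub fn new(name: &str, pause_between_drops: Duration) -> Self {
        // ...spawn the background thread, sleeping `pause_between_drops`
        // between successive drops...
        unimplemented!("sketch only")
    }
}
```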
