I used to work at places that specialize in Ultra Low Latency (ULL) development: stock exchanges, high-frequency trading firms, and the like. You end up having to bypass good abstractions and libraries because of the overhead. A good example: it is rare for ULL code to do a context switch (takes too long). Single-threaded only. If you want to communicate w/ another process, you use shared memory. ULL code rarely calls epoll/select either. Best to avoid the kernel as well (ef_vi, DPDK, OpenOnload, etc.). So you end up doing some sort of `while(!hasNewData()) {}` spinning event loop, waiting for memory to change, and when it does you enter callback hell. The ergonomics suck. Or just give up and learn Verilog, put your code on the card w/ the NIC and skip the PCIe bus/kernel/stack altogether. Your shiny CPU is now just branch-predicting a glorified logger.
I recently did another project for a company in an adjacent industry and used Kotlin coroutines (not ULL, but still latency sensitive). It was great to write synchronous-looking code that was concurrent. Coroutines made the code so much easier to write / reason about / read.
I have another ULL project coming up, so before that project happens I wanted to measure the overhead of coroutines vs callbacks. See how much the abstraction costs. See if the costs are a clear "no".
I didn't want the root event/IO being waited on to dominate the results, which ruled out networking IO. I decided to use shared memory instead and do a ping-pong test. The client puts a `u64` in memory location one, and times how long it takes that `u64` number to show up in memory location two. The server spins waiting for location one to change value; when it does, it copies the new value to location two, then loops. Here is the server-side Zig code and the resulting asm according to Compiler Explorer:
```zig
fn runLoop( clientPtr : *const u64, serverPtr : *u64 ) void {
    var lastValue : u64 = 0;
    while(true) {
        lastValue = spinUntilChange( clientPtr, lastValue );
        @atomicStore(u64, serverPtr, lastValue, std.builtin.AtomicOrder.Monotonic );
    }
}

fn spinUntilChange( spinPtr:*const u64, lastValue:u64) callconv(.Inline) u64 {
    var newValue = lastValue;
    while( newValue == lastValue ) {
        std.atomic.spinLoopHint();
        newValue = @atomicLoad(u64, spinPtr, std.builtin.AtomicOrder.Monotonic );
    }
    return newValue;
}
```
```asm
runLoop:
        xor     eax, eax
.LBB0_1:
        pause
        mov     rcx, qword ptr [rdi]
        cmp     rcx, rax
        je      .LBB0_1
        mov     qword ptr [rsi], rcx
        mov     rax, rcx
        jmp     .LBB0_1
```
In C++, Rust & Zig I implemented the pointer as both `volatile` and `Atomic`. It made no difference in the asm output. If you remove the `spinLoopHint()` the AMD times get worse by ~50%. It seems to make no difference on Intel, so I left it in.
There is only one client; it's written in Rust and it's a spin loop, mostly because the criterion benchmarking crate is really great at this kind of micro-benchmark work. The asm for the client is similar to the server above.
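To make the ping-pong concrete, here is a minimal Rust sketch of the idea. It is not the repo's client: it assumes two in-process `AtomicU64` slots and a second thread standing in for the shared-memory server, and the names `ping`, `echo_server` and `run_ping_pong` are made up for illustration. `Ordering::Relaxed` plays the role of Zig's `Monotonic`.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Client side: write `value` to location one, then spin until the same
/// value shows up in location two. Returns the round-trip time.
fn ping(client_slot: &AtomicU64, server_slot: &AtomicU64, value: u64) -> Duration {
    let start = Instant::now();
    client_slot.store(value, Ordering::Relaxed);
    while server_slot.load(Ordering::Relaxed) != value {
        std::hint::spin_loop();
    }
    start.elapsed()
}

/// Stand-in server: copy every change of location one into location two.
/// The `stop` flag exists only so this sketch can terminate.
fn echo_server(client_slot: &AtomicU64, server_slot: &AtomicU64, stop: &AtomicBool) {
    let mut last = 0u64;
    while !stop.load(Ordering::Relaxed) {
        let next = client_slot.load(Ordering::Relaxed);
        if next != last {
            server_slot.store(next, Ordering::Relaxed);
            last = next;
        }
        std::hint::spin_loop();
    }
}

/// Run `rounds` ping-pongs against an echo thread and collect the timings.
fn run_ping_pong(rounds: u64) -> Vec<Duration> {
    let client_slot = Arc::new(AtomicU64::new(0));
    let server_slot = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));

    let server = {
        let (c, s, st) = (client_slot.clone(), server_slot.clone(), stop.clone());
        thread::spawn(move || echo_server(&c, &s, &st))
    };

    // Values start at 1 because both slots start at 0.
    let times: Vec<Duration> = (1..=rounds)
        .map(|v| ping(&client_slot, &server_slot, v))
        .collect();

    stop.store(true, Ordering::Relaxed);
    server.join().unwrap();
    times
}

fn main() {
    let times = run_ping_pong(100);
    let fastest = times.iter().min().unwrap();
    println!("fastest round trip: {:?}", fastest);
}
```

Threads in one process share a cache hierarchy differently than two pinned processes over `mmap`, so the numbers from this sketch are not comparable to the results discussed here; it only shows the shape of the measurement.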
For the coroutine server code, this is the kotlin version which is probably the easiest to follow :
```kotlin
suspend fun asyncLoop() {
    var value = 0L
    while(true) {
        value = suspendPoint.suspendMe()
        writeBuf.writeLong(0,value)
    }
}

fun run() {
    var nextValue = 0L;
    var lastValue = 0L;
    while(true) {
        while(nextValue == lastValue) {
            java.lang.Thread.onSpinWait();
            nextValue = readBuf.readLong(0)
        }
        suspendPoint.resume( nextValue )
        lastValue = nextValue
    }
}
```
`asyncLoop()` is called first, then `run()`, both from `main()`. It's odd to see two `while(true)` loops in single-threaded code. The magic of coroutines.
All the coroutine versions were written the same way. The coroutine code suspends while the event loop spins on memory. When the coroutine code resumes w/ the new value, it writes it.
The code flow for the coroutine code looks like this :
```mermaid
stateDiagram
    state "Client Measures This" as time
    state time {
        SpinLoop --> Resume
        Resume --> WriteValue
    }
    WriteValue --> Suspend
    Suspend --> SpinLoop
```
Since the client is measuring the time it takes from `SpinLoop` to `WriteValue`, it most likely isn't measuring the time `Suspend` takes. There is no reason to think the time to `Resume` is the same as to `Suspend`.
If the event loop is dispatching the same event to multiple listeners, or has multiple events to dispatch, you bear the cost of both the suspend and resume for each event.
So I wrote two versions of each coroutine. The version above is called the `Resume` version. The `Suspend` version's code flow looks like:
```mermaid
stateDiagram
    state "Client Measures This" as time
    state time {
        SpinLoop --> Suspend
        Suspend --> WriteValue
    }
    WriteValue --> Resume
    Resume --> SpinLoop
```
In the `Suspend` version the `SpinLoop` is in the coroutine code block, and the writing of the value is in the `eventLoop`.
The bare spin loop isn't a fair comparison to the coroutine code, so I added a callback version. I didn't write the reciprocal version of the callback, like I did with the coroutine code. I'm just going to assume returning from a callback is cheap.
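For readers who want to picture the callback variant, here is a rough Rust sketch. It is an assumption about the general shape, not the code from the repo: the `EventLoop` struct, its fields and `run_once` are hypothetical names, and a boxed `FnMut` closure stands in for whatever callback mechanism each language uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical event loop: spins on `read_slot` and hands each new value
/// to a registered callback.
struct EventLoop<'a> {
    read_slot: &'a AtomicU64,
    on_change: Box<dyn FnMut(u64) + 'a>,
    last: u64,
}

impl<'a> EventLoop<'a> {
    fn new(read_slot: &'a AtomicU64, on_change: Box<dyn FnMut(u64) + 'a>) -> Self {
        EventLoop { read_slot, on_change, last: 0 }
    }

    /// Spin until the slot changes, then dispatch to the callback once.
    fn run_once(&mut self) {
        let mut next = self.last;
        while next == self.last {
            std::hint::spin_loop();
            next = self.read_slot.load(Ordering::Relaxed);
        }
        (self.on_change)(next); // the part the client ends up timing
        self.last = next;
    }
}

fn main() {
    let read_slot = AtomicU64::new(0);
    let write_slot = AtomicU64::new(0);

    // The callback does the "server" work: copy the value to location two.
    let mut event_loop = EventLoop::new(
        &read_slot,
        Box::new(|v: u64| write_slot.store(v, Ordering::Relaxed)),
    );

    read_slot.store(42, Ordering::Relaxed); // pretend the client wrote a value
    event_loop.run_once();
    assert_eq!(write_slot.load(Ordering::Relaxed), 42);
}
```

The boxed closure forces an indirect call, which is roughly what a lambda behind `std::function` does in C++. A generic `F: FnMut(u64)` parameter would let the compiler inline the callback instead, which changes what is being measured.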
I used the machines I have at home for this. One is an Intel `i7-8700K`, the other is a newer AMD `Ryzen 5 5600G`. Both CPUs have 6 cores, 2 CPUs (hardware threads) per core. The kernel has the `isolcpus` param set to isolate cores 5 & 6 (CPUs 4, 5, 10 & 11). The client is pinned to CPU 4, and the server to CPU 5. The AMD is headless, so it's easy to strip down. The Intel is my dev box, so I'd boot to multi-user when running tests. Both run Ubuntu 20.04.
Each test did a 3 second warm-up, then ran for 5 seconds. 1K samples are taken from each run. You can read about criterion's sampling methodology here. Criterion does graphing out of the box, but I love R's ggplot. Even with isolated CPUs, there was a fair bit of jitter between runs on the Intel. I ran all the benchmarks multiple times, and took the best run for each benchmark. "Best" is the run w/ the smallest value for the upper bound of the 95% confidence interval.
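The "best run" rule is simple enough to write down. A small sketch, assuming each run has already been reduced to the two bounds of its 95% confidence interval; the `RunSummary` struct and its field names are hypothetical, and the numbers in `main` are invented purely to exercise the function.

```rust
/// Summary of one benchmark run (hypothetical struct; values in nanoseconds).
#[derive(Debug, Clone, PartialEq)]
struct RunSummary {
    label: String,
    ci_lower_ns: f64,
    ci_upper_ns: f64,
}

/// Pick the run whose upper 95% confidence bound is smallest.
/// Returns `None` when there are no runs.
fn best_run(runs: &[RunSummary]) -> Option<&RunSummary> {
    runs.iter()
        .min_by(|a, b| a.ci_upper_ns.total_cmp(&b.ci_upper_ns))
}

fn main() {
    // Made-up numbers, not measurements from this benchmark.
    let runs = vec![
        RunSummary { label: "run-1".into(), ci_lower_ns: 60.0, ci_upper_ns: 75.0 },
        RunSummary { label: "run-2".into(), ci_lower_ns: 61.0, ci_upper_ns: 66.0 },
        RunSummary { label: "run-3".into(), ci_lower_ns: 58.0, ci_upper_ns: 90.0 },
    ];
    let best = best_run(&runs).unwrap();
    println!("best run: {}", best.label);
}
```

Choosing by the upper bound rather than the mean favors the run with the least noise on the slow side, which fits the goal of filtering out a bad test environment.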
The plots are grouped by language, to compare the additional overhead within a given language, not the difference between languages. The plots do not start the x-axis at 0, to accentuate the differences; this makes it even harder to compare across languages. The graphs clip long tails: they cut off a bit past the worst 95% confidence interval. For this test, tails are more likely because of a bad test environment than actual jitter in the tests.
I could not get GCC to produce the same asm that Zig & Rust output. Clang did, but not GCC. Clang's coroutine support is still experimental, so I went with GCC. The callbacks are implemented with a lambda and `std::function`. The coroutine code is blatantly stolen from David Mazières' excellent blog post and compiled with `gcc version 11.1.0`.
- Likely the AMD spin-loop could be as fast as the zig/rust one, with some GCC tweaking. Just pretend it's at ~56ns-ish
- Resume is expensive on AMD, suspend is the same as a callback, which is impressive.
Rust's async/await implementation is the most complicated by far. The resume code is obtuse. You call a `waker`, which then calls `poll` on the `future`, which then resumes. Callbacks are awkward in Rust because of the borrow checker, so callbacks might not be the baseline to measure from. All this code was run as a `release` build, with the `native` target and `lto` enabled, on `rustc 1.55.0`. Writing this code felt like I wasn't using async/await the way it was intended; I just shoved it into a coroutine-looking box. I am an intermediate Rust dev at best. Criticism welcome.
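To illustrate the waker / poll / future chain, here is a minimal, std-only sketch of a hand-driven task in the `Resume` style. It is not the benchmark's code: `SuspendPoint`, `SuspendMe`, `Flag` and `drive` are hypothetical names, an `Rc<Cell<u64>>` stands in for the shared-memory write, and the waker here only sets a flag that the loop checks before polling (a real waker could call `poll` more directly).

```rust
use std::cell::{Cell, RefCell};
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Slot shared by the event loop and the suspended task.
#[derive(Default)]
struct SuspendPoint {
    value: Cell<Option<u64>>,
    waker: RefCell<Option<Waker>>,
}

impl SuspendPoint {
    /// Event-loop side: hand over a value and wake the waiting task.
    fn resume(&self, v: u64) {
        self.value.set(Some(v));
        if let Some(w) = self.waker.borrow_mut().take() {
            w.wake();
        }
    }
}

/// Future that becomes ready once `resume` has stored a value.
struct SuspendMe(Rc<SuspendPoint>);

impl Future for SuspendMe {
    type Output = u64;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        match self.0.value.take() {
            Some(v) => Poll::Ready(v),
            None => {
                // Nothing yet: remember who to wake, then suspend.
                *self.0.waker.borrow_mut() = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Waker that only records that the task wants to be polled again.
struct Flag(AtomicBool);

impl Wake for Flag {
    fn wake(self: Arc<Self>) {
        self.0.store(true, Ordering::Relaxed);
    }
}

/// The "coroutine": suspend, then write whatever value we were resumed with.
async fn async_loop(sp: Rc<SuspendPoint>, out: Rc<Cell<u64>>) {
    loop {
        let v = SuspendMe(sp.clone()).await;
        out.set(v);
    }
}

/// Feed `values` through the task and return what it wrote after each resume.
fn drive(values: &[u64]) -> Vec<u64> {
    let sp = Rc::new(SuspendPoint::default());
    let out = Rc::new(Cell::new(0));
    let mut task = Box::pin(async_loop(sp.clone(), out.clone()));

    let flag = Arc::new(Flag(AtomicBool::new(false)));
    let waker = Waker::from(flag.clone());
    let mut cx = Context::from_waker(&waker);

    // The first poll runs the task up to its first suspend.
    let _ = task.as_mut().poll(&mut cx);

    let mut seen = Vec::new();
    for &v in values {
        sp.resume(v); // stands in for "the spin loop saw memory change"
        if flag.0.swap(false, Ordering::Relaxed) {
            let _ = task.as_mut().poll(&mut cx); // the actual resume
        }
        seen.push(out.get());
    }
    seen
}

fn main() {
    println!("{:?}", drive(&[7, 8, 9]));
}
```

Each resume in this sketch goes through a `Waker` clone, a wake, and a `poll` of the outer future that in turn polls `SuspendMe`, which gives a feel for why the resume path has more moving parts than a plain callback.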
- Why are callbacks & resume so different on Intel vs AMD?
- Despite concern about resume complexity, it's snappy on Intel.
- I would use async/await for Rust because it's a better fit for the language.
- There are other issues in Rust that might make it inappropriate for ULL code w/o a profusion of `unsafe`. This specific test does not point to any issues. Promising.
Zig isn't even 1.0 yet, and its creator says not to deploy it in production. Maybe it shouldn't be here at all. But I enjoyed writing this version the most. `comptime` is super fun. This was made with compiler version `0.8.1`.
- Callbacks are really fast, or the compiler is good at finding them and in-lining them. It might not scale to dispatching multiple events to multiple listeners.
- Potential for C++ replacement for ULL work, if you can get over no traits or interfaces.
The Kotlin project does not include the `kotlinx-coroutines-core` library. Just the coroutine support provided by the compiler is used. The library implements `async` and `launch`, so neither of those are in the code. Chronicle's `OpenHFT/Affinity` and `OpenHFT/Bytes` packages are used. `Bytes` is for little-endian reading/writing to a buffer set up from a pointer. You could replace it with `Unsafe.getLong()`. I used JNA for calls to `shm_open` / `ftruncate` / `mmap`. Make sure it's installed if you want to run these. These were run with JVM `GraalVM CE 21.2.0 (build 11.0.12+6-jvmci-21.2-b08, mixed mode, sharing)`.
- The JIT can generate some tight machine code. The spin version is right there with the others (except the tails)
- The difference between callbacks and coroutines is smaller than I had assumed. Coroutines aren't a native JVM concept (yet); it's entirely done by the Kotlin compiler.
- If you're already on the JVM, just use coroutines. You have other, more important latency worries.
The absolute value of the difference between coroutines and callbacks is small. That is impressive on its own. From the two machines tested, as the CPU gets faster the absolute value stays about the same, so the difference in % terms increases. It's only a sample of 2 different hardware setups. Might mean nothing.
Use coroutines, or not? I don't think these numbers point to a firm "no". These types of micro-benchmarks do not do a good job of predicting the effect on an actual codebase doing real work. It could be that the layout of the code/data with coroutines improves/degrades your cache hit rate, which could have a far larger effect than callback vs coroutine. Summary? Typical: "It depends".
The Rust stuff is in the root directory. Just install Rust on your system and run `cargo build`.
You'll need a JDK and Gradle for the Kotlin code. In the kotlin directory, run `gradle build`.
Have zig in your path (there is a snap for it). In the zig directory, run `zig build`.
In the `cpp` directory there are two scripts. `setup` will create the target directories and set up `cmake` for them. The `build` script builds both targets. GCC 11+ required.
To run the tests, run `cargo bench` in the root directory.
There are some other scripts specific to my environments. The `run_bench` script sets a `nice` level before launching, and copies the data files between runs for later processing. The `setup_env` script turns off some CPUs and changes CPU modes.
In `src/lib.rs` you can change which CPUs things run on.
The benchmark suite generates HTML output reports. They are in the `target` directory someplace. The R code that generates these graphs is in `graphs`.