-
I just started experimenting with Openraft and first off, it's a really nice project, thanks a lot for this! Currently, I am investigating and implementing an embeddable SQLite that replicates itself via openraft, and so far everything is working very nicely. But I came across unexpectedly low throughput when doing my first, very simple benchmarks under conditions that could be considered production-ready (real network, persistence to disk, ...). The implementation achieves very poor write performance, and after a lot of simplifying and reducing variables that might impact the result, I ended up measuring the latency of each call my client implementation makes. All operations, even inserts into the database or modifying the Raft logs on disk, are done in < 10 ms with my test values. Most of the network requests show a latency of 0-10 ms, even when many logs are inserted at the same time, but from time to time the latency jumps up to 40 ms. So, either I have screwed up badly somewhere and got something totally wrong, or is it possible that something is blocking internally? Does anyone have an idea what the reason could be? Unfortunately, the project is in a very ugly state right now since I am just evaluating and playing around with different network implementations, so it's not public yet.

Edit: It seems that the increasing latency over time is a problem with my KV store. I am using redb here. I can probably fix that, and it has nothing to do with openraft. However, I still cannot get rid of the 40 ms spikes.

Edit 2: I just noticed the `#[allow(clippy::blocks_in_conditions)]` on `impl RaftNetwork<TypeConfig> for NetworkConnection` in an example. Can this have something to do with it, or is this an old annotation?
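For reference, a minimal sketch of the kind of latency probe described above (not the author's code; the wrapped call, the label, and the 20 ms threshold are placeholders), using `std::time::Instant` around an arbitrary async call so spikes can be attributed to one layer at a time:

```rust
use std::future::Future;
use std::time::Instant;

/// Wrap any async call (e.g. the per-write request the client sends)
/// and log outliers above a chosen threshold.
async fn timed<F, T>(label: &str, fut: F) -> T
where
    F: Future<Output = T>,
{
    let start = Instant::now();
    let out = fut.await;
    let elapsed = start.elapsed();
    if elapsed.as_millis() >= 20 {
        eprintln!("{label}: took {elapsed:?}");
    }
    out
}

#[tokio::main]
async fn main() {
    // Stand-in for a real write request.
    let _ = timed("write", async {
        tokio::time::sleep(std::time::Duration::from_millis(25)).await;
        42
    })
    .await;
}
```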
-
Have you tested the performance when using only the mem store? E.g.: https://github.com/datafuselabs/openraft/tree/main/examples/raft-kv-memstore-network-v2
-
Ahh okay, got it. Is the …

The fix will be pretty easy; I can basically just re-use my approach with … Regarding the performance issues I had with …: when you take a look at Facebook's documentation for rocksdb, they mention that it's crash-safe with default settings. However, the Rust wrapper's default for the WAL sync is set to `false`. When I wait for the sync with rocksdb, it is actually even ~40% slower during batch insertions in my tests compared to redb. Both of them give me very low throughput for the Raft when I wait for the sync to disk each time, but this at least explains the huge difference I got in the beginning.

To make it crash-safe, I am syncing the WAL with rocksdb like this:

```rust
let mut opts = WriteOptions::default();
opts.set_sync(true);
self.db
    .write_opt(batch, &opts)
    .map_err(|err| StorageIOError::write_logs(&err))?;
```

I then end up at ~780 puts/s with rocksdb (in this case with 16 concurrent writers) compared to ~15k with the same settings when sync is off (the default). This brings me to another question (sorry ^^): the likelihood of a full crash is rather low, and to make it safe, I would need to sacrifice a huge amount of throughput.
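One possible middle ground between per-write fsync and no sync at all, sketched here under the assumption that the `rocksdb` crate's `flush_wal` is used (this is not the author's code, and the grouping strategy is a placeholder): write batches without per-write sync and issue a single explicit WAL sync for a whole group of appended entries, so one fsync is amortized over many writes.

```rust
use rocksdb::{WriteBatch, WriteOptions, DB};

/// Sketch: append several batches without per-write fsync, then sync the WAL
/// once for the whole group so a single fsync covers many log entries.
fn append_logs_grouped(db: &DB, batches: Vec<WriteBatch>) -> Result<(), rocksdb::Error> {
    let mut opts = WriteOptions::default();
    opts.set_sync(false); // no fsync per batch

    for batch in batches {
        db.write_opt(batch, &opts)?;
    }

    // One explicit, synchronous WAL flush for everything written above.
    db.flush_wal(true)
}
```

With this approach, durability is only guaranteed up to the last synchronous WAL flush, so the throughput gain depends on how many entries can be grouped per sync and how much loss window is acceptable.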
-
Just FYI, if you are interested in it: I open-sourced a very first, not-yet-ready version of the project, thanks to …
I am using `openraft-0.9.13`. I tested on 2 different machines, both Linux:

- `5.14.0-427.24.1.el9_4.x86_64`
- `6.9.8-200.fc40.x86_64`

I added more `Instant` checks in a few places for additional debugging. I was using `reqwest` with connection pooling before, which was a huge improvement over single HTTP calls from my very first testing to really understand what openraft is doing and how. In the end, the spikes were actually coming from `reqwest`, most probably from an internal lock for the connection pool, because it never happened without sending heavy loads. I ended up writing a lower-level WebSocket impl with the `fastwebsockets` crate for the Raft i…
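For context, a minimal sketch of the reused, pooled `reqwest::Client` pattern mentioned above (not the author's implementation; the endpoint, payload, and idle timeout are placeholders):

```rust
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Build one client up front and reuse it for every RPC so connections
    // are pooled, instead of creating a new client (and connection) per call.
    let client = reqwest::Client::builder()
        .pool_idle_timeout(Duration::from_secs(30))
        .build()?;

    for i in 0..3 {
        let resp = client
            .post("http://127.0.0.1:8080/raft/append") // placeholder endpoint
            .body(format!("payload {i}"))
            .send()
            .await?;
        println!("status: {}", resp.status());
    }
    Ok(())
}
```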