Hello,

My team is evaluating KVRocks for a write-heavy application. We cannot tolerate data loss and we need at least one replica. To achieve this, I recently added #3075 (zhixinwen#2 will be sent for review once #3075 is merged).
The plan is:
call WAIT 1 after every write command, so at least one replica has all the data
set rocksdb.write_options yes for all KVRocks instances, so we would not lose data
set replication-group-sync yes for faster replication (added in feat(replication): Add replication-group-sync for replication, zhixinwen/kvrocks#2)
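To make the intended client flow concrete, here is a minimal sketch using hiredis (the host, key, value, and timeout are placeholders I made up for illustration; 6666 is only assumed to be the default KVRocks port):

```cpp
#include <hiredis/hiredis.h>
#include <cstdio>

int main() {
  // Connect to the KVRocks master (6666 assumed as the default KVRocks port).
  redisContext *c = redisConnect("127.0.0.1", 6666);
  if (c == nullptr || c->err) return 1;

  // 1. Issue the write.
  redisReply *set_reply =
      static_cast<redisReply *>(redisCommand(c, "SET %s %s", "key:1", "value"));
  freeReplyObject(set_reply);

  // 2. WAIT 1 <timeout_ms>: block until at least one replica has acknowledged
  //    the write or the timeout expires. The reply is the number of replicas
  //    reached.
  redisReply *wait_reply =
      static_cast<redisReply *>(redisCommand(c, "WAIT %d %d", 1, 1000));
  if (wait_reply && wait_reply->type == REDIS_REPLY_INTEGER &&
      wait_reply->integer < 1) {
    // Fewer than 1 replica acked in time: treat the write as not yet durable
    // on a replica and retry or surface an error to the caller.
    std::printf("write not replicated within timeout\n");
  }
  freeReplyObject(wait_reply);
  redisFree(c);
  return 0;
}
```

This per-command WAIT is the latency/throughput cost we accept for the at-least-one-replica guarantee, which is why the write-path throughput numbers below matter so much.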
I tested the above setup with 1 master and 1 replica, using 4 KB payloads in SET calls. Each machine has 64 CPU cores with 128 hardware threads. I used a RAID0 array over 4 disks with a 512 KB stripe size; each disk can write 500 MB/s when running fio with a 4 KB block size. I gave KVRocks 60 workers and set a large write buffer so there were no stalled writes.
With this setup I can push the throughput to 160 MB/s. If I disable replication, the throughput is 350 MB/s.
Several bottlenecks were observed:
On the master instance, although there are multiple workers, the RocksDB write path is essentially single threaded. The workers are blocked by the group commit lock, and I saw a lot of CPU time spent in rocksdb::WriteThread::AwaitState (a minimal RocksDB-only repro is sketched after this list).
The replication logic is single threaded; it cannot benefit from batched writes or be parallelized.
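For reference, here is a minimal RocksDB-only repro of bottleneck 1, outside of KVRocks (the path, iteration count, and value size are made up to mirror the benchmark above):

```cpp
#include <rocksdb/db.h>
#include <string>
#include <thread>
#include <vector>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB *db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/group_commit_demo", &db);
  if (!s.ok()) return 1;

  // 60 writer threads, mirroring the KVRocks worker count in the benchmark.
  std::vector<std::thread> workers;
  for (int t = 0; t < 60; ++t) {
    workers.emplace_back([db, t] {
      std::string value(4096, 'x');  // 4 KB payload
      for (int i = 0; i < 100000; ++i) {
        std::string key = "k:" + std::to_string(t) + ":" + std::to_string(i);
        // Every Put joins a write group: one leader thread writes the WAL for
        // the whole group while the other writers wait in
        // rocksdb::WriteThread::AwaitState, so the WAL write path stays
        // effectively single threaded no matter how many workers there are.
        db->Put(rocksdb::WriteOptions(), key, value);
      }
    });
  }
  for (auto &w : workers) w.join();
  delete db;
  return 0;
}
```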
To solve bottleneck 1, I think there are two options:
shard KVRocks internally across multiple RocksDB instances. I tried running two KVRocks processes on one machine and the total throughput doubled to 700 MB/s, showing this is a workable approach. We could map slot ranges to different RocksDB instances inside one KVRocks. We would prefer this over relying solely on creating more KVRocks instances in the cluster, to avoid the complexity of managing an extra-large cluster (see the routing sketch after this list).
batch write. TiKV's raftstore seems to do some internal batching before writing to RocksDB, and it shows pretty good throughput (see the coalescing sketch after this list).
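To make option 1 concrete, here is a rough routing sketch (names like ShardedStorage, HashSlot, and kNumShards are hypothetical; a real implementation would compute the cluster slot with CRC16 over the key rather than std::hash):

```cpp
#include <rocksdb/db.h>
#include <array>
#include <functional>
#include <string>

constexpr int kNumShards = 4;         // internal RocksDB instances per KVRocks
constexpr int kClusterSlots = 16384;  // Redis-cluster-style slot space

// Placeholder: a real implementation would use CRC16(key) % 16384 so the
// internal shards line up with cluster slots.
int HashSlot(const std::string &key) {
  return static_cast<int>(std::hash<std::string>{}(key) % kClusterSlots);
}

struct ShardedStorage {
  std::array<rocksdb::DB *, kNumShards> shards{};

  bool Open(const std::string &base_dir) {
    rocksdb::Options options;
    options.create_if_missing = true;
    for (int i = 0; i < kNumShards; ++i) {
      auto s = rocksdb::DB::Open(
          options, base_dir + "/shard-" + std::to_string(i), &shards[i]);
      if (!s.ok()) return false;
    }
    return true;
  }

  // Contiguous slot ranges map to shards, so slot migration stays range based.
  rocksdb::DB *Route(const std::string &user_key) {
    return shards[HashSlot(user_key) * kNumShards / kClusterSlots];
  }

  rocksdb::Status Put(const std::string &key, const std::string &value) {
    // Writes to different shards hit independent write queues and WALs, so
    // they no longer contend on a single group-commit leader.
    return Route(key)->Put(rocksdb::WriteOptions(), key, value);
  }
};
```

Each shard would also need its own replication stream, which is where the per-shard replication threads in the conditions further down come from.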
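And a sketch of option 2, coalescing writes from many workers into a single WriteBatch before hitting RocksDB (BatchWriter is hypothetical, not existing KVRocks code):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Workers enqueue key/value pairs; one flusher thread commits them as a
// single WriteBatch, amortizing the WAL write/sync over many commands.
class BatchWriter {
 public:
  explicit BatchWriter(rocksdb::DB *db) : db_(db), flusher_([this] { Loop(); }) {}

  ~BatchWriter() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    flusher_.join();
  }

  void Put(std::string key, std::string value) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      pending_.emplace_back(std::move(key), std::move(value));
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    std::unique_lock<std::mutex> lk(mu_);
    while (!stop_ || !pending_.empty()) {
      cv_.wait(lk, [this] { return stop_ || !pending_.empty(); });
      if (pending_.empty()) continue;
      auto items = std::move(pending_);
      pending_.clear();
      lk.unlock();

      rocksdb::WriteBatch batch;
      for (auto &kv : items) batch.Put(kv.first, kv.second);
      // One group-commit entry (and one fsync with sync writes) for the whole
      // batch instead of one per SET.
      db_->Write(rocksdb::WriteOptions(), &batch);

      lk.lock();
    }
  }

  rocksdb::DB *db_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::pair<std::string, std::string>> pending_;
  bool stop_ = false;
  std::thread flusher_;
};
```

Workers would call BatchWriter::Put instead of writing to RocksDB directly; the sketch omits the part where a worker blocks until its batch is committed before replying to the client, which a real server would need.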
To solve bottleneck 2, sharding KVRocks internally may be the only solution unless we want to write our own replication log.
Overall, I think internal sharding may be the most promising solution. To do this, I think the following conditions would need to be met:
Internal sharding should only be available when cluster mode is on, because we will shard by slots.
The POLLUPDATES API needs to be changed.
The slot migration job needs to take internal sharding into account.
More replication threads would be needed, one per shard.
I want to run this by the community to see whether the idea would work, or would you suggest trying something more lightweight first?