Hello,

My team is evaluating KVRocks for a write-heavy application. We cannot tolerate data loss and we need at least one replica. To achieve this, I recently added #3075 (zhixinwen#2 will be sent for review once #3075 is merged).
The plan is:
call WAIT 1 after every write command, so at least one replica has all the data
set rocksdb.write_options yes for all KVRocks instances, so we would not lose data
set replication-group-sync yes for faster replication (added in feat(replication): Add replication-group-sync for replication, zhixinwen/kvrocks#2)
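To make the intended client flow concrete, here is a minimal sketch using hiredis (the host, key, value, and timeout are placeholders I made up for illustration; 6666 is only assumed to be the default KVRocks port):

```cpp
#include <hiredis/hiredis.h>
#include <cstdio>

int main() {
  // Connect to the KVRocks master (6666 assumed as the default KVRocks port).
  redisContext *c = redisConnect("127.0.0.1", 6666);
  if (c == nullptr || c->err) return 1;

  // 1. Issue the write.
  redisReply *set_reply =
      static_cast<redisReply *>(redisCommand(c, "SET %s %s", "key:1", "value"));
  freeReplyObject(set_reply);

  // 2. WAIT 1 <timeout_ms>: block until at least one replica has acknowledged
  //    the write or the timeout expires. The reply is the number of replicas
  //    reached.
  redisReply *wait_reply =
      static_cast<redisReply *>(redisCommand(c, "WAIT %d %d", 1, 1000));
  if (wait_reply && wait_reply->type == REDIS_REPLY_INTEGER &&
      wait_reply->integer < 1) {
    // Fewer than 1 replica acked in time: treat the write as not yet durable
    // on a replica and retry or surface an error to the caller.
    std::printf("write not replicated within timeout\n");
  }
  freeReplyObject(wait_reply);
  redisFree(c);
  return 0;
}
```

This per-command WAIT is the latency/throughput cost we accept for the at-least-one-replica guarantee, which is why the write-path throughput numbers below matter so much.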
I tested the above setup with 1 master and 1 replica, using 4 KB payloads in SET calls. Each machine has 64 CPU cores with 128 hardware threads. I used a RAID0 array over 4 disks with a 512 KB stripe size; each disk can write 500 MB/s when running fio with a 4 KB block size. I gave KVRocks 60 workers and set a large write buffer so there were no stalled writes.
With this setup I can push the throughput to 160 MB/s. If I disable replication, the throughput is 350 MB/s.
Several bottlenecks were observed:
On the master instance, although there are multiple workers, the RocksDB write path is essentially single threaded. The workers are blocked by the group commit lock, and I saw a lot of CPU time spent in rocksdb::WriteThread::AwaitState (a minimal RocksDB-only repro is sketched after this list).
The replication logic is single threaded; it cannot benefit from batched writes or be parallelized.
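For reference, here is a minimal RocksDB-only repro of bottleneck 1, outside of KVRocks (the path, iteration count, and value size are made up to mirror the benchmark above):

```cpp
#include <rocksdb/db.h>
#include <string>
#include <thread>
#include <vector>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB *db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/group_commit_demo", &db);
  if (!s.ok()) return 1;

  // 60 writer threads, mirroring the KVRocks worker count in the benchmark.
  std::vector<std::thread> workers;
  for (int t = 0; t < 60; ++t) {
    workers.emplace_back([db, t] {
      std::string value(4096, 'x');  // 4 KB payload
      for (int i = 0; i < 100000; ++i) {
        std::string key = "k:" + std::to_string(t) + ":" + std::to_string(i);
        // Every Put joins a write group: one leader thread writes the WAL for
        // the whole group while the other writers wait in
        // rocksdb::WriteThread::AwaitState, so the WAL write path stays
        // effectively single threaded no matter how many workers there are.
        db->Put(rocksdb::WriteOptions(), key, value);
      }
    });
  }
  for (auto &w : workers) w.join();
  delete db;
  return 0;
}
```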
To solve bottleneck 1, I think there are two options:
shard KVRocks internally across multiple RocksDB instances. I tried running two KVRocks processes on one machine and the total throughput doubled to 700 MB/s, showing this is a workable approach. We could map slot ranges to different RocksDB instances inside one KVRocks. We would prefer this over relying solely on creating more KVRocks instances in the cluster, to avoid the complexity of managing an extra-large cluster (see the routing sketch after this list).
batch write. TiKV's raftstore seems to do some internal batching before writing to RocksDB, and it shows pretty good throughput (see the coalescing sketch after this list).
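To make option 1 concrete, here is a rough routing sketch (names like ShardedStorage, HashSlot, and kNumShards are hypothetical; a real implementation would compute the cluster slot with CRC16 over the key rather than std::hash):

```cpp
#include <rocksdb/db.h>
#include <array>
#include <functional>
#include <string>

constexpr int kNumShards = 4;         // internal RocksDB instances per KVRocks
constexpr int kClusterSlots = 16384;  // Redis-cluster-style slot space

// Placeholder: a real implementation would use CRC16(key) % 16384 so the
// internal shards line up with cluster slots.
int HashSlot(const std::string &key) {
  return static_cast<int>(std::hash<std::string>{}(key) % kClusterSlots);
}

struct ShardedStorage {
  std::array<rocksdb::DB *, kNumShards> shards{};

  bool Open(const std::string &base_dir) {
    rocksdb::Options options;
    options.create_if_missing = true;
    for (int i = 0; i < kNumShards; ++i) {
      auto s = rocksdb::DB::Open(
          options, base_dir + "/shard-" + std::to_string(i), &shards[i]);
      if (!s.ok()) return false;
    }
    return true;
  }

  // Contiguous slot ranges map to shards, so slot migration stays range based.
  rocksdb::DB *Route(const std::string &user_key) {
    return shards[HashSlot(user_key) * kNumShards / kClusterSlots];
  }

  rocksdb::Status Put(const std::string &key, const std::string &value) {
    // Writes to different shards hit independent write queues and WALs, so
    // they no longer contend on a single group-commit leader.
    return Route(key)->Put(rocksdb::WriteOptions(), key, value);
  }
};
```

Each shard would also need its own replication stream, which is where the per-shard replication threads in the conditions further down come from.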
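And a sketch of option 2, coalescing writes from many workers into a single WriteBatch before hitting RocksDB (BatchWriter is hypothetical, not existing KVRocks code):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Workers enqueue key/value pairs; one flusher thread commits them as a
// single WriteBatch, amortizing the WAL write/sync over many commands.
class BatchWriter {
 public:
  explicit BatchWriter(rocksdb::DB *db) : db_(db), flusher_([this] { Loop(); }) {}

  ~BatchWriter() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    flusher_.join();
  }

  void Put(std::string key, std::string value) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      pending_.emplace_back(std::move(key), std::move(value));
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    std::unique_lock<std::mutex> lk(mu_);
    while (!stop_ || !pending_.empty()) {
      cv_.wait(lk, [this] { return stop_ || !pending_.empty(); });
      if (pending_.empty()) continue;
      auto items = std::move(pending_);
      pending_.clear();
      lk.unlock();

      rocksdb::WriteBatch batch;
      for (auto &kv : items) batch.Put(kv.first, kv.second);
      // One group-commit entry (and one fsync with sync writes) for the whole
      // batch instead of one per SET.
      db_->Write(rocksdb::WriteOptions(), &batch);

      lk.lock();
    }
  }

  rocksdb::DB *db_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::pair<std::string, std::string>> pending_;
  bool stop_ = false;
  std::thread flusher_;
};
```

Workers would call BatchWriter::Put instead of writing to RocksDB directly; the sketch omits the part where a worker blocks until its batch is committed before replying to the client, which a real server would need.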
To solve bottleneck 2, sharding KVRocks internally may be the only solution unless we want to write our own replication log.
Overall, I think internal sharding may be the most promising solution. To do this, I think the following conditions would need to be met:
Internal sharding should only be available when cluster mode is on, because we will shard by slots.
The POLLUPDATES API needs to be changed.
The slot migration job needs to take internal sharding into account.
More replication threads would be needed, one per shard.
I want to run this by the community to see whether the idea would work, or would you suggest trying something more lightweight first?