Node repair efficiency #21
Comments
The above statement is incorrect. Both vnodes will send 2/3rds of objects. There is some complication with varying nvals, but in essence a pair of vnodes is used, (RepairNode - 1) and (RepairNode + 1), and they both repair everything owned by the target that they know about. For buckets which are nval=3 this will be 2/3rds; for nval=5 this will be 4/5ths. There is therefore considerable duplication of effort. On the receiver, the redundant results can be quickly discarded after a comparison (which with leveled just requires a HEAD request).
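The 2/3rds and 4/5ths figures can be checked with a small model of preference lists. This is only a sketch, assuming keys are spread uniformly and that a key's preflist is the nval consecutive partitions starting at its hash position; the function names and the ring size are hypothetical and are not riak_core APIs.

```python
from fractions import Fraction


def preflist_starts(vnode: int, nval: int, ring_size: int) -> set[int]:
    """Ring positions whose preflist (the next nval partitions) includes vnode."""
    return {(vnode - i) % ring_size for i in range(nval)}


def repair_fraction(source: int, target: int, nval: int, ring_size: int = 64) -> Fraction:
    """Fraction of the target's data that the source vnode also holds.

    ring_size is an arbitrary illustrative value.
    """
    target_starts = preflist_starts(target, nval, ring_size)
    shared = target_starts & preflist_starts(source, nval, ring_size)
    return Fraction(len(shared), len(target_starts))


target = 10
for nval in (3, 5):
    left = repair_fraction(target - 1, target, nval)   # RepairNode - 1
    right = repair_fraction(target + 1, target, nval)  # RepairNode + 1
    print(nval, left, right)
```

Under these assumptions each neighbour covers (nval - 1)/nval of the target, so a pair of neighbours sends most of the target's objects twice.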
Currently the This should be refactored so that repair can be done by a list of vnodes. When a If the If the The FilterFun use in The key advantage is that slow nodes will lag on their transfers but will only need to send the remainder (i.e. if a pair is still used, this will be 1/3rd rather than 2/3rds for nval=3). Two extensions can then further enhance this:
This would mean that a lagging node would have handoffs where all keys are filtered (so only a head fold is required, and no objects will be fetched).
Node repair, especially under write load, is inefficient.
Assume a leveled backend, with 2M objects in a source partition and 64 partitions on the target node.
Currently the repair fold is done in `sqn_order` in leveled. This means the fold is over the Journal. The Journal normally has about 25% excess objects (due to deferred compaction). Therefore, there will be 2.5M events in the fold of:
This will filter out the 0.5M uncompacted objects and pass 2M to the handoff sender.
This will then apply Filter(K) in `riak_core_handoff_sender:visit_item/3`, which will discard either 1/3rd or 2/3rds of these objects before encoding the handoff item and sending. The receiver will then do a local HEAD request and discard any object which is already present (i.e. due to hinted handoff and read repair). In long repairs.
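The counts at each stage of the current repair path can be worked through as below. This is a back-of-envelope sketch using only the figures stated in this issue (2M live objects, about 25% Journal excess, a filter discarding 1/3rd or 2/3rds); the function and field names are hypothetical.

```python
from fractions import Fraction


def repair_fold_counts(live_objects: int, journal_excess: Fraction, discard: Fraction) -> dict[str, int]:
    """Estimate how many items reach each stage of a Journal-order repair fold."""
    journal_events = int(live_objects * (1 + journal_excess))  # every Journal entry is visited
    uncompacted = journal_events - live_objects                # stale entries dropped by the fold
    passed_to_sender = live_objects                            # items handed to the handoff sender
    encoded_and_sent = int(passed_to_sender * (1 - discard))   # items surviving Filter(K)
    return {
        "journal_events": journal_events,
        "uncompacted": uncompacted,
        "passed_to_sender": passed_to_sender,
        "encoded_and_sent": encoded_and_sent,
    }


for discard in (Fraction(1, 3), Fraction(2, 3)):
    print(discard, repair_fold_counts(2_000_000, Fraction(1, 4), discard))
```

In this model every one of the Journal events is read before the filter gets a chance to discard anything, which is the cost the issue is pointing at.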
For repairs, folding in `key_order` might be better, using deferred object fetches so that the deserialisation of the value only occurs roughly 1.2M or 600k times (not 2.5M times). It may be possible to extend this so that in encode_handoff_item, if it is a deferred fetch, it can make the vnode_head remotely so that the value is not fetched/deserialised if the receiver already has the object.
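The deferred-fetch idea can be illustrated with a toy model in which the fold yields a key plus a closure, and the value is only read once the key has passed both the filter and a receiver-side presence check. This is a sketch of the concept only: `key_order_fold`, `handoff`, `keep` and `receiver_has` are hypothetical names, and `receiver_has` merely stands in for the remote HEAD check suggested above, not for any real leveled or riak_core call.

```python
from typing import Callable, Iterable, Iterator

Deferred = Callable[[], bytes]


def key_order_fold(store: dict[str, bytes], fetch_log: list[str]) -> Iterator[tuple[str, Deferred]]:
    """Yield (key, deferred_fetch) pairs in key order without reading any value."""
    for key in sorted(store):
        def fetch(k: str = key) -> bytes:
            fetch_log.append(k)  # record each value that is actually read
            return store[k]
        yield key, fetch


def handoff(items: Iterable[tuple[str, Deferred]],
            keep: Callable[[str], bool],
            receiver_has: Callable[[str], bool]) -> list[tuple[str, bytes]]:
    """Send only keys that pass the filter and are missing on the receiver."""
    sent = []
    for key, fetch in items:
        if not keep(key):
            continue  # filtered out: value never fetched
        if receiver_has(key):
            continue  # already present remotely: value never fetched
        sent.append((key, fetch()))  # value is read only at this point
    return sent


store = {f"k{i:02d}": f"v{i}".encode() for i in range(9)}
fetch_log: list[str] = []
sent = handoff(
    key_order_fold(store, fetch_log),
    keep=lambda k: int(k[1:]) % 3 != 0,          # toy filter discarding 1/3rd of keys
    receiver_has=lambda k: k in {"k01", "k02"},  # toy stand-in for a remote HEAD
)
print(len(store), len(fetch_log), [k for k, _ in sent])
```

In this toy run the number of value reads equals the number of objects sent, rather than the number of entries folded over.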