
Node repair efficiency #21

Open
martinsumner opened this issue Jan 16, 2025 · 2 comments

Comments

@martinsumner
Contributor

Node repair, especially when under write load, is inefficient.

Assume the leveled backend, with 2M objects in a source partition, and a target node with 64 partitions.

Currently the repair fold is done in sqn_order in leveled. This means the fold is over the Journal. The Journal normally holds about 25% excess objects (due to deferred compaction).

Therefore, there will be 2.5M events in the fold, each of which must:

  • read the next K/V
  • deserialise the object
  • check the SQN in the Ledger (which requires deserialising a block)

This will filter out the 0.5M un-compacted objects, and pass 2M to the handoff sender.
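The arithmetic above can be checked with a back-of-envelope sketch. This is a Python model of the figures assumed in this issue (2M objects, ~25% Journal excess); the variable names are illustrative, not riak_kv code.

```python
# Back-of-envelope for the sqn_order repair fold described above.
# Figures are the assumptions from this issue, not measured values.

OBJECTS_IN_PARTITION = 2_000_000
JOURNAL_EXCESS = 0.25  # deferred compaction leaves ~25% stale Journal entries

# Every Journal entry (live or stale) costs a read/deserialise/SQN-check event.
fold_events = int(OBJECTS_IN_PARTITION * (1 + JOURNAL_EXCESS))

# The Ledger SQN check filters the stale entries before the handoff sender.
passed_to_sender = OBJECTS_IN_PARTITION

print(fold_events)       # 2500000
print(passed_to_sender)  # 2000000
```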

This will then apply Filter(K) in riak_core_handoff_sender:visit_item/3, which will discard either 1/3rd or 2/3rds of these objects - before encoding the handoff item and sending.

The receiver will then do a local HEAD request, and discard any object which is already present (e.g. due to hinted handoff or read repair during long repairs).

For repairs, folding in key_order might be better, using a deferred fetch of the object, so that deserialisation of the value only occurs either 1.2M or 600K times (not 2.5M times). This could be extended so that, in encode_handoff_item, a deferred fetch can make the vnode HEAD request remotely, so that the value is not fetched/deserialised if the receiver already has the object.
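The saving from a key_order fold with deferred fetch can be modelled as follows. This is a hedged Python sketch, not riak_kv or leveled code: in the sqn_order case every value is deserialised before the key filter runs, while with deferred fetch the filter runs on the key alone and only matching values are ever fetched.

```python
# Illustrative model of eager vs deferred value deserialisation.
# `decode()` stands in for deserialising a stored object.

def sqn_order_fold(store, key_filter):
    deserialised = 0
    sent = []
    for key, blob in store.items():
        value = blob.decode()            # every value deserialised up front
        deserialised += 1
        if key_filter(key):
            sent.append((key, value))
    return sent, deserialised

def key_order_fold_deferred(store, key_filter):
    deserialised = 0
    sent = []
    for key in sorted(store):            # key_order fold
        if key_filter(key):              # filter checked on the key alone
            value = store[key].decode()  # deferred fetch only when needed
            deserialised += 1
            sent.append((key, value))
    return sent, deserialised

store = {f"k{i}": f"v{i}".encode() for i in range(9)}
wanted = lambda k: int(k[1:]) % 3 == 0   # stand-in for Filter(K)

_, eager = sqn_order_fold(store, wanted)
_, lazy = key_order_fold_deferred(store, wanted)
print(eager, lazy)   # 9 3
```

With the real figures from this issue, the same shape of saving takes the deserialisation count from 2.5M down to the number of keys that actually pass the filter.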

@martinsumner
Contributor Author

martinsumner commented Jan 17, 2025

> This will then apply Filter(K) in riak_core_handoff_sender:visit_item/3, which will discard either 1/3rd or 2/3rds of these objects - before encoding the handoff item and sending.

The above statement is incorrect. Both vnodes will send 2/3rds of objects.

There is some complication with varying nvals. But in essence repair is performed by a pair of vnodes, (RepairNode - 1) and (RepairNode + 1), and both repair everything owned by the target that they know about. For buckets with nval=3 this will be 2/3rds of the target's data; for nval=5 it will be 4/5ths. There is therefore considerable duplication of effort.
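The fraction each member of the pair re-sends follows directly from the nval, as a small sketch shows (the function name here is illustrative, not riak_core code):

```python
from fractions import Fraction

# Per this issue: each vnode in the repair pair holds, and therefore
# re-sends, (nval - 1) / nval of the data owned by the target vnode,
# so the two senders overlap heavily.

def fraction_sent_by_each(nval):
    return Fraction(nval - 1, nval)

print(fraction_sent_by_each(3))  # 2/3
print(fraction_sent_by_each(5))  # 4/5
```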

On the receiver the redundant results can be quickly discarded after a comparison (which with leveled just requires a HEAD request).

@martinsumner
Contributor Author

Currently the riak_core_vnode_manager that coordinates repairs is coupled to the idea of a repair being done by a pair of vnodes (plus one in the ring, or minus one in the ring).

This should be refactored so that repair can be done by a list of vnodes. When a riak_core_handoff_sender starts a handoff of type repair, it should check the riak_core_vnode_manager on the target for a negative filter for that handoff.

If the riak_core_vnode_manager is unaware of the repair it should return ok (which is also the standard return for an unexpected message in riak_core_vnode_manager, and so will be the return from a legacy node during a cluster upgrade).

If the riak_core_vnode_manager is aware of the repair, it should send the current list of negative filters to the handoff, and add a negative_filter for this sender to the list of negative filters. A negative filter is FilterMod:FilterFun(Src).

The FilterFun used in riak_core_handoff_sender:visit_item/3 should check the positive filter fun first (i.e. the Key is expected in the target vnode), and then confirm the filter is false for all negative filters (i.e. the Key is not in a source vnode that is already handing off).
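The proposed filter composition can be sketched as follows. This is a Python model of the logic, not the Erlang implementation; the key scheme and filter funs are invented for illustration.

```python
# Model of the proposed visit_item/3 check: send a key only if the
# positive filter accepts it AND no negative filter (one per source
# vnode already handing off) claims it.

def should_send(key, positive_filter, negative_filters):
    if not positive_filter(key):          # key not owned by the target vnode
        return False
    # key already covered by another in-flight repair handoff?
    return not any(neg(key) for neg in negative_filters)

positive = lambda k: k % 10 < 6                # target owns keys 0-5 mod 10
other_sender = lambda k: k % 10 in (0, 1, 2)   # already being sent by a peer

to_send = [k for k in range(20) if should_send(k, positive, [other_sender])]
print(to_send)   # [3, 4, 5, 13, 14, 15]
```

As each new sender registers with the riak_core_vnode_manager, its own filter is appended to the negative_filters list handed to subsequent senders, which is what shrinks a lagging node's share to the remainder.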

The key advantage now is that slow nodes will end up lagging on their transfers, but will only need to send the remainder (i.e. if a pair is still used, this will be 1/3rd not 2/3rds for nval=3).

Then two extensions can further enhance this:

  • the ability to switch to fold_heads and deferred fetch with the leveled backend (which means the Key filter is checked before the object is read);
  • the use of a double_pair rather than a pair (i.e. Before - 1, Before - 2 and After + 1, After + 2) in the repair.

This would mean that a lagging node would have handoffs where all keys are filtered (so only a head fold is required, and no objects will be fetched).

@tburghart tburghart transferred this issue from OpenRiak/riak_kv-forked Feb 3, 2025