
Node repair efficiency #21

Open
martinsumner opened this issue Jan 16, 2025 · 2 comments

Comments

@martinsumner
Contributor

Node repair, especially when under write load, is inefficient.

Assume the leveled backend, with 2M objects in a source partition, and a target node with 64 partitions.

Currently the repair fold is done in sqn_order in leveled. This means the fold is over the Journal. The Journal normally holds about 25% excess objects (due to deferred compaction).

Therefore, there will be 2.5M events in the fold, each of which must:

  • read the next K/V
  • deserialise the object
  • check the SQN in the Ledger (which requires deserialising a block)

This will filter out the 0.5M un-compacted objects, and pass 2M to the handoff sender.
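The arithmetic above can be checked with a back-of-envelope sketch. This is a Python model of the figures assumed in this issue (2M objects, ~25% Journal excess); the variable names are illustrative, not riak_kv code.

```python
# Back-of-envelope for the sqn_order repair fold described above.
# Figures are the assumptions from this issue, not measured values.

OBJECTS_IN_PARTITION = 2_000_000
JOURNAL_EXCESS = 0.25  # deferred compaction leaves ~25% stale Journal entries

# Every Journal entry (live or stale) costs a read/deserialise/SQN-check event.
fold_events = int(OBJECTS_IN_PARTITION * (1 + JOURNAL_EXCESS))

# The Ledger SQN check filters the stale entries before the handoff sender.
passed_to_sender = OBJECTS_IN_PARTITION

print(fold_events)       # 2500000
print(passed_to_sender)  # 2000000
```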

This will then apply Filter(K) in riak_core_handoff_sender:visit_item/3, which will discard either 1/3rd or 2/3rds of these objects - before encoding the handoff item and sending.

The receiver will then do a local HEAD request, and discard any object which is already present (e.g. due to hinted handoff or read repair during long repairs).

For repairs, folding in key_order might be better, using a deferred fetch of the object, so that deserialisation of the value only occurs either 1.2M or 600K times (not 2.5M times). This could be extended so that, in encode_handoff_item, a deferred fetch can make the vnode HEAD request remotely, so that the value is not fetched/deserialised if the receiver already has the object.
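The saving from a key_order fold with deferred fetch can be modelled as follows. This is a hedged Python sketch, not riak_kv or leveled code: in the sqn_order case every value is deserialised before the key filter runs, while with deferred fetch the filter runs on the key alone and only matching values are ever fetched.

```python
# Illustrative model of eager vs deferred value deserialisation.
# `decode()` stands in for deserialising a stored object.

def sqn_order_fold(store, key_filter):
    deserialised = 0
    sent = []
    for key, blob in store.items():
        value = blob.decode()            # every value deserialised up front
        deserialised += 1
        if key_filter(key):
            sent.append((key, value))
    return sent, deserialised

def key_order_fold_deferred(store, key_filter):
    deserialised = 0
    sent = []
    for key in sorted(store):            # key_order fold
        if key_filter(key):              # filter checked on the key alone
            value = store[key].decode()  # deferred fetch only when needed
            deserialised += 1
            sent.append((key, value))
    return sent, deserialised

store = {f"k{i}": f"v{i}".encode() for i in range(9)}
wanted = lambda k: int(k[1:]) % 3 == 0   # stand-in for Filter(K)

_, eager = sqn_order_fold(store, wanted)
_, lazy = key_order_fold_deferred(store, wanted)
print(eager, lazy)   # 9 3
```

With the real figures from this issue, the same shape of saving takes the deserialisation count from 2.5M down to the number of keys that actually pass the filter.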

@martinsumner
Contributor Author

martinsumner commented Jan 17, 2025

> This will then apply Filter(K) in riak_core_handoff_sender:visit_item/3, which will discard either 1/3rd or 2/3rds of these objects - before encoding the handoff item and sending.

The above statement is incorrect. Both vnodes will send 2/3rds of objects.

There is some complication with varying nvals. But in essence repair is performed by a pair of vnodes, (RepairNode - 1) and (RepairNode + 1), and both repair everything owned by the target that they know about. For buckets with nval=3 this will be 2/3rds of the target's data; for nval=5 it will be 4/5ths. There is therefore considerable duplication of effort.
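The fraction each member of the pair re-sends follows directly from the nval, as a small sketch shows (the function name here is illustrative, not riak_core code):

```python
from fractions import Fraction

# Per this issue: each vnode in the repair pair holds, and therefore
# re-sends, (nval - 1) / nval of the data owned by the target vnode,
# so the two senders overlap heavily.

def fraction_sent_by_each(nval):
    return Fraction(nval - 1, nval)

print(fraction_sent_by_each(3))  # 2/3
print(fraction_sent_by_each(5))  # 4/5
```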

On the receiver the redundant results can be quickly discarded after a comparison (which with leveled just requires a HEAD request).

@martinsumner
Contributor Author

Currently the riak_core_vnode_manager that coordinates repairs is coupled to the idea of a repair being done by a pair of vnodes (plus one in the ring, or minus one in the ring).

This should be refactored so that repair can be done by a list of vnodes. When a riak_core_handoff_sender starts a handoff of type repair, it should check the riak_core_vnode_manager on the target for a negative filter for that handoff.

If the riak_core_vnode_manager is unaware of the repair it should return ok (which is also the standard return for an unexpected message in riak_core_vnode_manager, and so will be the return from a legacy node during a cluster upgrade).

If the riak_core_vnode_manager is aware of the repair, it should send the current list of negative filters to the handoff, and add a negative_filter for this sender to the list of negative filters. A negative filter is FilterMod:FilterFun(Src).

The FilterFun used in riak_core_handoff_sender:visit_item/3 should check the positive filter fun first (i.e. the Key is expected in the target vnode), and then confirm the filter is false for all negative filters (i.e. the Key is not in a source vnode that is already handing off).
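The proposed filter composition can be sketched as follows. This is a Python model of the logic, not the Erlang implementation; the key scheme and filter funs are invented for illustration.

```python
# Model of the proposed visit_item/3 check: send a key only if the
# positive filter accepts it AND no negative filter (one per source
# vnode already handing off) claims it.

def should_send(key, positive_filter, negative_filters):
    if not positive_filter(key):          # key not owned by the target vnode
        return False
    # key already covered by another in-flight repair handoff?
    return not any(neg(key) for neg in negative_filters)

positive = lambda k: k % 10 < 6                # target owns keys 0-5 mod 10
other_sender = lambda k: k % 10 in (0, 1, 2)   # already being sent by a peer

to_send = [k for k in range(20) if should_send(k, positive, [other_sender])]
print(to_send)   # [3, 4, 5, 13, 14, 15]
```

As each new sender registers with the riak_core_vnode_manager, its own filter is appended to the negative_filters list handed to subsequent senders, which is what shrinks a lagging node's share to the remainder.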

The key advantage now is that slow nodes will end up lagging on their transfers, but will only need to send the remainder (i.e. if a pair is still used, this will be 1/3rd not 2/3rds for nval=3).

Then two extensions can further enhance this:

  • the ability to switch to fold_heads and deferred fetch with the leveled backend (which means the Key filter is checked before the object is read);
  • the use of a double_pair rather than a pair (i.e. Before - 1, Before - 2 and After + 1, After + 2) in the repair.

This would mean that a lagging node would have handoffs where all keys are filtered (so only a head fold is required, and no objects will be fetched).

@tburghart tburghart transferred this issue from OpenRiak/riak_kv-forked Feb 3, 2025