Add support for {F, async} as a data_size option #19

Merged
merged 2 commits into from
Mar 20, 2025

Conversation

martinsumner (Contributor) commented Feb 27, 2025

This is intended for backends where it is convenient to return a function (because the size is too expensive to calculate synchronously), but expensive to recalculate continuously (as would happen with {F, dynamic}, where the size is recalculated for every user CLI request for handoff/transfer status).

It would also be possible for the F() in {F, async} to return {F0, dynamic} should this be required.
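A backend's data_size callback under this scheme might look like the following minimal Erlang sketch. Only the {F, async} and {F0, dynamic} return shapes come from this proposal; the module name, the {Size, bytes} result shape, and the helper function are hypothetical.

```erlang
%% Illustrative sketch only: the {F, async} / {F0, dynamic} shapes are from
%% this PR; the module name, helper, and result shape are assumptions.
-module(backend_size_sketch).
-export([data_size/1]).

data_size(State) ->
    %% Too expensive to compute synchronously, so hand back a fun that the
    %% handoff manager can evaluate once, when the transfer is prompted.
    F = fun() ->
            Size = expensive_size_query(State),
            {Size, bytes}
        end,
    {F, async}.

expensive_size_query(_State) ->
    %% Placeholder for a slow fold/scan over the store.
    1024.
```

If recalculation on every subsequent status request were required, F() could instead return {F0, dynamic}, deferring to a second fun.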

martinsumner (Contributor, Author) commented Feb 28, 2025

So the approach of using {F, async} rather than {F, dynamic} means that an operator call for handoff (or transfer) status will not result in a fresh query. The {F, async} will result in a single query when the riak_core_handoff_manager prompts the outbound transfer, and so it will reflect the value at the start of the transfer (when the snapshot is taken).
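The difference between the two shapes could be sketched as follows; resolve_data_size/1 is a hypothetical helper written to illustrate the evaluation timing, not actual riak_core_handoff_manager code.

```erlang
%% Illustrative only: how a handoff manager might treat the two shapes.
-module(size_resolution_sketch).
-export([resolve_data_size/1]).

resolve_data_size({F, dynamic}) ->
    %% Re-evaluated on every status request: always fresh, but the
    %% (potentially expensive) query runs repeatedly.
    F();
resolve_data_size({F, async}) ->
    %% Evaluated once, when the outbound transfer is prompted; the result
    %% is a snapshot taken at the start of the transfer.
    case F() of
        {F0, dynamic} -> F0();   %% F() may itself hand back a dynamic fun
        Size -> Size
    end;
resolve_data_size(Size) when is_integer(Size) ->
    Size.
```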

The downside is that if an outbound request is constrained by an inbound transfer limit, then each time the request is re-scheduled, after failing because of the max_concurrency at the receiver, the query will be re-run. It needs to be re-run, in the sense that otherwise the size would not be accurate.

This limit is applied after the riak_core.forced_ownership_handoff has been applied - so the validate_size will not be called on anything filtered at this stage.

So if there is an issue with repeated calls to validate_size (and re-running of the size query), then the riak_core.forced_ownership_handoff limit should be set so as not to exceed the riak_core.handoff_concurrency limit. This is not the default, but it might be necessary if (say) a single node is joining and receiving handoffs from many nodes.
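Under that assumption, an advanced.config fragment might look like the sketch below. The setting names come from the discussion above; the values and the 10-second timer are illustrative only, not recommendations.

```erlang
%% Illustrative advanced.config fragment: keep forced_ownership_handoff at
%% or below handoff_concurrency, so that transfers rescheduled after a
%% max_concurrency rejection do not repeatedly re-run the size query.
[
 {riak_core,
  [
   {forced_ownership_handoff, 2},
   {handoff_concurrency, 2},
   %% Alternatively, lengthen the management timer (milliseconds) to
   %% reduce how often rescheduling, and hence the query, occurs.
   {vnode_management_timer, 10000}
  ]}
].
```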

Alternatively the riak_core.vnode_management_timer can be increased.

There is no obvious alternative, as there is no other point to which the size calculation can be deferred, without a significant refactoring of the chain of processes involved in prompting handoff.

As the riak_core_handoff_manager is a singleton process, even in this worst case, only a single CPU per node may be occupied - as the manager can only pick up one vnode at a time. Also the total number of CPUs busied in the cluster will be riak_core.forced_ownership_handoff - riak_core.handoff_concurrency.

@martinsumner martinsumner merged commit 1728625 into openriak-3.2 Mar 20, 2025
1 check passed
@martinsumner martinsumner deleted the nhse-o32-orkv.i29-asyncfun branch March 20, 2025 17:02