
Join (and perhaps other cluster changes) - failure to stop vnode #23

Open
martinsumner opened this issue Jan 28, 2025 · 9 comments

During a node join test under heavy load, there were some 2i query timeouts. These timeouts aligned with the end of an ownership handoff - but one that did not see the source vnode terminating and de-registering in the usual way.

The first log shows the ownership handoff completing:

2025-01-18 14:55:54.452686+00:00 [info] <0.94161446.2>@riak_core_handoff_sender:start_fold/5:277 ownership transfer of riak_kv_vnode from '[email protected]' 1064726778758646071269481333209323371741528981504 to '[email protected]' 1064726778758646071269481333209323371741528981504 completed: sent 10.70 GB bytes in 2804 of 1150868 objects in 1274.35 seconds (8.60 MB/second)

However, for 6 seconds after it completes there are soft-limit mailbox checks against the vnode (i.e. the local node still thinks the vnode is active for that partition) - but the vnode never responds (presumably because it is shutting down).

2025-01-18 14:55:57.103029+00:00 [warning] <0.127022406.2>@riak_kv_put_fsm:join_mbox_replies/3:1248 soft-limit mailbox check timeout
2025-01-18 14:55:57.103223+00:00 [warning] <0.127022406.2>@riak_kv_put_fsm:check_mailboxes/0:1214 Mailbox soft-load poll timout 100
2025-01-18 14:55:57.103355+00:00 [warning] <0.127022406.2>@riak_kv_put_fsm:add_errors_to_mbox_data/2:1261 Mailbox for {1064726778758646071269481333209323371741528981504,'[email protected]'} did not return in time
2025-01-18 14:55:58.012774+00:00 [warning] <0.100409504.2>@riak_kv_put_fsm:join_mbox_replies/3:1248 soft-limit mailbox check timeout
2025-01-18 14:55:58.012924+00:00 [warning] <0.100409504.2>@riak_kv_put_fsm:check_mailboxes/0:1214 Mailbox soft-load poll timout 100
2025-01-18 14:55:58.013122+00:00 [warning] <0.100409504.2>@riak_kv_put_fsm:add_errors_to_mbox_data/2:1261 Mailbox for {1064726778758646071269481333209323371741528981504,'[email protected]'} did not return in time
2025-01-18 14:55:58.102215+00:00 [warning] <0.105510578.2>@riak_kv_put_fsm:join_mbox_replies/3:1248 soft-limit mailbox check timeout
2025-01-18 14:55:58.102368+00:00 [warning] <0.105510578.2>@riak_kv_put_fsm:check_mailboxes/0:1214 Mailbox soft-load poll timout 100
2025-01-18 14:55:58.102453+00:00 [warning] <0.105510578.2>@riak_kv_put_fsm:add_errors_to_mbox_data/2:1261 Mailbox for {1064726778758646071269481333209323371741528981504,'[email protected]'} did not return in time
2025-01-18 14:55:58.318778+00:00 [warning] <0.127022825.2>@riak_kv_put_fsm:join_mbox_replies/3:1248 soft-limit mailbox check timeout
2025-01-18 14:55:58.319009+00:00 [warning] <0.127022825.2>@riak_kv_put_fsm:check_mailboxes/0:1214 Mailbox soft-load poll timout 100
2025-01-18 14:55:58.319645+00:00 [warning] <0.127022825.2>@riak_kv_put_fsm:add_errors_to_mbox_data/2:1261 Mailbox for {1064726778758646071269481333209323371741528981504,'[email protected]'} did not return in time
2025-01-18 14:55:59.350872+00:00 [warning] <0.117874794.2>@riak_kv_put_fsm:join_mbox_replies/3:1248 soft-limit mailbox check timeout
2025-01-18 14:55:59.350990+00:00 [warning] <0.117874794.2>@riak_kv_put_fsm:check_mailboxes/0:1214 Mailbox soft-load poll timout 100
2025-01-18 14:55:59.351095+00:00 [warning] <0.117874794.2>@riak_kv_put_fsm:add_errors_to_mbox_data/2:1261 Mailbox for {1064726778758646071269481333209323371741528981504,'[email protected]'} did not return in time
2025-01-18 14:56:00.886816+00:00 [warning] <0.127042178.2>@riak_kv_put_fsm:join_mbox_replies/3:1248 soft-limit mailbox check timeout
2025-01-18 14:56:00.886941+00:00 [warning] <0.127042178.2>@riak_kv_put_fsm:check_mailboxes/0:1214 Mailbox soft-load poll timout 100
2025-01-18 14:56:00.887016+00:00 [warning] <0.127042178.2>@riak_kv_put_fsm:add_errors_to_mbox_data/2:1261 Mailbox for {1064726778758646071269481333209323371741528981504,'[email protected]'} did not return in time

There is no log of the vnode unregistering. However, 30 seconds after the handoff completes the vnode is started (again? in parallel?). The vnode_manager presumably still believes this is the active vnode for this partition.

2025-01-18 14:56:26.357386+00:00 [info] <0.127019187.2>@riak_kv_leveled_backend:schedule_journalcompaction/4:779 Schedule compaction for interval 2381 on partition 1064726778758646071269481333209323371741528981504
2025-01-18 14:56:26.484294+00:00 [info] <0.1622.0>@riak_core_worker_pool:handle_info/3:253 worker_pool=af3_pool has qlen=0 and no items checked out
2025-01-18 14:56:27.206395+00:00 [info] <0.127019187.2>@riak_kv_vnode:maybe_start_aaecontroller/2:388 AAE Controller started with pid=<0.125723474.2>
2025-01-18 14:56:27.206579+00:00 [info] <0.127019187.2>@riak_core_vnode:do_init/1:248 Starting vnode worker pool riak_kv_worker with size of 5
2025-01-18 14:56:27.207311+00:00 [info] <0.127019187.2>@riak_core_vnode:do_init/1:261 Adding vnode_pool <0.127172069.2> to riak_kv_vnode state

Then, just over a minute later, the newly started vnode does a hinted handoff of all of its data to the new vnode on the joining node.

2025-01-18 14:57:38.136369+00:00 [info] <0.128054612.2>@riak_core_handoff_sender:start_fold/5:206 Starting hinted transfer of riak_kv_vnode from '[email protected]' 1064726778758646071269481333209323371741528981504 to '[email protected]' 1064726778758646071269481333209323371741528981504
2025-01-18 14:57:39.012945+00:00 [info] <0.127181674.2>@riak_core_handoff_sender:send_objects/1:498 Receiver in sync after batch_set=0 and total_batches=0 with next batch having batch_size=49711 item_count=201 between src_partition=1064726778758646071269481333209323371741528981504 trg_partition=1064726778758646071269481333209323371741528981504 type=hinted module=riak_kv_vnode last_sync_time=0 ms batch_set_time=876 ms
@martinsumner (Contributor, Author) commented Jan 28, 2025

When the handoff completes, a series of async messages is passed (a rough sketch follows the list):

  • The handoff_sender process sends a handoff_complete message to the vnode (the source of the handoff);
  • The vnode sends a handoff_complete message to the vnode_manager, with a timer set to resend if necessary;
  • The vnode_manager updates its loop state (removing the handoff from handoff_status), then sends a finish_handoff message back to the vnode;
  • The vnode calls handoff_finished on the Module vnode (e.g. riak_kv_vnode), marks the handoff complete by setting a ring change, and then starts the process of shutting down and unregistering.
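
A minimal Erlang sketch of that handshake, for orientation only (module name, message handling and the 10s resend timer are illustrative and simplified, not the actual riak_core code):

```erlang
-module(handoff_handshake_sketch).
-export([sender_done/1, vnode_loop/1, manager_loop/1]).

%% 1. The handoff sender tells the source vnode that the fold has finished.
sender_done(VnodePid) ->
    VnodePid ! handoff_complete.

%% 2. The vnode notifies the vnode_manager, setting a resend timer in case
%%    the notification is lost; 4. on finish_handoff it would call
%%    Mod:handoff_finished/2, update the ring (set_only) and begin shutdown.
vnode_loop(ManagerPid) ->
    receive
        handoff_complete ->
            ManagerPid ! {handoff_complete, self()},
            erlang:send_after(10000, self(), resend_notification),
            vnode_loop(ManagerPid);
        resend_notification ->
            ManagerPid ! {handoff_complete, self()},
            vnode_loop(ManagerPid);
        finish_handoff ->
            finished
    end.

%% 3. The manager removes the transfer from its handoff status and tells
%%    the vnode to finish.
manager_loop(HandoffStatus) ->
    receive
        {handoff_complete, VnodePid} ->
            VnodePid ! finish_handoff,
            manager_loop(lists:delete(VnodePid, HandoffStatus))
    end.
```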

The ring transition function run after the handoff is finished uses the set_only output:

https://github.com/OpenRiak/riak_core/blob/1b01278e762b835c0777d5c7319fe7b22186560d/src/riak_core_vnode.erl#L634-L652

This does not prompt a gossip of the ring change. What should have gossiped the change, though, and did that gossip fail to happen?
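
For context, a hedged sketch of the set_only vs new_ring distinction, assuming the usual riak_core_ring_manager:ring_trans/2 contract (update_ring_after_handoff/3 is a placeholder, not the real transition in the linked code):

```erlang
-module(ring_trans_sketch).
-export([mark_complete/2]).

%% Hypothetical stand-in for the ring update performed in the linked
%% riak_core_vnode:mark_handoff_complete code.
update_ring_after_handoff(Ring, _Idx, _Mod) ->
    Ring.

mark_complete(Idx, Mod) ->
    riak_core_ring_manager:ring_trans(
      fun(Ring, _Args) ->
              Ring2 = update_ring_after_handoff(Ring, Idx, Mod),
              %% {set_only, Ring2}: the ring manager installs the updated
              %% ring locally but does not prompt a gossip round.
              %% Returning {new_ring, Ring2} instead would additionally
              %% prompt the change to be gossiped.
              {set_only, Ring2}
      end, []).
```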

Was a request received for the vnode when it was not considered active in the local ring, and did that prompt the vnode to be started ... and then the handoff (e.g. perhaps via riak_core_vnode_manager:get_vnode_pid/1)?

@martinsumner (Contributor, Author) commented Jan 31, 2025

The riak_core_gossip process has scheduled gossiping: there is a gossip limit of {Tokens, TimePeriod}, and the process of replenishing the tokens also prompts a gossip of the ring. So this will gossip after 10 seconds ... but in those 10 seconds the vnode is already shutting down.

Given that the token mechanism exists to minimise gossiping, perhaps the gossip should have been prompted here - relying on the tokens running out if gossiping becomes too frequent.
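
A hedged sketch of that token mechanism as described above (module, message names and the gossip function are illustrative, not the actual riak_core_gossip code):

```erlang
-module(gossip_tokens_sketch).
-export([start/2]).

%% Tokens = maximum gossips per TimePeriod (in ms). The replenish tick
%% itself also prompts a gossip of the local ring, which is the
%% "scheduled gossip" referred to above.
start(Tokens, TimePeriod) ->
    spawn(fun() -> loop(Tokens, Tokens, TimePeriod) end).

loop(Tokens, Max, TimePeriod) ->
    receive
        gossip_ring when Tokens > 0 ->
            gossip_local_ring(),              % consume a token
            loop(Tokens - 1, Max, TimePeriod);
        gossip_ring ->
            loop(Tokens, Max, TimePeriod)     % out of tokens: skip this gossip
    after TimePeriod ->
            gossip_local_ring(),              % replenish tick gossips the ring
            loop(Max, Max, TimePeriod)
    end.

gossip_local_ring() ->
    %% placeholder for sending the local ring to peer nodes
    ok.
```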

@martinsumner (Contributor, Author) commented:

For other cluster changes, e.g. leave and transfer, the participate_in_coverage status can be used to remove the leaving node from coverage before the process starts. This means that the source of the handoff is generally on a node not participating in coverage - and so if the ring change is gossiped after the vnode is unregistered, there are no issues with coverage plans still being issued that contain those vnodes.

This is not the case with join. Transfers can come from every node - and so there's no sensible way of ensuring that the source of any handoff is not on a node participating in coverage.

For GET/PUT there is no impact - it is just an r/w count missing from the preflist.

Ideally the ring change would be gossiped before the vnode deletes its data - but there would be no obvious way of confirming it has propagated.

There may be some workarounds:

  • postpone the continuation of handoff_finished until some timeout, within which we expect the ring change to have propagated (e.g. 2 x the gossip interval? - a minimal sketch follows this list);
  • make it part of the coverage process to check that the vnode is active. A big part of the issue is that requests simply time out without a response, rather than failing fast.
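
A minimal sketch of the first workaround, assuming a delay of twice the gossip interval (the continue_finish_handoff message is a hypothetical name, not an existing riak_core message):

```erlang
-module(delayed_finish_sketch).
-export([finish_handoff_later/2]).

%% Delay the continuation of the handoff-finished processing for roughly
%% two gossip intervals, so the set_only ring change has had a chance to
%% propagate before the vnode deletes its backend and unregisters.
finish_handoff_later(VnodePid, GossipIntervalMs) ->
    Delay = 2 * GossipIntervalMs,
    erlang:send_after(Delay, VnodePid, continue_finish_handoff).
```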

@martinsumner (Contributor, Author) commented Jan 31, 2025

Although the issue of timeouts is specific to coverage queries, the hinted handoff is still a potential issue - a non-functional one, if nothing else, because of the re-sending of the same data.

It is not clear precisely what happened here, though. Presumably the ring_trans call happened, but the unregistering/delete didn't (otherwise why was it able to start as a fallback and have data to hand off?).

Even with set_only, the ring_trans logic appears to update everything, e.g. putting the new entry in ets and setting mochiglobal to direct to ets before returning - so the riak_core_vnode should have been blocked between the ring update and the delete.

tburghart transferred this issue from OpenRiak/riak_kv-forked on Feb 3, 2025
@martinsumner (Contributor, Author) commented:

The solution probably lies within this change - OpenRiak/riak_core@5c812ff.

When the handoff is complete, the riak_kv_vnode will have its backend deleted before the ring change has been propagated. The riak_kv_vnode still exists at this stage, but without a backend or anything useful for handling a request.

Just as with ensemble_get/ensemble_put, it cannot handle a request in the interim state between the handoff completing and the vnode unregistering (and the change being propagated). So should index queries use the same underlying mechanism, and try to forward any coverage query request to the node/idx to which the handoff has completed (which, given the handoff is complete, has the data)?

That node may have been configured not to participate_in_coverage - should we override that in this case? Or is the reality that the 2i query timing out is the right thing to do?

@martinsumner (Contributor, Author) commented Feb 18, 2025

Given this issue, plus also OpenRiak/riak_core#15 - perhaps a more fundamental change to coverage planning is required:

  • Formally require handle_coverage_request to check the coverage status, and support rejection through some partial re-planning (i.e. being able to ask for a new index from the coverage plan function).
  • Do coverage planning as a race akin to that used in repair: don't plan up-front; send a request to all vnodes, each responding with its coverage capability, and use the first vnodes to respond that together form a complete coverage plan (vnodes not participating in coverage do not respond).

The latter is in line with some general principles - but may be inefficient, as it may involve lots of "filter vnodes" to fill gaps (i.e. vnodes which are queried but return less than an NVAL partition's worth of results).
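
A hedged sketch of the race-style assembly in the second idea, ignoring NVAL and filter vnodes (the coverage_offer message and module are illustrative only, not existing riak_core/riak_kv code):

```erlang
-module(coverage_race_sketch).
-export([await_cover/2]).

%% Needed :: ordset of partition indexes still uncovered.
%% Acc    :: [{Partition, VnodeRef}] accepted so far.
%% Keep the first offers that together form a complete cover; vnodes not
%% participating in coverage simply never send an offer.
await_cover(Needed, Acc) ->
    case ordsets:size(Needed) of
        0 ->
            {ok, Acc};
        _ ->
            receive
                {coverage_offer, Partition, VnodeRef} ->
                    case ordsets:is_element(Partition, Needed) of
                        true ->
                            await_cover(ordsets:del_element(Partition, Needed),
                                        [{Partition, VnodeRef} | Acc]);
                        false ->
                            await_cover(Needed, Acc)  % partition already covered
                    end
            after 5000 ->
                    {error, insufficient_coverage}
            end
    end.
```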

@martinsumner (Contributor, Author) commented Feb 18, 2025

Further to the possibility of using the mod_set_forwarding functionality added for ensemble:

There exists a window after the backend has been deleted and before the riak_core_vnode_manager has unregistered the vnode. The unregistering removes the vnode from the ets lookup table of vnode PIDs, and then sends the unregistered message to the vnode (via the proxy), which prompts shutdown of the vnode. In that window the vnode may still receive requests forwarded from the proxy.

When the riak_kv_vnode:delete/1 function is called from within riak_core_vnode:finish_handoff/2, this will "drop" the backend - which in leveled terms means calling leveled_bookie:book_destroy/1. However, the drop function starts a new backend for that partition immediately after the destroy. So the drop both deletes the backend and starts an empty one in its place.

[edit - this looks like a specific issue with the leveled backend. Neither the bitcask backend nor the eleveldb backend does the second start]
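
An illustrative paraphrase of that difference in drop behaviour (not the actual backend code; destroy_store/1 and start_empty_store/1 are placeholders):

```erlang
-module(backend_drop_sketch).
-export([drop_leveled/1, drop_other/1]).

%% leveled: destroy the store, then immediately start a fresh empty store
%% for the same partition - so the vnode is left with an empty backend.
drop_leveled(State) ->
    ok = destroy_store(State),
    {ok, start_empty_store(State)}.

%% bitcask / eleveldb (per the edit above): destroy only, no replacement
%% store is started.
drop_other(State) ->
    ok = destroy_store(State),
    {ok, State}.

destroy_store(_State) ->
    ok.    % placeholder for e.g. leveled_bookie:book_destroy/1
start_empty_store(State) ->
    State. % placeholder for restarting the backend for the partition
```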

So during the window - between the delete and the unregistering - there is still a vnode, registered with the vnode proxy, with an empty backend. Any node which has not seen the propagated ring change may at this stage send a coverage request to this vnode, and have it return 0 results?

This behaviour is OK for standard PUT/GET where the empty result is handled within the FSM without misleading the client (due to quorum).

Once the vnode is unregistered, this should not happen - but what if a request is still received? As part of the unregistering, the vnode is added to a list of "exclusions" in the riak_core_handoff_manager to prevent it from being restarted by riak_core_ring_handler:ensure_vnodes_started/2. However, a request to the proxy would still prompt the proxy to call get_vnode_pid/1, which should prompt the riak_core_vnode_manager to start a new vnode via riak_core_vnode_manager:get_vnode_pid/2.

So there are two windows where a request from a node unaware of the un-gossiped ring change may prompt the query to be fulfilled by an empty vnode: in the forwarding state before unregistering, and through a restart post-unregistering.

There is then the additional confusion of what happened in this case (above) - the original issue - as the backend was restarted after handoff completion, but restarted with data, as if the Mod:delete/1 call had not happened. leveled_bookie:book_destroy/1 has an infinite wait, and the unregistering is not prompted until Mod:delete/1 is complete. How did this occur?

Also, the error that was detected as a result of this was the timeout of 2i queries - this particular test would not show an error on incomplete results. So in this case there was not a rapid query against an empty vnode backend.

@martinsumner (Contributor, Author) commented:

Test to help with investigation

Fundamentally, I have misunderstood the situation. The above test delays the scheduled gossip, and continuously checks that an index query returns the correct number of results during a join. The query never fails to return the correct results.
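
The checking loop in that test has roughly this shape (an illustrative sketch only, not the actual riak_test code; QueryFun is supplied by the caller):

```erlang
-module(join_query_check_sketch).
-export([poll_query/3]).

%% Repeatedly run QueryFun (which returns a list of 2i results) while the
%% join/handoffs are in progress, crashing as soon as the result count is
%% wrong (badmatch => test failure).
poll_query(_QueryFun, _Expected, 0) ->
    ok;
poll_query(QueryFun, Expected, Remaining) ->
    Results = QueryFun(),
    Expected = length(Results),
    timer:sleep(1000),
    poll_query(QueryFun, Expected, Remaining - 1).
```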

In both windows referred to above, the vnode is in a forwarding state: in the first window because of the exit from riak_core_vnode:mark_handoff_complete/5, and in the second because the local ring has been changed. In that forwarding state coverage queries are forwarded to the joining node's vnode once the handoff is complete - so the emptiness of the database is irrelevant. Stray queries are handled.

This is true even if the node to which the query is forwarded has participate_in_coverage disabled. The participate_in_coverage value is relevant only to query planning, and is ignored in forwarding scenarios.

@martinsumner (Contributor, Author) commented:

It isn't possible to progress this without further information. It appears that either the deletion or the ring change wasn't triggered for this handoff, and there are case statements that allow for this ... however, it also appears that it should have happened.

There is no "fundamental" issue, in that forwarding is working - so under light-load conditions queries will continue to work.

The PR OpenRiak/riak_core#11 has been extended to log in more detail what does happen at the end of handoffs. This will increase noise at startup, when many handoffs may occur; but outside of startup handoff is an important event, so providing a full log history is worthwhile.
