Join (and perhaps other cluster changes) - failure to stop vnode #23
Comments
When the handoff completes, there is a series of async messages passed:
The ring transition function after handoff has finished uses the set_only return - this does not prompt a gossiping of the ring change. What should gossip the change though, and did this gossip fail to happen? Was a request received for the vnode when it was not considered active in the local ring, and did that prompt the vnode to be started ... and then the handoff (e.g. perhaps via ...)?
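My reading of that transition pattern, as a minimal self-contained sketch (the module, function names and stand-in helpers below are illustrative, not riak_core's actual code): the return tag of the transition fun decides whether the updated ring is also gossiped or only installed locally - the set_only case, which appears to be what happens after handoff completion.

```erlang
-module(ring_trans_sketch).
-export([ring_trans/2]).

%% TransFun(Ring) should return {new_ring, Ring1} | {set_only, Ring1} | ignore.
ring_trans(TransFun, Ring0) ->
    case TransFun(Ring0) of
        {new_ring, Ring1} ->
            install_locally(Ring1),
            gossip(Ring1),              %% the change is propagated immediately
            {ok, Ring1};
        {set_only, Ring1} ->
            install_locally(Ring1),     %% local ring updated, but no gossip -
            {ok, Ring1};                %% peers wait for the next scheduled gossip
        ignore ->
            not_changed
    end.

%% Stand-ins for the real ring install (ets/mochiglobal) and riak_core_gossip.
install_locally(Ring) -> put(current_ring, Ring).
gossip(Ring) -> io:format("gossiping ring ~p~n", [Ring]).
```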
The riak_core_gossip process has scheduled gossiping; there is a gossip limit of {Tokens, TimePeriod}, and the process of replenishing the tokens also prompts a gossip of the ring. So this will gossip after 10 seconds ... but in those 10 seconds, the vnode is already shutting down. Given the token mechanism exists to minimise gossiping, perhaps the gossip should have been prompted here - relying on the tokens running out if gossiping is too frequent.
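As an illustration of that token mechanism (a rough sketch of the behaviour described above, not riak_core_gossip's actual implementation): each prompted gossip spends a token, requests arriving with no tokens left are dropped rather than queued, and the periodic replenish both resets the tokens and gossips the current ring - which is why an otherwise un-prompted change can take up to a full TimePeriod (here 10 seconds) to propagate.

```erlang
-module(gossip_tokens_sketch).
-export([start/2, request_gossip/1]).

start(Tokens, TimePeriodMs) ->
    spawn(fun() -> loop(Tokens, Tokens, TimePeriodMs) end).

request_gossip(Pid) -> Pid ! gossip_request, ok.

loop(TokensLeft, MaxTokens, TimePeriodMs) ->
    receive
        gossip_request when TokensLeft > 0 ->
            do_gossip(),
            loop(TokensLeft - 1, MaxTokens, TimePeriodMs);
        gossip_request ->
            %% out of tokens - the request is dropped, not queued
            loop(TokensLeft, MaxTokens, TimePeriodMs)
    after TimePeriodMs ->
            %% replenish, and gossip the current ring as part of the reset
            do_gossip(),
            loop(MaxTokens, MaxTokens, TimePeriodMs)
    end.

do_gossip() -> io:format("gossiping current ring~n").
```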
For other cluster changes, e.g. leave and transfer, ... This is not the case with join. Transfers can come from every node - and so there's no sensible way of ensuring that the source of any handoff is not on a node participating in coverage. For GET/PUT it is just an r/w count missing from the preflist - there's no impact. Ideally the ring change should be gossiped before the vnode deletes - but there would be no obvious way of confirming it has propagated. There may be some workarounds:
Although the issue of timeouts is specific to coverage queries, the hinted handoff is still a potential issue - a non-functional one if nothing else, because of the re-sending of the same data. It is not clear precisely what happened here though. Presumably the ring_trans call happened, but the unregistering/delete didn't (otherwise why was it able to start as a fallback and have data to hand off?). Even with set_only, the ring trans logic appears to update everything, e.g. put the new entry in ets and set mochiglobal to direct to ets before returning, so the riak_core_vnode should have been blocked between the ring update and the delete.
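To make that ordering claim concrete, here is a toy sketch of the sequence described above (persistent_term standing in for mochiglobal; none of this is riak_core_ring_manager's real code): the new ring goes into ets and the global pointer is switched before the call returns, so a delete that runs after ring_trans has returned should already observe the updated local ring.

```erlang
-module(ring_install_sketch).
-export([install_ring/1, get_ring/0]).

install_ring(NewRing) ->
    Tab = ensure_table(),
    true = ets:insert(Tab, {ring, NewRing}),     %% 1. put the new entry in ets
    ok = persistent_term:put(ring_source, ets),  %% 2. direct readers to ets
    ok.                                          %% 3. only then return to the caller

get_ring() ->
    case persistent_term:get(ring_source, undefined) of
        ets ->
            [{ring, Ring}] = ets:lookup(ensure_table(), ring),
            {ok, Ring};
        undefined ->
            {error, no_ring}
    end.

ensure_table() ->
    case ets:whereis(ring_sketch_tab) of
        undefined -> ets:new(ring_sketch_tab, [named_table, public, set]);
        _ -> ring_sketch_tab
    end.
```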
The solution probably lies within this change - OpenRiak/riak_core@5c812ff. When the handoff is complete, the riak_kv_vnode will have its backend deleted before the ring change has been propagated. The riak_kv_vnode still exists at this stage, but without a backend or anything useful for handling a request. Just as with ensemble_get/ensemble_put, it cannot handle a request in the interim state between the handoff completing and the vnode unregistering (and the change being propagated). So should index queries use the same underlying mechanisms and try and ... That node may have been configured not to ...
Given this issue, plus also OpenRiak/riak_core#15 - perhaps a more fundamental change to coverage planning is required:
The latter is in line with some general principles - but may not be efficient, as it may involve lots of "filter vnodes" to fill gaps (i.e. vnodes which are queried but return less than a full NVAL partition's worth of results).
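To illustrate what a "filter vnode" means here (a hypothetical toy, not riak_core_coverage_plan), the sketch below plans a query over a wanted set of partitions: any vnode that is needed for only a subset of the partitions it holds ends up in the plan with a filter, returning less than a full partition's worth of results - and a per-NVAL-partition plan can produce many of these.

```erlang
-module(filter_vnode_sketch).
-export([plan/2]).

%% Vnodes :: [{VnodeId, HeldPartitions}], Wanted :: [Partition]
%% Returns a plan of [{VnodeId, all} | {VnodeId, {filter, Partitions}}].
plan(Vnodes, Wanted) ->
    plan(Vnodes, ordsets:from_list(Wanted), []).

plan(_Vnodes, [], Acc) ->
    lists:reverse(Acc);
plan([], _Remaining, Acc) ->
    lists:reverse(Acc);
plan([{VnodeId, Held} | Rest], Remaining, Acc) ->
    HeldSet = ordsets:from_list(Held),
    Useful = ordsets:intersection(HeldSet, Remaining),
    case Useful of
        [] ->
            plan(Rest, Remaining, Acc);
        HeldSet ->
            %% everything this vnode holds is wanted: no filter needed
            plan(Rest, ordsets:subtract(Remaining, Useful),
                 [{VnodeId, all} | Acc]);
        _ ->
            %% only part of what it holds is wanted: a "filter vnode"
            plan(Rest, ordsets:subtract(Remaining, Useful),
                 [{VnodeId, {filter, Useful}} | Acc])
    end.
```

For example, plan([{v1, [p1, p2, p3]}, {v2, [p3, p4]}], [p1, p2, p4]) returns [{v1, {filter, [p1, p2]}}, {v2, {filter, [p4]}}] - both vnodes end up as filter vnodes.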
Further to the possibility of using the ... :

There exists a window after the backend has been deleted and before the ... When the ... [edit - this looks like a specific issue with the leveled backend; neither the bitcask backend nor the eleveldb backend do the second start.]

So during the window - between the delete and the unregistering - there is still a vnode, registered with the vnode proxy, but with an empty backend. Any node which has not seen the propagated ring change may at this stage send a coverage request to this vnode, and have it return 0 results. This behaviour is OK for standard PUT/GET, where the empty result is handled within the FSM without misleading the client (due to quorum).

Once the vnode is unregistered this should not happen - but what if a request is still received? As part of the unregistering, the vnode is added to a list of "exclusions" in the ...

So are there two windows where a request from a node unaware of the un-gossiped ring change may prompt the query to be fulfilled by an empty vnode: in the forwarding state before unregistering, and through a restart post-unregistering?

There is then the additional confusion of what happened in this case (above) - the original issue. The backend was restarted after a handoff completion - but restarted with data, as if the ... Also, the error that was detected as a result of this was the timeout of 2i queries - this particular test would not show an error on incomplete results. So in this case there was not a rapid query against an empty vnode backend.
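A toy contrast of that point about quorum (illustrative only, not the riak_kv FSMs): for a GET, one empty reply is out-voted by the other replies in the quorum, whereas a coverage query only hears from one vnode per partition, so an empty reply silently becomes the final answer for those partitions.

```erlang
-module(empty_vnode_sketch).
-export([quorum_get/1, coverage_merge/1]).

%% Replies :: [not_found | {ok, Object}] from the R vnodes answering a GET.
%% One not_found from an emptied vnode is out-voted by the other replies.
quorum_get(Replies) ->
    case [Obj || {ok, Obj} <- Replies] of
        []   -> not_found;
        Objs -> {ok, hd(Objs)}   %% stand-in for proper sibling/merge handling
    end.

%% A coverage query has exactly one vnode answering for each partition, so an
%% empty reply is simply appended - the missing keys are silently lost.
coverage_merge(PerVnodeResults) ->
    lists:append(PerVnodeResults).
```

Here quorum_get([not_found, {ok, Obj}, {ok, Obj}]) still returns {ok, Obj}, while coverage_merge([[], [<<"k2">>]]) has lost whatever keys the emptied vnode should have contributed.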
Test to help with investigation: ...

Fundamentally, I have misunderstood the situation. The above test delays the scheduled gossip, and continuously checks that an index query returns the correct number of results during a join. The query never fails to return the correct results.

In both windows referred to above, the vnode is in a forwarding state - as the local ring has been changed in the second case, and because of the exit from ... This is true even if the node to which the query is forwarded has ...
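A toy model of why the forwarding state appears to cover both windows (again illustrative, not riak_core_vnode's actual state machine): once the vnode is forwarding, a request that reaches it is passed on to the new owner rather than being answered from its now-empty backend.

```erlang
-module(vnode_forward_sketch).
-export([handle_request/3]).

%% VnodeState :: #{status := active | forwarding, forward_to => NewOwner}
handle_request(Request, #{status := forwarding, forward_to := NewOwner}, _Backend) ->
    %% never answered from the local (possibly empty) backend
    {forward, NewOwner, Request};
handle_request(Request, #{status := active}, Backend) ->
    {reply, run_query(Request, Backend)}.

run_query(_Request, Backend) -> maps:get(keys, Backend, []).
```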
It isn't possible to progress this without further information. It appears that the deletion or the ring change wasn't triggered for this handoff, and there are case statements to allow this ... however it also appears it should have happened. There is no "fundamental" issue, in that forwarding is working - so under light-load conditions queries will continue to work. The PR OpenRiak/riak_core#11 has been extended to log in more detail what does happen at the end of handoffs. This will increase noise at startup, when many handoffs may occur, but outside of startup handoff is an important event, so providing a full log history is worthwhile.
During a node join test under heavy load, there were some 2i query timeouts. These timeouts aligned with the end of an ownership handoff, but one that did not see the source vnode terminating and de-registering in the usual way.
The first log says the ownership handoff is completed.
However, for 6 seconds after it is completed there are soft-limit checks on the mailbox (i.e. the local node thinks the vnode is still active for that partition) - but the vnode never responds (presumably it is shutting down).
There is no log about the vnode unregistering. However, 30 seconds after the handoff is completed the vnode is started (again? in parallel?). The vnode_manager presumably believes this is still the active node for this partition.
Then 11 seconds later the newly started vnode does a hinted handoff of all its data to the new vnode on the joining node.