I am running the latest pre-release 1.6.1-alpha.0.22
Here is where it occurs in the Cluster class:
    public async Task ShutdownAsync(bool graceful = true, string reason = "")
    {
        Logger.LogInformation("Stopping Cluster {Id}", System.Id);

        // Inform all members of the cluster that this node intends to leave. Also, let the MemberList know that this
        // node was the one that initiated the shutdown to prevent another shutdown from being called.
        Logger.LogInformation("Setting GracefullyLeft gossip state for {Id}", System.Id);
        MemberList.Stopping = true;
        await Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()).ConfigureAwait(false);

        // ... continues with shutdown
As you can see, an error there would stop the graceful shutdown in its tracks. The member is blocked but never gets to shut down, so throughout the cluster I see tons of "we are blocked" or "they are blocked" messages. Furthermore, it never attempts this again because MemberList.Stopping is now set to true.
I'm not sure why the GossipActor is not able to respond, so that's another thing I need to look into, since the gossip loop is also timing out on the BlockGracefullyLeft part.
Nonetheless, it seems it would be good if an issue with the gossip actor did not stop the member from being able to shut down in a situation like this. I'm curious what other people's thoughts are, and I would be happy to submit some changes for this as well.
I'm thinking about adding try/catch blocks and possibly timeouts around each step, so that we can still attempt the rest of the shutdown.
AttemptTask would look something like:

    private async Task AttemptTask(Task task, TimeSpan timeout, string name)
    {
        // Log from a continuation so a failure that surfaces after the timeout is still recorded.
        _ = task.ContinueWith(t =>
        {
            if (!t.IsCompletedSuccessfully)
            {
                Logger.LogError(t.Exception, "Error during shutdown step [{stepName}]", name);
            }
        });

        try
        {
            await Task.WhenAny(Task.Delay(timeout), task);
            if (!task.IsCompleted)
            {
                // if the task isn't complete, we timed out (there is no exception to attach here)
                Logger.LogError("Timeout during shutdown step [{stepName}] after {timeout}", name, timeout);
            }
        }
        catch (Exception)
        {
            // if the task fails while we are waiting, it will already be logged
        }
    }
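To make the idea concrete, here is a rough, self-contained sketch of how a sequence of shutdown steps could be run so that one failing or hanging step does not block the rest. It is not the Proto.Actor implementation: `ShutdownSketch`, `AttemptStepAsync`, and `RunAllAsync` are hypothetical names, and it logs to the console instead of the cluster's `Logger` so that it runs on its own.

```csharp
using System;
using System.Threading.Tasks;

public static class ShutdownSketch
{
    // Hypothetical helper: waits for one shutdown step, but never longer than `timeout`,
    // and never throws. Returns true only if the step completed successfully in time.
    public static async Task<bool> AttemptStepAsync(Func<Task> step, TimeSpan timeout, string name)
    {
        try
        {
            var task = step();
            var finished = await Task.WhenAny(task, Task.Delay(timeout)).ConfigureAwait(false);
            if (finished != task)
            {
                // Observe a late failure so it is logged rather than left unobserved.
                _ = task.ContinueWith(
                    t => Console.Error.WriteLine(
                        $"Late error in shutdown step [{name}]: {t.Exception?.GetBaseException().Message}"),
                    TaskContinuationOptions.OnlyOnFaulted);
                Console.Error.WriteLine($"Timeout during shutdown step [{name}] after {timeout}");
                return false;
            }

            await task.ConfigureAwait(false); // rethrows if the step faulted or was canceled
            return true;
        }
        catch (Exception e)
        {
            Console.Error.WriteLine($"Error during shutdown step [{name}]: {e.Message}");
            return false;
        }
    }

    // Runs every step in order; a failed or hung step does not prevent later steps from running.
    // Returns the number of steps that failed or timed out.
    public static async Task<int> RunAllAsync(TimeSpan timeout, params (string Name, Func<Task> Step)[] steps)
    {
        var failures = 0;
        foreach (var (name, step) in steps)
        {
            if (!await AttemptStepAsync(step, timeout, name).ConfigureAwait(false))
            {
                failures++;
            }
        }
        return failures;
    }
}
```

Inside `ShutdownAsync`, the gossip call shown earlier could then be passed as one step, for example `("SetGracefullyLeft", () => Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()))`, followed by the remaining shutdown steps. Taking a `Func<Task>` rather than a `Task` means a step that throws synchronously is also caught.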
Thanks.
Here's the earlier mentioned error for reference:
RootContext Got exception waiting for RequestAsync response of SetGossipStateKey:SetGossipStateKey { Key = cluster:left, Value = { } } from nonhost/$gossip
System.TimeoutException
Request didn't receive any Response within the expected time.
StackTraceString: at Proto.Future.SharedFutureProcess.SharedFutureHandle.GetTask(CancellationToken cancellationToken)
at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
at Proto.RootLoggingContext.RequestAsync[T](PID target, Object message, CancellationToken cancellationToken)
benbenwilde changed the title from "Timeout setting gossip state to GracefullyLeft causes graceful shutdown to never finish" to "ShutdownAsync will not complete when it encounters an error, leaving member in zombie state" on Aug 28, 2024
The issue causing gossip timeouts should be fixed by #2133, but this post brings up a separate issue: the shutdown process is not reliable. This is pretty bad because the member keeps running but throws various errors and cannot do anything useful (zombie state), since it was marked as shutting down but never finished. I'm currently working around this by listening for .Cluster().MemberList.Stopping; when that triggers, I give it 3 minutes to complete a clean shutdown, and otherwise I stop the application anyway.
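For anyone wanting a similar stopgap, the workaround could be sketched roughly as below. This is only an illustration under assumptions: `ShutdownWatchdog` and its parameters are hypothetical, `isStopping` stands in for however you read the `MemberList.Stopping` flag, `shutdownCompleted` is whatever task represents your clean shutdown, and `forceStop` is whatever you use to end the process.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ShutdownWatchdog
{
    // Polls `isStopping`; once it is true, waits up to `gracePeriod` for `shutdownCompleted`
    // to finish successfully, and otherwise calls `forceStop`.
    // Returns true if forceStop was invoked, false if the shutdown finished cleanly in time.
    public static async Task<bool> RunAsync(
        Func<bool> isStopping,
        Task shutdownCompleted,
        TimeSpan gracePeriod,
        Action forceStop,
        TimeSpan pollInterval,
        CancellationToken cancellationToken = default)
    {
        while (!isStopping())
        {
            await Task.Delay(pollInterval, cancellationToken).ConfigureAwait(false);
        }

        var finished = await Task
            .WhenAny(shutdownCompleted, Task.Delay(gracePeriod, cancellationToken))
            .ConfigureAwait(false);

        // A faulted shutdown task (the case described in this issue) counts as "not clean".
        if (finished == shutdownCompleted && shutdownCompleted.IsCompletedSuccessfully)
        {
            return false;
        }

        // Do not force a stop just because the watchdog itself was canceled.
        cancellationToken.ThrowIfCancellationRequested();

        forceStop();
        return true;
    }
}
```

In my description above the grace period would be `TimeSpan.FromMinutes(3)`, and `forceStop` could be something like `() => Environment.Exit(1)`, depending on how the application is hosted.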