ShutdownAsync will not complete when it encounters an error, leaving member in zombie state #2128

Open
benbenwilde opened this issue Jun 26, 2024 · 1 comment

@benbenwilde (Contributor)

I am running the latest pre-release, 1.6.1-alpha.0.22.

Here is where it occurs in the Cluster class:

public async Task ShutdownAsync(bool graceful = true, string reason = "")
{
    Logger.LogInformation("Stopping Cluster {Id}", System.Id);

    // Inform all members of the cluster that this node intends to leave. Also, let the MemberList know that this
    // node was the one that initiated the shutdown to prevent another shutdown from being called.
    Logger.LogInformation("Setting GracefullyLeft gossip state for {Id}", System.Id);
    MemberList.Stopping = true;
    await Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()).ConfigureAwait(false);

    // ... continues with the rest of the shutdown

As you can see, an error there would stop the graceful shutdown in its tracks. The member is blocked but never gets to shut down, so throughout the cluster I see tons of "we are blocked" or "they are blocked" messages. Furthermore, it never attempts the shutdown again because MemberList.Stopping is now set to true.

I'm not sure why the GossipActor is not able to respond, so that's another thing I need to look into, since the gossip loop is also timing out on the BlockGracefullyLeft part.

Nonetheless, it seems it would be good if a problem with the gossip actor did not stop the member from being able to shut down in a situation like this. I'm curious what other people's thoughts are, and I would be happy to submit some changes for this as well.

Basically I'm thinking about adding try/catches and possibly timeouts around each step, so that we can still attempt the rest of the shutdown.

Off the top of my head, something like:

await AttemptTask(Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()),
    TimeSpan.FromSeconds(1), "Setting GracefullyLeft").ConfigureAwait(false);

Where AttemptTask looks something like:

private async Task AttemptTask(Task task, TimeSpan timeout, string name)
{
    _ = task.ContinueWith(t =>
    {
        // if the task fails after we timeout, still log the error
        if (!t.IsCompletedSuccessfully)
        {
            Logger.LogError(t.Exception, "Error during shutdown step [{stepName}]", name);
        }
    });
    
    try
    {
        await Task.WhenAny(Task.Delay(timeout), task);
        
        if (!task.IsCompleted)
        {
            // if the task isn't complete, we timed out
            Logger.LogError(t.Exception, "Timeout during shutdown step [{stepName}] after {timeout}", name, timeout);
        }
    }
    catch (Exception)
    {
        // if the task fails while we are waiting, it will already be logged
    }
}
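If the project targets .NET 6 or later, the same helper could also be written around Task.WaitAsync instead of Task.WhenAny; a minimal sketch, not the actual Proto.Cluster code:

private async Task AttemptTask(Task task, TimeSpan timeout, string name)
{
    try
    {
        // WaitAsync throws a TimeoutException if the task doesn't complete in time.
        await task.WaitAsync(timeout).ConfigureAwait(false);
    }
    catch (TimeoutException)
    {
        Logger.LogError("Timeout during shutdown step [{stepName}] after {timeout}", name, timeout);
    }
    catch (Exception e)
    {
        Logger.LogError(e, "Error during shutdown step [{stepName}]", name);
    }
}

One difference from the ContinueWith approach above is that a failure happening after the timeout would go unlogged.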

Thanks.


Here's the error mentioned earlier, for reference:

RootContext Got exception waiting for RequestAsync response of SetGossipStateKey:SetGossipStateKey { Key = cluster:left, Value = { } } from nonhost/$gossip
System.TimeoutException
Request didn't receive any Response within the expected time.
StackTraceString: at Proto.Future.SharedFutureProcess.SharedFutureHandle.GetTask(CancellationToken cancellationToken)
at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
at Proto.RootLoggingContext.RequestAsync[T](PID target, Object message, CancellationToken cancellationToken)
On Aug 28, 2024, benbenwilde changed the title from "Timeout setting gossip state to GracefullyLeft causes graceful shutdown to never finish" to "ShutdownAsync will not complete when it encounters an error, leaving member in zombie state".
@benbenwilde (Contributor, Author)

The issue causing the gossip timeouts should be fixed by #2133, but this post raises a separate problem: the shutdown process is not reliable. That is pretty bad, because the member keeps running but throws various errors and can't do anything useful (a zombie state), since it was set to shut down but never finished. I'm currently working around this by watching .Cluster().MemberList.Stopping; when it flips to true, I give the member 3 minutes to complete a clean shutdown and otherwise stop the application anyway.
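For reference, here is a minimal sketch of that workaround. It assumes the app runs on the generic host (BackgroundService / IHostApplicationLifetime); the type name and the one-second polling interval are illustrative choices, not part of Proto.Cluster:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Proto;
using Proto.Cluster;

public sealed class ZombieShutdownGuard : BackgroundService
{
    private readonly ActorSystem _system;
    private readonly IHostApplicationLifetime _lifetime;

    public ZombieShutdownGuard(ActorSystem system, IHostApplicationLifetime lifetime)
    {
        _system = system;
        _lifetime = lifetime;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            // Poll until the cluster signals that a shutdown was initiated.
            while (!_system.Cluster().MemberList.Stopping)
                await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);

            // Give the graceful shutdown up to 3 minutes to finish on its own,
            // then stop the host regardless so the member can't linger as a zombie.
            await Task.Delay(TimeSpan.FromMinutes(3), stoppingToken);
            _lifetime.StopApplication();
        }
        catch (OperationCanceledException)
        {
            // The host is already stopping; nothing to do.
        }
    }
}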
