Workflow failure due to runner shutdown/stoppage #2040

hamidgg · 2022-08-05T10:13:20Z

Description

Since 30 July 2022, our workflow fails with the following message:

"The self-hosted runner: ***** lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."

We run our workflow on an AWS EC2 instance which is always connected and has enough resources (CPU/memory). The above failure happens even for the parts of the workflow that don't require high utilization of CPU/memory.

It seems that the runner loses communication with GitHub and does not continue running the job.

Log

[2022-08-05 09:05:47Z INFO JobServer] Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 48, failed calls: 2, websocket state: Open).
[2022-08-05 09:05:47Z ERR JobServer] System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake.
---> System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host..
---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.

Runner Version and Platform

Version 2.294.0, running on Windows Server 2016

AvaStancu · 2022-08-10T15:50:58Z

@hamidgg would it be possible to share the RunnerListener and RunnerWorker logs too?

liu-shaojun · 2022-08-16T03:26:36Z

I got the same error when running a job on self-hosted runners:

The self-hosted runner: xxxx lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Other people have also reported the same error message:
#1546 (comment)
#1546 (comment) @pinggao187

Hi expert @AvaStancu, could you help to check on this? This issue is kind of blocking our progress...

seantleonard · 2022-08-17T03:35:51Z

I am getting this as well. Unfortunately the other issue referenced is closed but has many many reports (even after closure) about the same behavior.

hamidgg · 2022-08-17T09:07:05Z

@AvaStancu Sorry for my delayed reply. I've been waiting for another failure to get the RunnerListener and RunnerWorker logs as previous logs were cleaned up. I'll get back to you with the logs once a similar failure happens (hopefully not :D).

zaknafein83 · 2022-09-16T15:05:04Z

Same situation. I use AWS EC2 and get this error after the first run:

[2022-09-16 10:42:53Z INFO JobServerQueue] All queue process tasks have been stopped, and all queues are drained.
[2022-09-16 10:42:53Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/_work/_temp
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/actions-runner/bin'
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/actions-runner'
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Diag': '/home/ubuntu/actions-runner/_diag'
[2022-09-16 10:42:53Z INFO HostContext] Well known config file 'Telemetry': '/home/ubuntu/actions-runner/_diag/.telemetry'
[2022-09-16 10:42:53Z INFO JobRunner] Raising job completed event
[2022-09-16 10:42:53Z ERR GitHubActionsService] POST request to https://pipelines.actions.githubusercontent.com/HCYdTxD8O2BMG4LvM5MKcb35EY0sH1wNedn0yWzJce2QlAajYJ/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/3c6e1db1-0d9c-4bbe-b2fe-376050e30856/events failed. HTTP Status: BadRequest, AFD Ref: Ref A: 0FFF862B7A304ECE8208266780675163 Ref B: MIL30EDGE1321 Ref C: 2022-09-16T10:42:53Z
[2022-09-16 10:42:53Z ERR JobRunner] TaskOrchestrationPlanTerminatedException received, while attempting to raise JobCompletedEvent for job ca395085-040a-526b-2ce8-bdc85f692774.
[2022-09-16 10:42:53Z ERR JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationPlanTerminatedException: Orchestration plan 3c6e1db1-0d9c-4bbe-b2fe-376050e30856 is not in a runnable state.
   at GitHub.Services.WebApi.VssHttpClientBase.HandleResponseAsync(HttpResponseMessage response, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpMethod method, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
   at GitHub.DistributedTask.WebApi.TaskHttpClient.RaisePlanEventAsync[T](Guid scopeIdentifier, String planType, Guid planId, T eventData, CancellationToken cancellationToken, Object userState)
   at GitHub.Runner.Worker.JobRunner.CompleteJobAsync(IJobServer jobServer, IExecutionContext jobContext, AgentJobRequestMessage message, Nullable`1 taskResult)

still waiting for a solution :(

lankmiler · 2022-09-23T08:02:18Z

@zaknafein83 did you resolve it?
@AvaStancu I have the same issue now. We're using an EC2 instance; we stop it at the end of the workflow and start it at the beginning.

lankmiler · 2022-09-23T08:07:07Z

[2022-09-22 11:19:56Z INFO HostContext] Well known directory 'Work': '/home/ec2-user/actions-runner/_work'
[2022-09-22 11:19:57Z INFO JobServer] Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 21, failed calls: 1, websocket state: Open).
[2022-09-22 11:19:57Z ERR  JobServer] System.Net.WebSockets.WebSocketException (2): The remote party closed the WebSocket connection without completing the close handshake. ---> System.IO.IOException: Unable to write data to the transport connection: Broken pipe.
 ---> System.Net.Sockets.SocketException (32): Broken pipe
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
   at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.WriteWithoutBufferingAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
--- End of stack trace from previous location ---

zaknafein83 · 2022-09-23T08:49:33Z

"@zaknafein83 did you resolve it? @AvaStancu I have the same issue now. We're using an EC2 instance; we stop it at the end of the workflow and start it at the beginning."

Not yet; I have to restart my instance every time.

chantra · 2022-09-27T22:22:36Z

[taking my comment out, I think I got some logs mixed up]

Benikz · 2022-12-01T09:52:31Z

Hello all,

I would like to point out the big issue here.
The lost communication issue seems to be seen all over the place (see above).
We are hosting our own runners now (Github Enterprise) and we see this very often, but we cannot pinpoint the root cause.

We have enough RAM and a good internet connection. We think that the runners do not receive enough CPU time when we build applications, although we would expect the connection to stabilize sooner or later.
For testing purposes, we tried to increase the niceness of the run.sh script. This did not seem to have any positive effect.
The runners are ephemeral, although this should not be an issue.

The run.sh script does not fail or succeed. It just stops without any clear error.

Whatever the reason: We expect runners to be as stable as Jenkins nodes. It should not matter if the system is overloaded while building. The connection is dying randomly. Sometimes our builds even succeed, as expected. But sometimes, they just die.

Can this issue be emphasized more, please?
This is a core feature: connection stability. If this is not possible, then runners are simply unusable.

Sorry for being harsh, but this has literally been an issue for months now.

Stay healthy!

BR

Edit:
We use:

Runner version 2.293.1
Running on Ubuntu 20.04 VMs (AMD64)

Benikz · 2022-12-05T14:22:41Z

I did some more investigation and apparently it was a problem on our side while instantiating the runner via a VM through systemd management. The problem was a mixture of how our VM solution works with systemd.

I am not sure about the others now... It seems to be stable now, after fixing our services.

For the interested people:
We use Vagrant, which uses VirtualBox under the hood. Systemd instantiation works fine (@ symbol in the name), but VirtualBox uses a service process that is bound to only one systemd service (cgroup). When the systemd instance with the VBox service was done, it stopped all of the other boxes, which then led to the communication problem. This makes sense, as the communication to the provider was simply cut off.
It has nothing to do with the runner issue mentioned in this post, but I wanted to point out that our problem was caused by our own backend. I would suggest that people here check how they create their runners; maybe they are being stopped prematurely by something else.
Sorry for being harsh again.

Best regards!
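
For anyone who wants to check for the cgroup situation described above, here is a minimal sketch, assuming a Linux host where /proc/<pid>/cgroup is readable. It lists the systemd units that own a given process. The unit name used in the usage note (runner@1.service) is hypothetical, and the helper names are made up for illustration.

```python
from pathlib import Path


def cgroup_paths(cgroup_text):
    """Parse the contents of /proc/<pid>/cgroup into a list of cgroup paths.

    Each line has the form 'hierarchy-ID:controller-list:path'.
    """
    paths = []
    for line in cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) == 3:
            paths.append(parts[2])
    return paths


def owning_units(cgroup_text):
    """Return the systemd unit names (*.service or *.scope) found in the paths."""
    units = set()
    for path in cgroup_paths(cgroup_text):
        for segment in path.split("/"):
            if segment.endswith((".service", ".scope")):
                units.add(segment)
    return sorted(units)


def units_for_pid(pid):
    """Read /proc/<pid>/cgroup (Linux only) and return the owning units."""
    return owning_units(Path(f"/proc/{pid}/cgroup").read_text())
```

If units_for_pid() for a shared helper process (for example the VirtualBox service process) reports a single template instance such as runner@1.service, then stopping that one instance will probably take the helper down with it, because systemd by default kills every process in a unit's cgroup when the unit stops.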

ololobus · 2022-12-29T14:07:58Z

We started having the same issues ~1 month ago, we use custom-sized GitHub-hosted runners, though. Not sure, but it feels like it happens more often with 16 core runners.

The message is:

The hosted runner: XXX lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

If I try to get raw logs, they are almost empty, although a few job steps succeeded:

2022-12-29T13:14:48.9912020Z Requested labels: XXX
2022-12-29T13:14:48.9912099Z Job defined at: yyy/xxx/.github/workflows/testing.yml@refs/pull/1111/merge
2022-12-29T13:14:48.9912131Z Waiting for a runner to pick up this job...
2022-12-29T13:17:11.0365729Z Job is about to start running on the runner: XXX (organization)

Not sure this is related, but the runners were initially defined as Ubuntu 20.04 and started using 22.04 after December 15. The warning was:

Runner 'XXX' will start to use Ubuntu 22.04 starting from 15 December

UPD: I 'fixed' this by just re-creating the runners group in the GitHub UI. I.e. we had gha-ubuntu-20.04-16cores (automatically upgraded to 22.04 by GitHub), so I created and used gha-ubuntu-22.04-8cores instead. And it magically helped, all runs are passing now without any problems. Leaving it here as it may help someone.

And this makes me wonder, why? I thought that a runners group is just some stateless abstraction to limit usage, but it appears to be something stateful, i.e. it binds to some infra (?), so if it has some problems -- you will have too, and re-creating the group may help.

jbkc85 · 2023-01-16T16:25:31Z

I am now having this experience with self-hosted runners in AWS with no apparent cause. Disk is fine, Mem is fine, CPU is fine - but just randomly a GitHub runner decides it can no longer talk with the GitHub web sockets and fails to reconnect.

That being said, I see that a noticeable share (5-10%) of the web socket connections during a workflow run are erroring out and causing the web socket process to reconnect. Not sure if this is related, or a red herring.
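
A rough way to put a number on that share is to read the counters that the JobServer lines in this thread already print ("total calls: N, failed calls: M"). The sketch below assumes those counters are cumulative within one worker log, so it uses the last occurrence; that is an assumption, not something the runner documents. The function name is made up for illustration.

```python
import re

# Matches the counters that appear in JobServer log lines such as
# "... (total calls: 48, failed calls: 2, websocket state: Open)."
CALLS_RE = re.compile(r"total calls: (\d+), failed calls: (\d+)")


def websocket_failure_rate(log_text):
    """Return (total, failed, rate) from the last counter line, or None.

    Assumes the counters are cumulative, so the last match is the summary.
    """
    matches = CALLS_RE.findall(log_text)
    if not matches:
        return None
    total, failed = (int(n) for n in matches[-1])
    rate = failed / total if total else 0.0
    return total, failed, rate


if __name__ == "__main__":
    sample = (
        "[2022-08-05 09:05:47Z INFO JobServer] Caught exception during append "
        "web console line to websocket, let's fallback to sending via "
        "non-websocket call (total calls: 48, failed calls: 2, "
        "websocket state: Open)."
    )
    print(websocket_failure_rate(sample))
```

Running this over the Worker_*.log files of failed and successful jobs would show whether the failure rate actually differs between them, which is one way to tell a cause from a red herring.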

chtompki · 2023-03-13T17:22:21Z

+1 to this thread - and I'm even using --ephemeral runners, which should accommodate one job on one runner. I'm thinking about specifically stacking jobs onto a single runner with metadata and then deleting that runner when everything is done, but that defeats the purpose.

parker-vv · 2023-03-14T16:50:19Z

+1 same issue.

kirillmorozov · 2023-04-05T11:24:55Z

Faced the same issue when using AWS EC2 instances as self-hosted runners.

The self-hosted runner: i-020cc48127fe3f0bc lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

densto88 · 2023-04-10T16:09:43Z

We're seeing this in hosted runners also, here's the stack trace from our worker logs...

[2023-04-09 18:18:04Z ERR  JobServer] #####################################################
[2023-04-09 18:18:04Z ERR  JobServer] System.IO.IOException: Unable to write data to the transport connection: Broken pipe.
 ---> System.Net.Sockets.SocketException (32): Broken pipe
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
   at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---

   --- End of inner exception stack trace ---
   at System.Net.Security.SslStream.<WriteSingleChunk>g__CompleteWriteAsync|182_1[TIOAdapter](ValueTask writeTask, Byte[] bufferToReturn)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
[2023-04-09 18:18:04Z ERR  JobServer] #####################################################
[2023-04-09 18:18:04Z ERR  JobServer] System.Net.Sockets.SocketException (32): Broken pipe
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
   at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---

[2023-04-09 18:18:04Z INFO JobServer] Websocket is not open, let's attempt to connect back again with random backoff 00:00:00.2370000 ms (total calls: 159, failed calls: 12).
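
The last line shows the runner retrying the web socket with a random backoff. For readers unfamiliar with the pattern, here is a generic sketch of reconnecting with a randomized delay. It is not the runner's implementation; the connect callable, the delay bounds, and the attempt count are all hypothetical.

```python
import random
import time


def backoff_delays(attempts, min_delay=0.1, max_delay=0.5, rng=random):
    """Yield one uniformly random delay (in seconds) per reconnect attempt."""
    for _ in range(attempts):
        yield rng.uniform(min_delay, max_delay)


def reconnect_with_backoff(connect, attempts=5, sleep=time.sleep, **kwargs):
    """Call connect() until it succeeds or the attempts run out.

    connect is a hypothetical callable that raises OSError on failure.
    Returns the number of attempts used; re-raises the last error if
    every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt, delay in enumerate(backoff_delays(attempts, **kwargs), start=1):
        try:
            connect()
            return attempt
        except OSError as exc:
            last_error = exc
            sleep(delay)  # wait a random interval before the next try
    raise last_error
```

The random delay spreads out reconnects so that many clients dropped at the same moment do not all retry at once. A pattern like this only hides short interruptions; it cannot help if the runner process or the machine itself has gone away.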

kirillmorozov · 2023-04-11T07:20:03Z

Update to my case:

I was able to resolve the issue by using larger EC2 instances, so yeah, CPU/Memory starvation was the cause of this problem.
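
To check whether starvation is the cause before resizing, one option is to log a resource snapshot from a background step while the job runs. The sketch below is Linux-only and assumes /proc/meminfo exposes MemTotal and MemAvailable; the function names and the choice of metrics are illustrative, not part of the runner.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_available_fraction(meminfo):
    """Fraction of total memory that the kernel reports as available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def snapshot():
    """Return 1-minute load per CPU and the available-memory fraction."""
    with open("/proc/meminfo") as fh:
        mem = parse_meminfo(fh.read())
    load1, _, _ = os.getloadavg()
    return {
        "load1_per_cpu": load1 / (os.cpu_count() or 1),
        "mem_available": memory_available_fraction(mem),
    }
```

A load per CPU that stays well above 1, or an available-memory fraction close to 0 shortly before the runner drops, would support the starvation explanation; normal values at the time of the failure would point elsewhere.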

Togtja · 2024-01-19T09:28:34Z

We are also experiencing similar problems.
We are also running it as --ephemeral, and we have 4 builds in parallel compiling C++ code.
The communication loss seems to mostly happen during an artifact upload to GitHub using the Upload Action (https://github.com/actions/upload-artifact). However, it has also lost connection during a post-checkout stage.
We have attempted to use some EC2 machines that are insanely overkill for the task; however, the problem still seems to persist.

Since we are running 4 builds at a time, it nearly always fails in one of them. 3 are Ubuntu based and 1 is Windows, but it does not seem to affect just 1 type of OS. The GitHub runner version is always the newest, as the runners are created via a script that fetches the newest runner.

Our current "solution" is just to "rebuild failed jobs" until it works. However, long term this is unacceptable.

Similar to #2624 (comment)
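
As a stopgap, the "rebuild failed jobs" workaround can be scripted against the GitHub REST API endpoint for re-running the failed jobs of a workflow run. The sketch below only illustrates the request; the owner, repo, run ID, and token are placeholders you would supply, and the helper names are made up.

```python
import urllib.request

API = "https://api.github.com"


def rerun_failed_jobs_request(owner, repo, run_id, token):
    """Build the POST request that asks GitHub to re-run a run's failed jobs."""
    url = f"{API}/repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def rerun_failed_jobs(owner, repo, run_id, token, timeout=10.0):
    """Send the request and return the HTTP status code."""
    req = rerun_failed_jobs_request(owner, repo, run_id, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

This only automates the retry; it does nothing about the underlying loss of communication, so a cap on the number of automatic re-runs is advisable.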

machulav · 2024-02-23T09:59:12Z

We have the same issue, which happens from time to time with our runners

jbkc85 · 2024-02-23T16:09:32Z

For my particular scenario, the web socket errors are a red herring and aren't necessarily associated with the random loss of a runner.

If you are in AWS and running on SPOT instances, then depending on the SPOT instance settings and autoscaling, it's very possible that the SPOT instances are heavily associated with the random loss of a GH runner instance. In our particular case, we went from SPOT to ON_DEMAND node groups and went from 30-40% failure rates to 0.01%.
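
One way to confirm the Spot hypothesis is to poll the EC2 instance metadata service from the runner host, since it publishes a notice shortly before a Spot interruption. This is a sketch under the assumption that IMDSv2 is reachable from the instance; the function names and the 2-second timeout are illustrative.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Parse the JSON body of spot/instance-action into (action, time)."""
    doc = json.loads(body)
    return doc["action"], doc["time"]


def spot_interruption(timeout=2.0):
    """Return (action, time) if an interruption is scheduled, else None."""
    # IMDSv2: fetch a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as exc:
        if exc.code == 404:  # no interruption notice is pending
            return None
        raise
```

Logging the result every few seconds during a job and comparing the timestamps with the moment the runner went silent should show whether the lost runners were in fact reclaimed Spot instances.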

signor-mike · 2024-06-27T16:35:59Z

I downgraded Ubuntu from 22.04 LTS to 20.04 LTS and the workflow is no longer exhausting anything.

hamidgg added the bug Something isn't working label Aug 5, 2022

nikola-jokic added Runner Bug Bug fix scope to the runner needs-investigation labels Aug 8, 2022

AvaStancu assigned AvaStancu and unassigned AvaStancu Aug 9, 2022

mshitrit added a commit to mshitrit/self-node that referenced this issue Jan 23, 2023

trying to force change run group in order to 'fix' github time out is…

dda4d0c

…sue:actions/runner#2040 Signed-off-by: Michael Shitrit <[email protected]>

mshitrit added a commit to mshitrit/self-node that referenced this issue Jan 23, 2023

Revert "trying to force change run group in order to 'fix' github tim…

92239a9

…e out issue:actions/runner#2040" This reverts commit dda4d0c.

mshitrit added a commit to mshitrit/self-node that referenced this issue Jan 24, 2023

trying to force change run group in order to 'fix' github time out is…

e156a81

…sue:actions/runner#2040 Signed-off-by: Michael Shitrit <[email protected]>

jaimergp mentioned this issue Oct 18, 2023

Try GPU CI with cupy (DNM) conda-forge/cf-autotick-bot-test-package-feedstock#466

Closed

8 tasks

AlexandreSinger mentioned this issue Jul 19, 2024

[CI] Investigating CI Runners verilog-to-routing/vtr-verilog-to-routing#2652

Open

daijro added a commit to daijro/camoufox that referenced this issue Aug 16, 2024

CI/CD: Downgrade to ubuntu-20.04

5574b53

Attempting to fix actions/runner#2040
