Intermittent runner communication loss on AWS Fargate (ephemeral, ~2% failure rate) #4498

@vendys-sangheun

Describe the bug

An ephemeral self-hosted runner on AWS Fargate (Ubuntu 24.04, arm64) intermittently loses
communication with the GitHub Actions server shortly after a job starts. The failure is not
consistent: it occurs in roughly 1 of every 50 job runs (about 2%). When it happens, the Fargate
container log ends immediately after the "Running job" line, and the runner never produces
a completion log. The GitHub UI reports a communication loss error.

To Reproduce

This is an intermittent issue and cannot be reliably reproduced on demand. General setup:

  1. Register an ephemeral self-hosted runner on AWS Fargate (Ubuntu 24.04, arm64).
  2. Trigger a workflow that dispatches a job to the runner.
  3. In approximately 1 out of 50 runs, the runner stops responding after picking up the job.
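
Because the failure is rare, it helps to estimate how many runs a reproduction attempt (or a candidate fix) needs before a clean streak means anything. The following is a minimal sketch, assuming each run fails independently with the same probability; that independence is an assumption and is not established by the observations in this report. `runs_needed` is a hypothetical helper name.

```python
import math


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one failure in n runs) >= confidence.

    Assumes independent runs with a constant failure probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # P(no failure in n runs) = (1 - p)^n, which must be <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Roughly 1 failure in 50 runs, as reported in this issue.
    print(runs_needed(1 / 50))
```

Under that assumption, a 2% failure rate requires about 149 consecutive runs to expect at least one failure with 95% confidence, so a short run of successes says little about whether the problem is gone.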

Expected behavior

The runner should complete the job and produce the following log sequence:

Job completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner…
Runner exited with code: 0

Runner Version and Platform

  • Runner version: actions-runner-linux-arm64-2.311.0
  • OS: Ubuntu 24.04 (ARM64), running inside AWS Fargate
  • Runner mode: Ephemeral (--ephemeral)

What's not working?

The runner loses communication with the GitHub server after picking up a job.

GitHub UI error:

The self-hosted runner lost communication with the server. Verify the machine is running
and has a healthy network connection. Anything in your workflow that terminates the runner
process, starves it for CPU/Memory, or blocks its network access can cause this error.

Notes:

  • Memory and disk usage are logged every 5 seconds; no anomalies observed around failure time.
  • No signs of OOM, CPU starvation, or network disruption in Fargate metrics.
  • The runner container exits without producing a completion log or cleanup output.

Job Log Output

Failing run — Fargate container log ends abruptly at:
2026-05-21 01:03:53Z: Running job: ci-cd / dev-backend-ci-cd
(No further output. Container exits.)

Successful run — Fargate container log continues normally:
2026-05-21 01:03:53Z: Running job: ci-cd / dev-backend-ci-cd
2026-06-11 00:17:26Z: Job ci-cd / dev-backend-ci-cd completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner…
Runner exited with code: 0
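
To count affected runs across many container logs, the two patterns above can be told apart mechanically. This is a rough sketch that relies only on the log lines quoted in this report ("Running job: ..." and "completed with result: ..."); `classify_log` and its return values are hypothetical names, and other runner versions may word these lines differently.

```python
import re

# Patterns taken from the log lines quoted in this report.
RUNNING = re.compile(r"Running job: .+")
COMPLETED = re.compile(r"completed with result: (?P<result>\w+)")


def classify_log(text: str) -> str:
    """Classify one container log.

    Returns 'completed:<result>' if a completion line is present,
    'truncated' if a job started but no completion line follows,
    and 'not-started' if no job was picked up at all.
    """
    started = False
    for line in text.splitlines():
        if RUNNING.search(line):
            started = True
        match = COMPLETED.search(line)
        if match:
            return f"completed:{match.group('result')}"
    return "truncated" if started else "not-started"
```

Applied to the failing log above, this returns `truncated`; applied to the successful log, it returns `completed:Succeeded`.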

Runner and Worker's Diagnostic Logs

Diagnostic logs from the _diag folder were collected and reviewed. They contain more detailed
output, but only up to the following line; nothing is logged after it:

2026-05-21 01:03:53Z: Running job: ci-cd / dev-backend-ci-cd

The diagnostic logs confirm that the runner picked up the job successfully, but they give no
information about why the runner stopped responding afterward. The root cause of the hang or
exit after job pickup cannot be identified from the available logs.
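
Since the runner's own logs stop at that line, one way to gather more evidence is to start the runner through a small wrapper that records how the child process ended. The sketch below is not part of the runner. `describe_exit` and `run_and_report` are hypothetical names, and the command passed in is whatever the container entrypoint currently uses to start the runner. It relies on the documented POSIX behavior of Python's `subprocess`, where a child killed by signal N is reported with return code -N.

```python
import signal
import subprocess
import sys
from datetime import datetime, timezone


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with code {returncode}"


def run_and_report(command: list[str]) -> int:
    """Run the command, then print how it ended to stderr with a UTC timestamp."""
    returncode = subprocess.run(command).returncode
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    print(f"{stamp}: child {describe_exit(returncode)}", file=sys.stderr, flush=True)
    return returncode
```

This has clear limits: it only sees the direct child, so a signal delivered to a grandchild process is not reported, and if the whole container is killed at once the wrapper dies too and prints nothing. A missing wrapper line would itself suggest the termination came from outside the container.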

Labels

bug (Something isn't working)
