Intermittent runner communication loss on AWS Fargate (ephemeral, ~2% failure rate)

**Describe the bug**

An ephemeral self-hosted runner on AWS Fargate (Ubuntu 24.04) intermittently loses
communication with the GitHub Actions server shortly after a job starts. The failure is not
consistent: it occurs in roughly 1 of every 50 job runs. When it happens, the Fargate container
log ends immediately after the "Running job" line, and the runner never produces
a completion log. The GitHub UI reports a communication loss error.

**To Reproduce**

This is an intermittent issue and cannot be reliably reproduced on demand. General setup:

1. Register an ephemeral self-hosted runner on AWS Fargate (Ubuntu 24.04, arm64).
2. Trigger a workflow that dispatches a job to the runner.
3. In approximately 1 out of 50 runs, the runner stops responding after picking up the job.
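
For anyone trying to reproduce this, a rough estimate of how many runs to schedule may be useful. The sketch below assumes each run fails independently with a fixed probability of 1 in 50. That is an assumption, not something the logs establish, and `runs_needed` is a hypothetical helper name.

```python
import math


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that P(at least one failure in n runs) >= confidence.

    Assumes every run fails independently with the same probability.
    """
    if not (0.0 < failure_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    # P(no failure in n runs) = (1 - failure_rate) ** n <= 1 - confidence
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    print(runs_needed(1 / 50, 0.95))  # prints 149
```

Under these assumptions, about 149 runs are needed for a 95% chance of seeing at least one failure, so a short reproduction attempt that passes says little.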

**Expected behavior**

The runner should complete the job and produce the following log sequence:

Job  completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner…
Runner exited with code: 0

## Runner Version and Platform

- **Runner version:** `actions-runner-linux-arm64-2.311.0`
- **OS:** Ubuntu 24.04 (ARM64), running inside AWS Fargate
- **Runner mode:** Ephemeral (`--ephemeral`)

## What's not working?

The runner loses communication with the GitHub server after picking up a job.

**GitHub UI error:**
> The self-hosted runner lost communication with the server. Verify the machine is running
> and has a healthy network connection. Anything in your workflow that terminates the runner
> process, starves it for CPU/Memory, or blocks its network access can cause this error.

**Notes:**
- Memory and disk usage are logged every 5 seconds; no anomalies observed around failure time.
- No signs of OOM, CPU starvation, or network disruption in Fargate metrics.
- The runner container exits without producing a completion log or cleanup output.

## Job Log Output

**Failing run** — Fargate container log ends abruptly at:
2026-05-21 01:03:53Z: Running job: ci-cd / dev-backend-ci-cd
*(No further output. Container exits.)*

**Successful run** — Fargate container log continues normally:
2026-05-21 01:03:53Z: Running job: ci-cd / dev-backend-ci-cd
2026-06-11 00:17:26Z: Job ci-cd / dev-backend-ci-cd completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner…
Runner exited with code: 0

## Runner and Worker's Diagnostic Logs

Diagnostic logs from the `_diag` folder were collected and reviewed. They contain more
detail, but only **up to** the following line; nothing is logged after it:

2026-05-21 01:03:53Z: Running job: ci-cd / dev-backend-ci-cd

The diagnostic logs confirm the runner picked up the job successfully, but provide no
information about what caused the runner to stop responding afterward. The root cause
of the hang/exit after job dispatch remains unidentified from the available logs.
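
To measure the failure rate across many container logs, truncated runs can be flagged automatically. The sketch below assumes the log line formats shown in this report ("Running job: ..." and "Job ... completed with result: ..."); `classify_run` and its return values are hypothetical.

```python
import re

# Patterns based on the log lines quoted in this report.
RUNNING = re.compile(r"Running job: (?P<job>.+)$")
COMPLETED = re.compile(r"Job (?P<job>.+) completed with result: (?P<result>\w+)$")


def classify_run(log_text: str) -> str:
    """Return 'completed', 'truncated', or 'not_started' for one container log."""
    started = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if COMPLETED.search(line):
            return "completed"
        if RUNNING.search(line):
            started = True
    # A job was picked up but no completion line followed.
    return "truncated" if started else "not_started"


if __name__ == "__main__":
    failing = "2026-05-21 01:03:53Z: Running job: ci-cd / dev-backend-ci-cd\n"
    print(classify_run(failing))  # prints truncated
```

Counting `truncated` results over a batch of logs gives an observed rate to compare against the estimated 1 in 50.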
