Releases failing randomly with agent error "We stopped hearing from agent" #3994

Closed
BenH-Puregym opened this issue Oct 17, 2022 · 17 comments
Labels: Area: Agent, Kubernetes (Issues related to AKS, KEDA, etc.), stale

@BenH-Puregym

Following up on #3855, which is now closed.

This is still happening, I'm afraid. It didn't occur for a couple of weeks, but we've had multiple instances in the last week. I've been able to capture more information this time.

When a job finishes, the agent should be removed from Azure DevOps so that we have short-lived agents. From the logs of this occurrence, instead of the agent being removed immediately after the job finished (as happens on other agents, which I can verify from their logs), the removal failed over and over again and only eventually succeeded. In the meantime, a job from a completely different pipeline picked up the same agent, and once the agent was finally deleted, that new job failed with the "We stopped hearing from agent" error.
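For context, the "Cleanup. Removing Azure Pipelines agent..." and "Retrying in 30 seconds..." lines in the logs below come from the cleanup section of the start script Microsoft documents for containerised agents. A paraphrased sketch of that section (not our exact script; the AZP_TOKEN variable is a stand-in for however the PAT is supplied):

```bash
# Runs when the container exits: de-register this agent from the Azure DevOps pool.
cleanup() {
  trap "" EXIT
  if [ -e ./config.sh ]; then
    echo "Cleanup. Removing Azure Pipelines agent..."
    # config.sh remove fails while the server still believes the agent is busy,
    # so keep retrying until de-registration succeeds.
    while ! ./config.sh remove --unattended --auth PAT --token "$AZP_TOKEN"; do
      echo "Retrying in 30 seconds..."
      sleep 30
    done
  fi
}

trap 'cleanup; exit 0' EXIT
trap 'cleanup; exit 130' INT
trap 'cleanup; exit 143' TERM
```

While that loop is retrying, the agent is presumably still registered in the pool, which would explain how the second job gets routed to it just before removal finally succeeds.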
Logs from failed agent:

09:21:55.410
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:21:55.410363969Z stdout F 2022-10-11 09:21:55Z: Running job: Build app
 
09:30:19.771
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:19.765584908Z stdout F 2022-10-11 09:30:19Z: Job Build app completed with result: Succeeded
 
09:30:19.985
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:19.985180004Z stdout F Agent exit code 0
 
09:30:19.985
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:19.98571461Z stdout F Cleanup. Removing Azure Pipelines agent...
 
09:30:20.885
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:20.787535925Z stdout F Removing agent from the server
 
09:30:21.743
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:21.499530418Z stdout F Connecting to server ...
 
09:30:22.140
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.137644272Z stdout F Error reported in diagnostic logs. Please examine the log for more details.
 
09:30:22.140
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.137702573Z stdout F     - /azp/_diag/Agent_20221011-093020-utc.log
 
09:30:22.172
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.156237083Z stdout F Failed: Removing agent from the server
 
09:30:22.178
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.178472936Z stderr F Agent "ldevops-rqmlf--1-8hzwp" is running a job for pool "ubuntu-pool"
 
09:30:22.227
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.227043288Z stdout F Retrying in 30 seconds...


09:30:53.019
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:53.019535413Z stdout F Removing agent from the server
 
09:30:53.741
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:53.741424587Z stdout F Connecting to server ...
 
09:30:54.258
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.254130792Z stdout F Error reported in diagnostic logs. Please examine the log for more details.
 
09:30:54.258
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.254202393Z stdout F     - /azp/_diag/Agent_20221011-093052-utc.log
 
09:30:54.270
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.270176374Z stdout F Failed: Removing agent from the server
 
09:30:54.290
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.289914798Z stderr F Agent "ldevops-rqmlf--1-8hzwp" is running a job for pool "ubuntu-pool"
 
09:30:54.309
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.309693622Z stdout F Retrying in 30 seconds...
 
09:31:25.027
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:25.027092346Z stdout F Removing agent from the server
 
09:31:25.652
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:25.652210432Z stdout F Connecting to server ...
 
 
09:31:26.192
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:26.191664061Z stdout F Error reported in diagnostic logs. Please examine the log for more details.
 
09:31:26.192
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:26.191757062Z stdout F     - /azp/_diag/Agent_20221011-093124-utc.log
 
09:31:26.210
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:26.21034287Z stdout F Failed: Removing agent from the server
@vmapetr
Contributor

vmapetr commented Oct 17, 2022

Hi @BenH-Puregym!
Could you please confirm: are the agents that don't get purged after job execution stuck regardless of the job outcome, or do their jobs initially fail with the same "Invalid character after parsing property name" error from the parent issue?

@BenH-Puregym
Author

Hi @vmapetr, yes that's right; I haven't seen the "Invalid character after parsing property name" error.

@vmapetr
Contributor

vmapetr commented Oct 18, 2022

@BenH-Puregym so it seems the issue is not coming from the communication between the agent and AzDO this time. Could you please clarify where the cleanup logs you mentioned in the issue description are coming from? AFAIK, Azure DevOps itself does not manage agent orchestration as far as complete agent deletion is concerned, so it looks like you got these from the AKS scaler or KEDA, right?
Also, is it possible to extract the mentioned diag logs from agents that fail to clean up, before the agent is completely purged?

@BenH-Puregym
Author

We have enrolled the cluster in New Relic to get these logs.
We use KEDA to watch the queue and deploy/delete the jobs when Azure DevOps marks them as completed, but of course it's not down to KEDA to delete the agent from Azure DevOps; that happens in the script Microsoft provides.

I have added the --once flag to the script, as recommended for this scenario, so at the end of a job the agent should remove itself, which notifies Kubernetes that the container has completed.
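For reference, the tail of our start script looks roughly like this (a simplified sketch; the AZP_URL/AZP_TOKEN variables and the hostname-based agent name are placeholders for our setup):

```bash
# Register an ephemeral agent against the pool (simplified; real token handling omitted).
./config.sh --unattended \
  --url "$AZP_URL" \
  --auth PAT --token "$AZP_TOKEN" \
  --pool "ubuntu-pool" \
  --agent "$(hostname)" \
  --replace \
  --acceptTeeEula & wait $!

# --once: take exactly one job, then exit. Kubernetes then sees the container
# as completed, and the EXIT trap from the cleanup section runs config.sh remove
# to de-register the agent from Azure DevOps.
./run.sh --once & wait $!
```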

@KonstantinTyukalov added the "Kubernetes (Issues related to AKS, KEDA, etc.)" label Oct 24, 2022
@vmapetr
Contributor

vmapetr commented Oct 25, 2022

@BenH-Puregym From what we can see, the agent has some intermittent network issues, which is expected; but when working with the --once flag, we hit an issue where ADO itself does not know that the agent is preparing for cleanup. We are working on a solution right now.
In the meantime, could you please provide the contents of the /azp/_diag/ folder from the failed machine? With those logs, we can suggest a temporary workaround.
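If the pod is still around when the failure is spotted, something like the following should pull that folder out before the pod is purged (the azdevops namespace is a placeholder; the pod name is taken from the logs above):

```bash
# Copy the agent diagnostic logs out of the pod before it is deleted.
kubectl cp azdevops/ldevops-rqmlf--1-8hzwp:/azp/_diag ./agent-diag

# Or print the newest diag file directly if copying is not an option:
kubectl exec -n azdevops ldevops-rqmlf--1-8hzwp -- \
  sh -c 'cat "/azp/_diag/$(ls -t /azp/_diag | head -n 1)"'
```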

@BenH-Puregym
Author

@vmapetr that's really good to know that it's a problem with ADO rather than with us.
I'd love to provide the content you're asking for, but it's one of those annoying issues that sometimes happens multiple times a day and then might not happen for 1-2 weeks.
Won't that folder just show the same thing as what I sent above, or what you see in the pipeline output?

@lkt82

lkt82 commented Nov 1, 2022

Hi, we are seeing this behavior on an AKS/KEDA setup as well.

For us it's easily provoked by queuing 20+ pipeline runs.
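For anyone trying to reproduce, queuing that many runs can be scripted with the Azure DevOps CLI (organization, project and pipeline names below are placeholders):

```bash
# Queue 25 runs of the same pipeline back to back to saturate the KEDA-scaled pool.
for i in $(seq 1 25); do
  az pipelines run \
    --organization https://dev.azure.com/your-org \
    --project your-project \
    --name your-pipeline
done
```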

@BenH-Puregym
Author

Hi @vmapetr, have you had any luck finding the cause? Every so often we're still getting the error multiple times a day.

@darren-mcdonald

Hi @vmapetr, has there been any progress on this?

@ericyew

ericyew commented May 3, 2023

Any update on a fix for this?

@yys2000

yys2000 commented Jun 21, 2023

We also randomly experienced this issue when we used the self-hosted container app build agent.

@ericyew

ericyew commented Jul 4, 2023

> @BenH-Puregym From what it seems, the agent has some intermittent network issues, which is expected, but while working with the --once flag we facing the issue when ADO itself does not know that agent is preparing for cleanup. We are working on a solution right now. in a meantime, could you please provide the content of the /azp/_diag/ folder from the failed machine? By those logs, we can provide a temporary workaround.

@vmapetr is this still being looked at? Any solution coming for this?


This issue has had no activity in 180 days. Please comment if it is not actually stale.

@ericyew

ericyew commented Jan 14, 2024

This is not resolved yet.

@andrewhaine1

I also encounter this randomly with ACA Container App Jobs. I thought that perhaps AKS would be the solution, as there would be a lot more flexibility in troubleshooting the problem, but it seems the same issue occurs on AKS, as @lkt82 has pointed out.

@Az8th

Az8th commented May 2, 2024

#4313 (comment)
This may fix the "We stopped hearing from agent" error!

@LeaCCC

LeaCCC commented May 24, 2024

This issue still happens with a self-hosted Windows agent. I just logged my bug here: #4813
