Releases failing randomly with agent error "We stopped hearing from agent" #3994

Closed
BenH-Puregym opened this issue Oct 17, 2022 · 17 comments
Labels: Area: Agent, Kubernetes (Issues related to AKS, KEDA, etc.), stale

@BenH-Puregym

Following up on #3855, which is now closed.

This is still happening, I'm afraid. It didn't occur for a couple of weeks, but we've had multiple instances in the last week. I've been able to capture more information this time.

When a job finishes, the agent should be removed from Azure DevOps so that we have short-lived agents. From the logs of this occurrence, instead of the agent being removed immediately after the job finished (as happens on other agents, which I can verify from their logs), the removal failed over and over again and only eventually succeeded. In the meantime, a job from a completely different pipeline picked up the same agent, and once the agent was finally deleted, that new job failed with the "We stopped hearing from agent" error.
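For context, the "Cleanup. Removing Azure Pipelines agent..." and "Retrying in 30 seconds..." lines in the logs below come from the cleanup section of the start script Microsoft documents for containerised agents. A paraphrased sketch of that section (not our exact script; the AZP_TOKEN variable is a stand-in for however the PAT is supplied):

```bash
# Runs when the container exits: de-register this agent from the Azure DevOps pool.
cleanup() {
  trap "" EXIT
  if [ -e ./config.sh ]; then
    echo "Cleanup. Removing Azure Pipelines agent..."
    # config.sh remove fails while the server still believes the agent is busy,
    # so keep retrying until de-registration succeeds.
    while ! ./config.sh remove --unattended --auth PAT --token "$AZP_TOKEN"; do
      echo "Retrying in 30 seconds..."
      sleep 30
    done
  fi
}

trap 'cleanup; exit 0' EXIT
trap 'cleanup; exit 130' INT
trap 'cleanup; exit 143' TERM
```

While that loop is retrying, the agent is presumably still registered in the pool, which would explain how the second job gets routed to it just before removal finally succeeds.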
Logs from failed agent:

09:21:55.410
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:21:55.410363969Z stdout F 2022-10-11 09:21:55Z: Running job: Build app
 
09:30:19.771
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:19.765584908Z stdout F 2022-10-11 09:30:19Z: Job Build app completed with result: Succeeded
 
09:30:19.985
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:19.985180004Z stdout F Agent exit code 0
 
09:30:19.985
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:19.98571461Z stdout F Cleanup. Removing Azure Pipelines agent...
 
09:30:20.885
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:20.787535925Z stdout F Removing agent from the server
 
09:30:21.743
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:21.499530418Z stdout F Connecting to server ...
 
09:30:22.140
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.137644272Z stdout F Error reported in diagnostic logs. Please examine the log for more details.
 
09:30:22.140
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.137702573Z stdout F     - /azp/_diag/Agent_20221011-093020-utc.log
 
09:30:22.172
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.156237083Z stdout F Failed: Removing agent from the server
 
09:30:22.178
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.178472936Z stderr F Agent "ldevops-rqmlf--1-8hzwp" is running a job for pool "ubuntu-pool"
 
09:30:22.227
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:22.227043288Z stdout F Retrying in 30 seconds...


09:30:53.019
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:53.019535413Z stdout F Removing agent from the server
 
09:30:53.741
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:53.741424587Z stdout F Connecting to server ...
 
09:30:54.258
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.254130792Z stdout F Error reported in diagnostic logs. Please examine the log for more details.
 
09:30:54.258
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.254202393Z stdout F     - /azp/_diag/Agent_20221011-093052-utc.log
 
09:30:54.270
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.270176374Z stdout F Failed: Removing agent from the server
 
09:30:54.290
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.289914798Z stderr F Agent "ldevops-rqmlf--1-8hzwp" is running a job for pool "ubuntu-pool"
 
09:30:54.309
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:30:54.309693622Z stdout F Retrying in 30 seconds...
 
09:31:25.027
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:25.027092346Z stdout F Removing agent from the server
 
09:31:25.652
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:25.652210432Z stdout F Connecting to server ...
 
 
09:31:26.192
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:26.191664061Z stdout F Error reported in diagnostic logs. Please examine the log for more details.
 
09:31:26.192
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:26.191757062Z stdout F     - /azp/_diag/Agent_20221011-093124-utc.log
 
09:31:26.210
devtools-ne-aks
ldevops-rqmlf--1-8hzwp
2022-10-11T09:31:26.21034287Z stdout F Failed: Removing agent from the server
@vmapetr
Contributor

vmapetr commented Oct 17, 2022

Hi @BenH-Puregym!
Could you please confirm: are the agents that don't get purged after job execution stuck regardless of the job outcome, or do their jobs initially fail with the same "Invalid character after parsing property name" error from the parent issue?

@BenH-Puregym
Author

Hi @vmapetr, yes that's right; I haven't seen the "Invalid character after parsing property name" error.

@vmapetr
Contributor

vmapetr commented Oct 18, 2022

@BenH-Puregym so it seems the issue is not coming from the communication between the agent and AzDO this time. Could you please clarify where the cleanup logs you mentioned in the issue description are coming from? AFAIK, Azure DevOps itself does not manage agent orchestration as far as complete agent deletion is concerned, so it looks like you got these from the AKS scaler or KEDA, right?
Also, is it possible to extract the mentioned diag logs from agents that fail to clean up, before the agent is completely purged?

@BenH-Puregym
Author

We have enrolled the cluster in New Relic to get these logs.
We use KEDA to watch the queue and deploy/delete the jobs when Azure DevOps marks them as completed, but of course it's not down to KEDA to delete the agent from Azure DevOps; that happens in the script Microsoft provides.

I have added the --once flag to the script, as recommended for this scenario, so at the end of a job the agent should remove itself, which notifies Kubernetes that the container has completed.
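For reference, the tail of our start script looks roughly like this (a simplified sketch; the AZP_URL/AZP_TOKEN variables and the hostname-based agent name are placeholders for our setup):

```bash
# Register an ephemeral agent against the pool (simplified; real token handling omitted).
./config.sh --unattended \
  --url "$AZP_URL" \
  --auth PAT --token "$AZP_TOKEN" \
  --pool "ubuntu-pool" \
  --agent "$(hostname)" \
  --replace \
  --acceptTeeEula & wait $!

# --once: take exactly one job, then exit. Kubernetes then sees the container
# as completed, and the EXIT trap from the cleanup section runs config.sh remove
# to de-register the agent from Azure DevOps.
./run.sh --once & wait $!
```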

@KonstantinTyukalov added the "Kubernetes (Issues related to AKS, KEDA, etc.)" label Oct 24, 2022
@vmapetr
Contributor

vmapetr commented Oct 25, 2022

@BenH-Puregym From what we can see, the agent has some intermittent network issues, which is expected; but when working with the --once flag, we hit an issue where ADO itself does not know that the agent is preparing for cleanup. We are working on a solution right now.
In the meantime, could you please provide the contents of the /azp/_diag/ folder from the failed machine? With those logs, we can suggest a temporary workaround.
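If the pod is still around when the failure is spotted, something like the following should pull that folder out before the pod is purged (the azdevops namespace is a placeholder; the pod name is taken from the logs above):

```bash
# Copy the agent diagnostic logs out of the pod before it is deleted.
kubectl cp azdevops/ldevops-rqmlf--1-8hzwp:/azp/_diag ./agent-diag

# Or print the newest diag file directly if copying is not an option:
kubectl exec -n azdevops ldevops-rqmlf--1-8hzwp -- \
  sh -c 'cat "/azp/_diag/$(ls -t /azp/_diag | head -n 1)"'
```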

@BenH-Puregym
Author

@vmapetr that's really good to know that it's a problem with ADO rather than with us.
I'd love to provide the content you're asking for, but it's one of those annoying issues that sometimes happens multiple times a day and then might not happen for 1-2 weeks.
Won't that folder just show the same thing as what I sent above, or what you see in the pipeline output?

@lkt82

lkt82 commented Nov 1, 2022

Hi, we are seeing this behavior on an AKS/KEDA setup as well.

For us it's easily provoked by queuing 20+ pipeline runs.
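For anyone trying to reproduce, queuing that many runs can be scripted with the Azure DevOps CLI (organization, project and pipeline names below are placeholders):

```bash
# Queue 25 runs of the same pipeline back to back to saturate the KEDA-scaled pool.
for i in $(seq 1 25); do
  az pipelines run \
    --organization https://dev.azure.com/your-org \
    --project your-project \
    --name your-pipeline
done
```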

@BenH-Puregym
Author

Hi @vmapetr, have you had any luck finding the cause? Every so often we're still getting the error multiple times a day.

@darren-mcdonald

Hi @vmapetr, has there been any progress on this?

@ericyew

ericyew commented May 3, 2023

Any update on a fix for this?

@yys2000

yys2000 commented Jun 21, 2023

We also randomly experienced this issue when we used the self-hosted container app build agent.

@ericyew

ericyew commented Jul 4, 2023

> @BenH-Puregym From what it seems, the agent has some intermittent network issues, which is expected, but while working with the --once flag we facing the issue when ADO itself does not know that agent is preparing for cleanup. We are working on a solution right now. in a meantime, could you please provide the content of the /azp/_diag/ folder from the failed machine? By those logs, we can provide a temporary workaround.

@vmapetr is this still being looked at? Any solution coming for this?


This issue has had no activity in 180 days. Please comment if it is not actually stale.

@ericyew

ericyew commented Jan 14, 2024

This is not resolved yet.

@andrewhaine1

I also encounter this randomly with ACA Container App Jobs. I thought that perhaps AKS would be the solution, as there would be a lot more flexibility in troubleshooting the problem, but it seems the same issue occurs on AKS, as @lkt82 has pointed out.

@Az8th

Az8th commented May 2, 2024

#4313 (comment)
This may fix the "We stopped hearing from agent" error!

@LeaCCC

LeaCCC commented May 24, 2024

This issue still happens with a self-hosted Windows agent. I just logged my bug here: #4813
