
Releases failing randomly with agent error "Unhandled exception happened in worker" #3855

Closed
zerowebcorp opened this issue Jun 2, 2022 · 13 comments

Comments


zerowebcorp commented Jun 2, 2022

Our pipelines are failing randomly with this error message produced by the Azure DevOps (ADO) agent:

Agent package linux-x64.
Running on Linux (X64).
RuntimeInformation: Linux 5.4.0-1078-azure 81~18.04.1-Ubuntu SMP Mon Apr 25 23:16:13 UTC 2022.
Running as container on Kubernetes (AKS)
Azure DevOps Type and Version: dev.azure.com

[2022-06-02 11:31:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /azp/.credentials_rsaparams
[2022-06-02 11:31:28Z INFO MessageListener] Message '52' received from session 'ba0c5d02-28f2-4b4b-850b-42348f2fdbb2'.
[2022-06-02 11:31:28Z INFO JobDispatcher] Job request 615941 for plan c54e7a4e-ad11-46cc-9bc0-0c73e0cbd070 job 53e656e4-982d-505a-f1c4-7615adf3d251 received.
[2022-06-02 11:31:28Z INFO JobDispatcher] Job request 16a257d5-bebc-5958-3a7b-b4edf1fea8ba processed succeed.
[2022-06-02 11:31:28Z INFO Terminal] WRITE LINE: 2022-06-02 11:31:28Z: Running job: Agent job
[2022-06-02 11:31:28Z INFO AgentServer] Refresh JobRequest VssConnection to get on a different AFD node.
[2022-06-02 11:31:28Z INFO AgentServer] Establish connection with 30 seconds timeout.
[2022-06-02 11:31:28Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2022-06-02 11:31:28Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2022-06-02 11:31:28Z INFO JobDispatcher] Start renew job request 615941 for job 53e656e4-982d-505a-f1c4-7615adf3d251.
[2022-06-02 11:31:28Z INFO JobDispatcher] Successfully renew job request 615941, job is valid till 06/02/2022 11:41:28
[2022-06-02 11:31:28Z INFO HostContext] Well known directory 'Bin': '/azp/bin'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper] Starting process:
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   File name: '/azp/bin/Agent.Worker'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Arguments: 'spawnclient 17 134'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Working directory: '/azp/bin'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Require exit code zero: 'False'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'True'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper]   High priority process: 'True'
[2022-06-02 11:31:28Z INFO ProcessInvokerWrapper] Process started with process id 1515, waiting for process exit.
[2022-06-02 11:31:28Z INFO JobDispatcher] Send job request message to worker for job 53e656e4-982d-505a-f1c4-7615adf3d251 (70 KB).
[2022-06-02 11:31:40Z INFO JobDispatcher] Took 11318 ms to send job message to worker
[2022-06-02 11:31:40Z INFO JobNotification] Entering JobStarted Notification
[2022-06-02 11:31:40Z INFO JobNotification] Entering StartMonitor
[2022-06-02 11:31:44Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2022-06-02 11:31:44Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2022-06-02 11:31:44Z INFO ProcessInvokerWrapper] Exited process 1515 with exit code 1
[2022-06-02 11:31:44Z INFO ProcessInvokerWrapper] Finished process 1515 with exit code 1, and elapsed time 00:00:15.5187047.
[2022-06-02 11:31:44Z INFO JobDispatcher] Worker finished for job 53e656e4-982d-505a-f1c4-7615adf3d251. Code: 1
[2022-06-02 11:31:44Z INFO JobDispatcher] Return code 1 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.
[2022-06-02 11:31:44Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2022-06-02 11:31:44Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2022-06-02 11:31:44Z INFO JobDispatcher] finish job request for job 53e656e4-982d-505a-f1c4-7615adf3d251 with result: Failed
[2022-06-02 11:31:44Z INFO Terminal] WRITE LINE: 2022-06-02 11:31:44Z: Job Agent job completed with result: Failed
[2022-06-02 11:31:44Z INFO JobDispatcher] Stop renew job request for job 53e656e4-982d-505a-f1c4-7615adf3d251.
[2022-06-02 11:31:44Z INFO JobDispatcher] job renew has been canceled, stop renew job request 615941.
[2022-06-02 11:31:44Z ERR  JobDispatcher] Unhandled exception happened in worker:
[2022-06-02 11:31:44Z ERR  JobDispatcher] Newtonsoft.Json.JsonReaderException: Invalid character after parsing property name. Expected ':' but got: m. Path 'requestId', line 1, position 32006.
   at Newtonsoft.Json.JsonTextReader.ParseProperty()
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateObject(Object newObject, JsonReader reader, JsonObjectContract contract, JsonProperty member, String id)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
   at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonSerializer.Deserialize[T](JsonReader reader)
   at Microsoft.VisualStudio.Services.WebApi.JsonUtility.FromString[T](String toDeserialize, JsonSerializerSettings settings)
   at Microsoft.VisualStudio.Services.Agent.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at Microsoft.VisualStudio.Services.Agent.Worker.Program.MainAsync(IHostContext context, String[] args)
Error reported in diagnostic logs. Please examine the log for more details.
    - /azp/_diag/Worker_20220602-113138-utc.log
[2022-06-02 11:31:44Z INFO JobNotification] Entering JobCompleted Notification
[2022-06-02 11:31:44Z INFO JobNotification] Entering EndMonitor

File: /azp/_diag/Worker_20220602-113138-utc.log

root@azure-devops-agent-5857df4ff-6d4bj:/azp# cat /azp/_diag/Worker_20220602-113630-utc.log
[2022-06-02 11:36:30Z INFO Program] Version: 2.204.0
[2022-06-02 11:36:30Z INFO Program] Commit: 166abe8635999375bb798213066ccafbdd7e3875
[2022-06-02 11:36:30Z INFO Program] Culture:
[2022-06-02 11:36:30Z INFO Program] UI Culture:
[2022-06-02 11:36:31Z INFO HostContext] Well known directory 'Bin': '/azp/bin'
[2022-06-02 11:36:31Z INFO HostContext] Well known directory 'Root': '/azp'
[2022-06-02 11:36:31Z INFO HostContext] Well known config file 'Proxy': '/azp/.proxy'
[2022-06-02 11:36:31Z INFO VstsAgentWebProxy] No proxy setting found.
[2022-06-02 11:36:31Z INFO HostContext] Well known directory 'Bin': '/azp/bin'
[2022-06-02 11:36:31Z INFO HostContext] Well known directory 'Root': '/azp'
[2022-06-02 11:36:31Z INFO HostContext] Well known config file 'Certificates': '/azp/.certificates'
[2022-06-02 11:36:31Z INFO AgentCertificateManager] No certificate setting found.
[2022-06-02 11:36:31Z INFO Worker] Waiting to receive the job message from the channel.
[2022-06-02 11:36:31Z INFO Worker] Message received.
[2022-06-02 11:36:35Z ERR  Program] Newtonsoft.Json.JsonReaderException: Invalid character after parsing property name. Expected ':' but got: m. Path 'requestId', line 1, position 32006.
   at Newtonsoft.Json.JsonTextReader.ParseProperty()
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateObject(Object newObject, JsonReader reader, JsonObjectContract contract, JsonProperty member, String id)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
   at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonSerializer.Deserialize[T](JsonReader reader)
   at Microsoft.VisualStudio.Services.WebApi.JsonUtility.FromString[T](String toDeserialize, JsonSerializerSettings settings)
   at Microsoft.VisualStudio.Services.Agent.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at Microsoft.VisualStudio.Services.Agent.Worker.Program.MainAsync(IHostContext context, String[] args)

It is not clear from the error what is wrong. The same agent successfully completes some builds, which tells me there is no issue with the PAT token or the injected environment variables. The agent runs as a container in AKS.

@KonstantinTyukalov
Contributor

Hi @getvivekv, thanks for reporting! We are working on higher-priority issues at the moment, but will get back to this one soon.


markatky46 commented Jul 28, 2022

Any resolution or idea why this is happening? Same problem here, with a self-hosted agent on self-hosted Kubernetes.

It only happens intermittently. This particular job completed successfully on attempt 3. Sometimes it goes through the first time; sometimes it fails all 3 attempts and the pipeline falls over.

@markatky46

##[error]Newtonsoft.Json.JsonReaderException: Invalid character after parsing property name. Expected ':' but got: m. Path 'jobContainer', line 1, position 30728.
at Newtonsoft.Json.JsonTextReader.ParseProperty()
at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateObject(Object newObject, JsonReader reader, JsonObjectContract contract, JsonProperty member, String id)
at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType)
at Newtonsoft.Json.JsonSerializer.Deserialize[T](JsonReader reader)
at Microsoft.VisualStudio.Services.WebApi.JsonUtility.FromString[T](String toDeserialize, JsonSerializerSettings settings)
at Microsoft.VisualStudio.Services.WebApi.JsonUtility.FromString[T](String toDeserialize)
at Microsoft.VisualStudio.Services.Agent.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
at Microsoft.VisualStudio.Services.Agent.Worker.Program.MainAsync(IHostContext context, String[] args)
Error reported in diagnostic logs. Please examine the log for more details.
- /azp/_diag/Worker_20220707-062214-utc.log
##[error]We stopped hearing from agent linuxagent-nao-7d67dd7db8-jnlnj. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Started: 7 Jul at 07:34
Duration: <1s

@BenH-Puregym

I'm also running self-hosted agents on AKS and getting exactly what is reported here: sometimes it succeeds the first time, and other times you need to run it 2-3 times for it to succeed because you keep getting the dreaded "We stopped hearing from agent" error. It appears randomly on unrelated stages running a whole range of different tasks.

@KonstantinTyukalov
Contributor

Hi @getvivekv @BenH-Puregym @markatky46, could you please share more information to help us investigate:

  • The agent version in which the problem occurred
  • A minimal YAML definition and AKS configuration that can help us reproduce the problem

If you have any private information, please send it to [email protected].

Thanks

@BenH-Puregym

@KonstantinTyukalov, it always pulls the latest; it just uses this script, and at the time of writing it's using 2.210.1.

Unsure what I can provide on your second point, because the agent is just built on Linux using the script I mentioned above with some added extras, and then deployed to AKS as a KEDA deployment. In DevOps it's completely random; there's no specific task which triggers it. It could be building a site, running a script, or even sending an HTTP request.
Let me know if you need anything else.

@mmrazik
Contributor

mmrazik commented Sep 15, 2022

@BenH-Puregym how often are you able to reproduce this? Is there anything special about your network configuration?

@Jenp

Jenp commented Sep 15, 2022

Hi all,
@mmrazik and the team are working on debugging this issue; however, we are having difficulty reproducing it. We believe a truncation of the JSON payload is causing the error, based on what we can see from the Azure DevOps Pipelines side. Is anyone able to work with us on a shared debugging effort?
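To make the truncation hypothesis concrete, here is a minimal sketch. It is in Python, not the agent's actual C#/Newtonsoft.Json code, and the message content is made up; the point is only that cutting a serialized message mid-property makes any strict JSON parser fail with a position-specific error of the same shape as the exceptions above.

```python
import json

# A made-up stand-in for the job request message the worker deserializes.
message = json.dumps({"requestId": 615941, "messageType": "PipelineAgentJobRequest"})

def parse_error_for(payload):
    """Return the JSONDecodeError raised for `payload`, or None if it parses."""
    try:
        json.loads(payload)
        return None
    except json.JSONDecodeError as err:
        return err

# The full payload parses fine; a payload cut off mid-property does not,
# and the parser reports the position where the text stops making sense --
# analogous to "Expected ':' but got: m ... position 32006" above.
print(parse_error_for(message))        # None
print(parse_error_for(message[:30]))   # a JSONDecodeError
```

This also matches the observation that the failure is random: whether the cut lands on a property name, a value, or the end of the document changes the exact message ('requestId' vs. 'jobContainer', different positions) but always crashes the worker the same way.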

@mmrazik
Contributor

mmrazik commented Sep 20, 2022

@BenH-Puregym, @getvivekv We will need some help here from you on your specific setup, as we can't reproduce the issue despite trying for several days. Looking at the code, we use a standard and very widely used library to serialize/deserialize the JSON, and I don't expect the issue to be there. More likely something in your environment triggers this, and we would need more info (e.g. are you using ScaledObject or ScaledJob?). Based on the log snippets, it almost looks like the HTTP payload has been truncated.
It would be best if you could guide me through your environment during a call. Feel free to drop me an e-mail about when/whom to include at martinmrazik (at) microsoft (dot) com

@zerowebcorp
Author

zerowebcorp commented Sep 20, 2022

@mmrazik My team reports that they are no longer seeing this error since the Azure outage forced us to reboot all the nodes in the Kubernetes cluster. The team also moved several of the critical pipelines to a VM-based agent, as this was a blocker for us for several weeks. Currently it is not reproducible. The agents were running on AKS with the Istio service mesh and mTLS enabled. The pipeline had a kubectl task, a Helm chart task, and a cmd task.

@mmrazik
Contributor

mmrazik commented Sep 20, 2022

Ok. I am closing this for now.

mmrazik closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 20, 2022
@BenH-Puregym

This is still happening, I'm afraid. It didn't happen for a couple of weeks, and now we've had multiple instances of it in the last week. I've been able to capture some more information this time.

So when a job finishes, DevOps should delete the agent so that we have short-lived agents. By the looks of the logs on this occurrence, instead of deleting the agent immediately as it does elsewhere (I can see this by looking at the logs), it failed over and over to delete the agent and eventually succeeded. In the meantime, another job on a totally different pipeline picked the same agent, and of course once the agent was deleted we got the "We stopped hearing from agent" error on the new job.
Logs from the failed agent (cluster devtools-ne-aks, pod ldevops-rqmlf--1-8hzwp):

2022-10-11T09:21:55.410363969Z stdout 2022-10-11 09:21:55Z: Running job: Build app
2022-10-11T09:30:19.765584908Z stdout 2022-10-11 09:30:19Z: Job Build app completed with result: Succeeded
2022-10-11T09:30:19.985180004Z stdout Agent exit code 0
2022-10-11T09:30:19.98571461Z stdout Cleanup. Removing Azure Pipelines agent...
2022-10-11T09:30:20.787535925Z stdout Removing agent from the server
2022-10-11T09:30:21.499530418Z stdout Connecting to server ...
2022-10-11T09:30:22.137644272Z stdout Error reported in diagnostic logs. Please examine the log for more details.
2022-10-11T09:30:22.137702573Z stdout     - /azp/_diag/Agent_20221011-093020-utc.log
2022-10-11T09:30:22.156237083Z stdout Failed: Removing agent from the server
2022-10-11T09:30:22.178472936Z stderr Agent "ldevops-rqmlf--1-8hzwp" is running a job for pool "ubuntu-pool"
2022-10-11T09:30:22.227043288Z stdout Retrying in 30 seconds...
2022-10-11T09:30:53.019535413Z stdout Removing agent from the server
2022-10-11T09:30:53.741424587Z stdout Connecting to server ...
2022-10-11T09:30:54.254130792Z stdout Error reported in diagnostic logs. Please examine the log for more details.
2022-10-11T09:30:54.254202393Z stdout     - /azp/_diag/Agent_20221011-093052-utc.log
2022-10-11T09:30:54.270176374Z stdout Failed: Removing agent from the server
2022-10-11T09:30:54.289914798Z stderr Agent "ldevops-rqmlf--1-8hzwp" is running a job for pool "ubuntu-pool"
2022-10-11T09:30:54.309693622Z stdout Retrying in 30 seconds...
2022-10-11T09:31:25.027092346Z stdout Removing agent from the server
2022-10-11T09:31:25.652210432Z stdout Connecting to server ...
2022-10-11T09:31:26.191664061Z stdout Error reported in diagnostic logs. Please examine the log for more details.
2022-10-11T09:31:26.191757062Z stdout     - /azp/_diag/Agent_20221011-093124-utc.log
2022-10-11T09:31:26.21034287Z stdout Failed: Removing agent from the server
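The failure mode in these logs can be sketched as a retry loop (hypothetical Python; `server_says_busy` stands in for the server-side check that keeps answering "Agent ... is running a job"). Each failed removal attempt widens the window during which the pool still advertises the agent and a fresh job can be scheduled onto it before deletion finally lands:

```python
RETRY_DELAY_S = 30  # matches "Retrying in 30 seconds..." in the logs above

def remove_agent(server_says_busy):
    """Retry agent removal until the server stops reporting the agent as busy.

    `server_says_busy` yields one boolean per removal attempt. Returns
    (attempts, race_window_seconds), where the window is the time the agent
    stays registered and can still be handed a new job.
    """
    attempts, window = 0, 0
    for busy in server_says_busy:
        attempts += 1
        if not busy:
            return attempts, window   # removal finally succeeded
        window += RETRY_DELAY_S       # failed; agent stays listed while we wait
    raise RuntimeError("gave up removing agent")

# Three failed attempts then success, as in the excerpt above:
print(remove_agent(iter([True, True, True, False])))  # (4, 90)
```

Under this sketch, the three failed rounds at 09:30:22, 09:30:54, and 09:31:26 leave at least a 90-second window in which another pipeline could pick the doomed agent.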


LeaCCC commented May 24, 2024

This issue still happens with a self-hosted Windows agent. I just logged my bug here: #4813
