[BUG]: High rate of "We stopped hearing from agent" errors for web-platform-tests. #4313

Open
1 of 4 tasks
jgraham opened this issue Jun 6, 2023 · 5 comments

Comments

@jgraham

jgraham commented Jun 6, 2023

What happened?

Since approximately May 16th, we've been experiencing a high failure rate for web-platform-tests jobs running on macOS 13. This appears to be an infrastructure issue, as we get a message indicating that the agent stopped responding. It affects some, but not all, jobs, and it appears to be random within a set of jobs running similar workloads (chunks of the testsuite) on macOS. It doesn't appear to be tied to a specific part of the workload (e.g. a specific testcase).

One of the first affected builds is: https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=100660. A recent one is https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=101901

Manually rerunning the failed jobs does work, but some jobs require multiple reruns, since the problem can also happen during the rerun.

We've tried to resolve the problem in the following ways:

  • Enabling automatic retries in the pipeline configuration. Either we got the configuration wrong, or these jobs are not retried.
  • Making each job smaller (i.e. running fewer tests per job). This didn't have any impact.
  • Testing on macOS-12 rather than 13. The problems started shortly after an update, but are apparently still reproducible on the older OS release (and using the latest version is important for our use case).

(cc @gsnedders, who did most of the diagnosis work to date)

web-platform-tests/wpt#40085 is the corresponding wpt repository issue.

Versions

macOS-13

Environment type (Please select at least one environment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operating system

No response

Version control system

No response

Relevant log output

##[error]We stopped hearing from agent Azure Pipelines 11. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Pool: Azure Pipelines
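
To check whether the failures cluster on particular agents, the error line above can be tallied across downloaded job logs. The following is only a rough sketch: the regular expression is derived from the single log line quoted in this issue, and the function name `lost_agents` and the sample lines are hypothetical. The exact wording may differ in other Azure Pipelines versions.

```python
import re
from collections import Counter

# Pattern derived from the error text quoted in this issue. That other
# versions of Azure Pipelines use the same wording is an assumption.
AGENT_LOST = re.compile(
    r"##\[error\]We stopped hearing from agent (?P<agent>.+?)\. Verify"
)


def lost_agents(log_lines):
    """Count "stopped hearing from agent" errors per agent name."""
    counts = Counter()
    for line in log_lines:
        match = AGENT_LOST.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


# Hypothetical sample: the first line is shortened from the log above,
# the last one is an unrelated error that should be ignored.
sample = [
    "##[error]We stopped hearing from agent Azure Pipelines 11. Verify the agent machine is running.",
    "Pool: Azure Pipelines",
    "##[error]Some unrelated task failure",
]

for agent, count in lost_agents(sample).items():
    print(f"{agent}: {count}")
```

A roughly even spread across agent names would be consistent with the problem being random rather than tied to specific machines.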
@DmitriiBobreshev
Contributor

Hi @jgraham, thank you for the feedback. Based on the error message, the issue is not related to the agent itself, but to the MS-hosted pool. Could you please create an issue in the runner-images repository?

Also, to speed up the process, you could create a ticket on the dev community.

@Blue101black

Hi @jgraham, did you manage to get any resolution for this?

We are having the same issue on Windows-2022. It's very annoying because it's inconsistent and a re-run doesn't always fix it.

@ryanps1

ryanps1 commented Nov 7, 2023

We are also experiencing this issue with the Microsoft Hosted Ubuntu pools (I've tried them all).

@Az8th

Az8th commented May 2, 2024

We had this problem occurring for several months, and it was fixed by simply turning off auto-updates for agents.

I caught the agent trying to download and install a previous version (the one packaged with its corresponding Azure DevOps version). It seems there is an undocumented behaviour where failing tasks trigger a rollback if the agent was downloaded from a source other than Azure (like GitHub).

Hope it fixes your issue too ;)

@patrick-13x

> We had this problem occurring for several months, and it was fixed by simply turning off auto-updates for agents.
>
> I caught the agent trying to download and install a previous version (the one packaged with its corresponding Azure DevOps version). It seems there is an undocumented behaviour where failing tasks trigger a rollback if the agent was downloaded from a source other than Azure (like GitHub).
>
> Hope it fixes your issue too ;)

How do you manage to turn off auto-updates on Azure DevOps Server 2022?
