In certain cases UI test hangs indefinitely instead of fails (OSOE-880) #387

Closed
sarahelsaig opened this issue Jul 8, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@sarahelsaig
Member

sarahelsaig commented Jul 8, 2024

I first experienced this on OCC during the OC 2.0 update, so I assumed it was somehow related to OC 2.0. But now it has happened with TDEAL as well. (I still don't know what hung OCC; for TDEAL I know it's just the same visual verification Chrome footer thing we had on OSOCE and elsewhere.)

Jira issue

@github-actions github-actions bot changed the title In certain cases UI test hangs indefinitely instead of fails In certain cases UI test hangs indefinitely instead of fails (OSOE-880) Jul 8, 2024
@Piedone Piedone added the bug Something isn't working label Jul 8, 2024
@Piedone
Member

Piedone commented Jul 8, 2024

Related/same? #228 (and Lombiq/Open-Source-Orchard-Core-Extensions#736).

Did you try configuring dotnet-test-process-timeout so the timeout is handled for the test process instead of the whole workflow? (Or even the per-test timeouts.)

@sarahelsaig
Member Author

> Related/same? #228 (and Lombiq/Open-Source-Orchard-Core-Extensions#736).

The examples I mentioned happened on Ubuntu and restarting did not fix them, so I don't think #228 is related.

> Did you try configuring dotnet-test-process-timeout so the timeout is handled for the test process instead of the whole workflow? (Or even the per-test timeouts.)

Thanks, I will try that.

@Piedone
Member

Piedone commented Jul 9, 2024

Despite the title of that issue, this happened many times under Ubuntu too (though at first it seemed to be Windows-only). But yeah, that's about random hangs, not consistent ones. That looks more like an app-specific issue.

@sarahelsaig
Member Author

Adding dotnet-test-process-timeout magically fixed the problem in OCC (OrchardCMS/OrchardCore.Commerce#454). What does that mean?

@Piedone
Member

Piedone commented Jul 10, 2024

image

This can fix the issue if control gets to this line:

https://github.com/Lombiq/GitHub-Actions/blob/813f2ed0586dce428250d502377e34c17884f2b7/.github/actions/test-dotnet/Invoke-SolutionTests.ps1#L215

Because with this, if the tests complete but the process hangs, the run can succeed. However, the telltale message is not in the output, so this didn't run, and the previous run you linked wasn't hanging after the test run had completed (i.e. all tests produced their outputs) but somewhere before that.
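To make the "tests complete but the process hangs" case concrete, here is a minimal Python sketch of a process-level timeout. It is not the logic of Invoke-SolutionTests.ps1; the function name `run_with_process_timeout` and the `done_marker` parameter are hypothetical, and the marker stands in for whatever message the test runner prints when all tests have finished.

```python
import subprocess
import threading


def run_with_process_timeout(command, done_marker, timeout_seconds):
    """Run `command` and return (succeeded, output_lines).

    Assumption for this sketch: if the process is still alive after
    `timeout_seconds` but has already printed `done_marker`, the tests are
    considered finished, so the process is killed and the run counts as passed.
    """
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    lines = []

    def pump_output():
        # Collect output on a separate thread so a full pipe can't block the child.
        for line in process.stdout:
            lines.append(line.rstrip("\n"))

    reader = threading.Thread(target=pump_output, daemon=True)
    reader.start()

    try:
        exit_code = process.wait(timeout=timeout_seconds)
        reader.join()
        return exit_code == 0, lines
    except subprocess.TimeoutExpired:
        # The process outlived the timeout: kill it, then decide based on the output.
        process.kill()
        process.wait()
        reader.join()
        return any(done_marker in line for line in lines), lines
```

Calling it with a command that prints the marker and then never exits returns `True` once the timeout elapses, while a command that hangs before printing the marker returns `False`, which matches the distinction drawn above between hanging after and before the test run completes.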

BTW there are a huge number of exceptions in the workflow output, I suggest checking these out, e.g.:

2024-07-10T00:14:22.5934506Z  2024-07-10 00:13:55.5118|Default|00-0b8735f889ef785eacdbe7443da03d71-387064e610ec3783-00||Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware|WARN|The response has already started, the error page middleware will not be executed. 
2024-07-10T00:14:22.5939726Z  2024-07-10 00:13:55.5118|Default|00-8b46da17ed6f9ef1aabca0f86ea1be6c-c7c3e103a1df7f90-00||Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware|ERROR|An unhandled exception has occurred while executing the request. System.InvalidOperationException: Two concurrent threads have been detected accessing the same ISession instance from: 
...

And a lot more, even Shouldly ones, so this really should've failed (even though it completed).

@sarahelsaig
Member Author

> Don't question it.

I would never. All praises to the Omnissiah!!

> BTW there are a huge number of exceptions in the workflow output, I suggest checking these out, e.g.:

Yes, that's why I said that the test hangs instead of failing. (BTW on TDEAL it only hangs with the dev build that uses the standard runner; in PRs, if a test fails, the run correctly stops.) This dotnet-test-process-timeout is really useful, because now I can see the errors (unlike previously) and I can address them.

@Piedone
Member

Piedone commented Jul 10, 2024

I see. Perhaps this is not actually a hang, then? But rather, it runs retries of a lot of tests, which is slow, and it just times out? With 6 hours in TDEAL that would be extreme, but not impossible (on the slow 2-core HDD default runner; OCC, as a public repo, uses 4-core SSD runners by default).

@sarahelsaig
Member Author

I'm 100% certain it's actually a hang. With TDEAL I know that only the visual verification test failed. Just one test. The same run on buildjet only took 12.5 minutes. So if it took up to an hour with the standard runner I could stomach that, but 6 hours is not possible.

@Piedone
Member

Piedone commented Jul 10, 2024

Then I guess it can hang due to some threads deadlocking with just the two cores. We've seen issues like this before, and an ASP.NET Core sync-over-async issue, still unfixed (linked somewhere in the other issue I linked), can cause this.

@Piedone
Member

Piedone commented Jul 10, 2024

Nothing else to do here then, though?

@Piedone
Member

Piedone commented Jul 10, 2024

As part of NEST-501 I also experienced this: the DotNest UI tests didn't produce any output here and the workflow timed out after an hour. After adding dotnet-test-process-timeout: 600000 I could see the actual failing test.

@sarahelsaig
Member Author

So you also had problems with security scanning. In OCC as well, after filtering out the expected error testing, all other error logs were from the full security scan saying "InvalidOperationException System.InvalidOperationException: Two concurrent threads have been detected accessing the same ISession instance". I've removed the dotnet-test-process-timeout and temporarily disabled the test, and now the run passed (with no |ERROR| in the log) in 8 minutes. I think the security test is accidentally stress testing the runner or YesSql's thread safety by starting many requests (nearly) concurrently.

@Piedone
Member

Piedone commented Jul 11, 2024

It's expected for the security scan to start concurrent requests, though the goal is not stress testing and I believe the rate can be adjusted if necessary. However, concurrent requests mustn't result in such a YesSql exception: that only happens if two threads use the same ISession, not if they simply access the DB independently. This is something to avoid. Each request uses its own ISession, so concurrent requests in themselves shouldn't cause this (unless some singleton service keeps using the same ISession across multiple requests, which should be avoided too).
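The difference between a shared session and per-request sessions can be sketched in Python as follows. This is a toy model and not YesSql's implementation: the `Session` class, its error message, and `handle_two_requests` are all hypothetical, and only mimic the idea of a unit of work that detects use from two threads at once.

```python
import threading


class Session:
    """Toy stand-in for a non-thread-safe unit of work (not YesSql's ISession)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._user = None

    def __enter__(self):
        with self._guard:
            if self._user is not None:
                raise RuntimeError("two concurrent threads are using the same session")
            self._user = threading.get_ident()
        return self

    def __exit__(self, *exc_info):
        with self._guard:
            self._user = None
        return False


def handle_two_requests(session_for_request):
    """Run two overlapping 'requests' and return the error messages they raised."""
    first_entered = threading.Event()
    second_finished = threading.Event()
    errors = []

    def first_request():
        with session_for_request():
            first_entered.set()
            # Keep the session in use while the second request runs.
            second_finished.wait(timeout=5)

    def second_request():
        first_entered.wait(timeout=5)
        try:
            with session_for_request():
                pass
        except RuntimeError as error:
            errors.append(str(error))
        finally:
            second_finished.set()

    threads = [threading.Thread(target=first_request),
               threading.Thread(target=second_request)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

Passing `Session` itself as the factory gives each request its own instance and yields no errors, while passing `lambda: shared` for a single `shared = Session()` (the singleton-service case) yields the concurrency error for the second request.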

@Piedone
Member

Piedone commented Jul 15, 2024

Under NEST-501 I discovered that security scanning maxing out the CPU (and maybe also RAM) of the runner can cause dotnet test to hang. You can try Lombiq/GitHub-Actions#370 to see CPU and RAM metrics of runs and try to correlate that with issues you encounter (as a run with an oversaturated CPU can do all kinds of funny things).

I still don't think there's a general issue here, maybe some documentation.

@Piedone
Member

Piedone commented Jul 17, 2024

And if there is a general issue, we should get back to #228.

@Piedone
Member

Piedone commented Aug 2, 2024

So, closing, since there's no new general issue.

@Piedone Piedone closed this as not planned Aug 2, 2024