Sites not responding because of high thread count #72
Comments
Had 2 sites go down today, both had high thread counts (500+ and 100+) on the w3wp process, but the Looking further, we have the |
BTW On the second site that was not responding, I saw the thread count go up from 100 to 160 in a couple of minutes. I've also seen some weird things with log files, where the latest files were not in As we clean all log4net files older than 3 months in a background task ( |
Hi @ronaldbarendse thank you for reporting your issue. |
Looks like it isn't the So it might be the |
Hi @ronaldbarendse that would be awesome, we are very interested in any issues you may encounter so we can help prevent it in the future. |
After a lot of testing, we isolated the problem to having |
@ronaldbarendse that is interesting. We use PostProcessor on all our sites. Can you post updates if you get it fixed in PostProcessor? |
@sniffdk also experienced this issue (site unavailable and thread count at 572), but isn't using We've recently also seen IIS Express sometimes randomly stop when developing sites locally, so it might be introduced in a recent Umbraco version. All speculation, but maybe all this can be tied together and point in the right direction. @KennethJakobsen: is it possible for the support team to check all active requests/threads when this issue occurs, so we at least know what's running in all these threads (e.g. using
Ideally, just to make sure it's not us (Core or Cloud) doing something weird, I would love to get a minidump of a site with 500+ (or even 100+) threads. Just ping me if you can do it. Thanks! |
@zpqrtbnk I've contacted Eric because we have a site that's still in development go down today (500+ threads): we agreed he would let the developers have a look and not restart it. So talk to him and please go ahead with making a minidump! |
I've reproduced the crashing IIS Express process (multiple times, on multiple PCs and with different Umbraco projects): just start the website in debug mode (F5), let the site start up and show the frontpage, resave the
Looking at the thread specified in the error, it's running the following code every time: We also noticed the process crash most of the time after changing document types or doing a deploy (using the marker file in @zpqrtbnk Can you try to reproduce this? It might be related, don't you think? |
@zpqrtbnk The problems above are caused after updating |
We're still experiencing sites going down randomly (all with high thread counts) and have already excluded the ImageProcessor Umbraco support reported on May 2nd they were seeing an increase in 503 errors because of an issue in their server setup and were working on a fix. The next scheduled maintenance was on May 13th (https://status.umbraco.io/incidents/t3vhcwkvz029), but that either did not include the fix or at least not one that fixed the problem. Also note this isn't related to the incident on May 3rd (https://status.umbraco.io/incidents/1wwq67wmx7q0), as that's related to an Azure DNS issue (R50C-5RZ). Looking further, I noticed the
As they're both wrapped in the |
I'm unsure about the thread usage of the loggers we are adding, but I'm very interested in your findings. On Cloud we do run some forced transforms, but if you'd like to, you could try to roll with the following log4net.config file in your site. It will prevent the custom loggers from being added, and only run with what is defined. The following could be used:
You can test it, but with the disclaimer that it is totally unsupported, as files will no longer be written async, and the "error logs" feature in the Cloud portal will also no longer work. Any chance you could test it out? Also, do you have any sites running on version 8, and do you know if they have the same issues? We switched to Serilog for version 8.
After having major troubles with a website heavily using Examine, I've seen the thread count go up as soon as a re-index started (almost all threads were in waiting state). Especially when the re-indexing takes a while (e.g. when the site/server is under load or has a lot of items), the thread count keeps increasing and the requests time-out (probably because it waits for the index to become available and that takes too long). Looking at the Examine source, the And according to the Microsoft documentation:
So this could be the cause of the increasing threads (in waiting state) while the site isn't responding... I've now added Not sure what causes the actual time-outs though, but I've seen lots of
@Shazwazza Any thoughts on this? |
I'll try this out when I can! I've not seen an overview of all forced transforms in the Cloud documentation (already talked about this with @sofietoft, but she needs input from developers first) or a way to disable these... Going from your example, the log4net.config transform adds the
All problems are with Umbraco 7 sites (7.12.4 - 7.14.0). Version 8 might solve these problems (different logger, updated Examine/Lucene version, etc.), but in testing we've found a lot of problems and are waiting for these to be fixed in 8.1...
Hi @ronaldbarendse thanks for the info - I can try to see if I can determine whether that is causing any issues. The thing is that this code has been like this for many, many years without reports of issues, but stranger things have happened - and we all probably know a lot more than we did many years ago ;) Also, why is it re-indexing and doing so during heavy load? Is this normal re-indexing or something custom?
@ronaldbarendse have added this to examine repo so i can track it there Shazwazza/Examine#125 |
I started the re-index as a last resort, as it might have been corrupted with all the environment restarts, The randomness of this issue makes it very hard to debug/diagnose... As of now, my guess would be this is something in the logging (as UC has a custom logger: |
FYI: I've found another possible component we use (and may be specific to our sites) that might be responsible for the waiting/blocking threads: logzio/logzio-dotnet#34.
@ronaldbarendse I've done some research and added notes here to the Examine question Shazwazza/Examine#125 (comment) As it is now i can't see this being an issue with the Examine queue. Regarding this
Yes, unfortunately in v7, if indexes are not there and they need to be rebuilt, the application will block until they are done, because indexes are semi-critical to looking up media on the front-end. In v8 this is not the case. I'm unsure if this index rebuilding you are referring to is on startup, but if it is and you have a busy site, then yes, request threads will continue to increase until they are served, and they won't be served until startup completes. It's not ideal, but that's why media is no longer cached in Lucene in v8 and index rebuilding, if it's needed, is queued to a background thread after the application starts. You'll definitely not want to run
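To make the blocking behaviour described above concrete, here is a minimal sketch in Python (not Umbraco or Examine code). It assumes a simplified model in which every incoming request gets its own thread and waits on one shared "index ready" signal. The function name `simulate_blocked_requests` and its parameters are hypothetical and only serve the illustration.

```python
import threading
import time


def simulate_blocked_requests(request_count, rebuild_seconds):
    """Return (peak_waiting, served) for requests that block on an index rebuild."""
    index_ready = threading.Event()      # set once the (simulated) rebuild is done
    arrived = threading.Semaphore(0)     # lets the main thread count arrivals
    lock = threading.Lock()
    state = {"waiting": 0, "peak_waiting": 0, "served": 0}

    def handle_request():
        with lock:
            state["waiting"] += 1
            state["peak_waiting"] = max(state["peak_waiting"], state["waiting"])
        arrived.release()
        index_ready.wait()               # the request thread sits in a waiting state
        with lock:
            state["waiting"] -= 1
            state["served"] += 1

    threads = [threading.Thread(target=handle_request) for _ in range(request_count)]
    for thread in threads:
        thread.start()
    for _ in range(request_count):
        arrived.acquire()                # wait until every request has arrived
    time.sleep(rebuild_seconds)          # stand-in for the index rebuild
    index_ready.set()                    # rebuild finished: release all requests
    for thread in threads:
        thread.join()
    return state["peak_waiting"], state["served"]


if __name__ == "__main__":
    peak, served = simulate_blocked_requests(request_count=50, rebuild_seconds=0.2)
    print(f"peak waiting threads: {peak}, served after rebuild: {served}")
```

In this model the number of waiting threads grows with every request that arrives during the rebuild and only drops once the rebuild completes. That matches the pattern of a rising thread count with almost all threads in a waiting state, but it does not show that the index rebuild is the actual cause here.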
I just wanted to update this issue, as we're still seeing sites go down at random. As HQ has now added monitoring and automatically restarts unresponsive sites within 20 minutes, the downtime is at least kept to a minimum. As I don't have the permissions to create a minidump of an unresponsive site on UC (and we don't see this issue occurring outside of Umbraco Cloud, e.g. not locally and on other - more traditional - hosting setups), the root issue of all these waiting threads is still unknown.
We've had multiple sites go down because of an IIS 'full hang': All requests to your application are very slow or time out. Symptoms include detectable request queueing, and sometimes 503 Service Unavailable errors when queue limits are reached.
The KUDU portal is still reachable when the site is not responding and looking in the process explorer, the thread count is 500+ (normal running sites have around 45-50 threads).
These hangs started in the beginning of March and seem to be related to the ImageProcessor trimCache setting that is now automatically transformed to true in the Umbraco Cloud deploy script. After manually setting this back to false (which has been the default for over 2 years in the Umbraco CMS codebase: https://github.com/umbraco/Umbraco-CMS/blob/853087a75044b814df458457dc9a1f778cc89749/build/NuSpecs/tools/cache.config.install.xdt), none of the sites had any downtime or high thread counts. After publishing/deploying an update to the site, this change is undone and therefore this is not a good work-around.
Reproduction
Have trimCache=true in the ImageProcessor cache configuration (config\imageprocessor\cache.config), request images that must be processed to trigger the cache trimming and wait for the thread count to go to the limit, so IIS doesn't have available threads to process new incoming requests...
I've come this far, but as I can't easily reproduce the issue, I am not sure this is exactly what is causing the downtime. If this can be verified, the issue should be added to the ImageProcessor project...
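For the "request images that must be processed" step, one possible approach is to request many distinct size variants of the same image, so each request needs fresh processing instead of a cached result. The sketch below only builds such URLs, using the ImageProcessor.Web width, height and mode=crop query string parameters. The helper name `image_variant_urls`, the host and the media path are hypothetical placeholders.

```python
from urllib.parse import urlencode


def image_variant_urls(base_url, image_path, widths, heights):
    """Build one URL per (width, height) pair so each request needs a new crop."""
    root = base_url.rstrip("/")
    path = image_path.lstrip("/")
    urls = []
    for width in widths:
        for height in heights:
            # Query string parameters understood by ImageProcessor.Web
            query = urlencode({"width": width, "height": height, "mode": "crop"})
            urls.append(f"{root}/{path}?{query}")
    return urls


if __name__ == "__main__":
    # Hypothetical site and media path; replace with a real image on the test site.
    for url in image_variant_urls(
        "https://example.test", "/media/1001/photo.jpg", range(100, 400, 100), [50, 75]
    ):
        print(url)
```

Fetching these URLs (for example with a load-testing tool) while watching the thread count in the Kudu process explorer could help verify whether cache trimming is involved. Whether this actually triggers the hang is untested.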
For now, it would be nice to: