wecsvc stops working after a while #35
I had/am having similar issues; about 2.5k stations, and occasionally the collector just stops; the process is still running, but it's not doing anything. I've attempted to work around it with the registry changes in my PR (#34) and some additional tweaking after that, but I still seem to be encountering the issue.
Thank you for this input. Did you ever experiment with the GPO feature to control events per second for each source? If yes, did it have any impact on performance on the customer side? Did you use the MaxItems set to 1?
@rewt1 I believe I set my max items per second to 10 and I don't recall seeing customer-side issues. I haven't modified any of the delivery options, though I did add some suppressions to a couple of rules. We're running 2016 as our collector (VMware, 8 CPU, 16GB, SSD SAN); I actually have three collectors set up: one for servers (about 350 clients), one for DCs (9 clients) and one for workstations (2.5k+ clients), and the workstation one is the only one exhibiting this issue. Edit: Just a note: we are utilizing Winlogbeat to forward events from these subscriptions over to our ELK cluster.
Can confirm this same issue was happening on our WEC servers. It happened on WECs with thousands of clients and WECs with just a few hundred. All our event subscriptions are consolidated into a single one for workstations, member servers, and DCs. During troubleshooting, we found that running gpupdate /force from the affected WEC would jumpstart the process and it would begin logging forwarded events again without a reboot. Running gpupdate again would (sometimes) cause it to go back into that funky state. In the end, we doubled down on memory/CPU and haven't seen the issue come back. It's definitely buggy. But hey, it's also free ;)
Thanks @jokezone for your feedback.
I am wondering if it could be possible to lower the number of parallel connections by "aggregating" the different subscriptions. I am going to perform tests to see if it reduces the volume of parallel connections on the WEC servers. Any other ideas are welcome!
@jokezone In my instance I don't think it's memory related (even though I've seen the process go as high as 6GB), but if I go higher in CPU I'm gonna start hitting NUMA boundaries on our hypervisors. I typically notice that during high intake the CPU is maxed, but once the subscriptions catch up the CPU hovers around 60%, so I'm not 100% sure what to make of that. Another thing I notice with high intake is that the WinRM queue gets full (see C:\Windows\System32\LogFiles\HTTPERR), and the queue size for WinRM is hard-coded at 1000, or at least I haven't found a way to increase it. @bluedefxx I like the idea of combining some of the subscriptions; I actually utilize the additional channels (and wrote two of my own) so that I can tag different types of events within ELK for easier searching. But if we were to combine the subscriptions that otherwise feed into the same channel (Authentication is one of them), then both those that utilize the custom channels and those that write to Forwarded Events could reduce our subscription counts (and the subsequent load on the registry + WinRM) with no change in logging behavior. Edit: I've also played around with the idea of setting up a fourth collector for specific subscriptions, figuring that if I specify multiple collectors via GPO the clients will report the requested subscriptions to those collectors; a slight increase in load on the client, but a potentially massive decrease in load on a single server. We're playing a balancing game with VM resources ATM, so I haven't pursued this idea yet, as I wanted to see what I could figure out with a single box.
I opened a case with Microsoft Premier Support due to similar issues to those others have reported here. We consistently see stability issues, with clients randomly failing to send events or the wecsvc ceasing to receive them, using the default delivery settings from the subscriptions provided here. I have also tried tweaking the delivery settings with different options, and none of the changes improved performance significantly. Microsoft's conclusion, after analyzing multiple memory dumps, traces, event logs, etc., was that the issue is due to the number of subscriptions on each Event Collection server:
I haven't had time to implement their recommendation yet, and am hesitant to, because we are currently using the additional channels for tagging within ELK and easier filtering within Winlogbeat as well.
@SpencerLN You spent money for this? Nice. It's just odd that I only seem to have the issue on my large collector; I mean, if that's the 'truth' from MS, perhaps my thinking of having additional collectors to separate subscriptions would ease the burden. I think combining the subscriptions that would otherwise write to the same channel might be the most 'efficient' thing to do... Now if only it didn't take 25 minutes for the event subscription GUI to load :)
I'm going to try combining the authentication subscriptions into a single combined one. Here's the XPath I'm going to use:
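(The original query was not captured here; below is a minimal illustrative sketch of a combined QueryList, with example event IDs rather than the repo's actual filters.)

<QueryList>
  <!-- Example only: one Query element per former subscription, each with a unique Id -->
  <Query Id="0" Path="Security">
    <Select Path="Security">*[System[(EventID=4624 or EventID=4625 or EventID=4740)]]</Select>
  </Query>
  <Query Id="1" Path="Security">
    <Select Path="Security">*[System[(EventID=4768 or EventID=4769 or EventID=4771)]]</Select>
  </Query>
</QueryList>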
@novaksam I will be very interested in how your test of combining subscriptions goes. We have a rather large WEC footprint (~20 servers) and it has been a major hassle to manage, so we are starting to look at other options given the apparent instability of WEF at higher scale. I am really hoping to be able to get WEF working, which is what led to opening the case with MSFT. I guess time will tell whether their recommendation can make a big enough impact to make our client-to-WEC ratio work.
Just wondering if you guys set the refresh interval for the subscription manager GPO?
This sets the frequency in seconds for how often the source computers query the collector for subscription updates. My environment is not nearly as large, but I have this set and have not experienced a stoppage in event forwarding. |
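(For reference, the refresh interval is carried in the SubscriptionManager GPO value itself; the collector FQDN below is hypothetical.)

Server=http://wec01.example.com:5985/wsman/SubscriptionManager/WEC,Refresh=120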
@SpencerLN Well, it appears to have survived the night, and our scheduled 7AM wakeup, so I'd call that progress. Unfortunately, the only really 'combinable' subscriptions are for authentication (5: account lockouts, authentication, explicit-credentials, Kerberos and NTLM), Windows diagnostics (2: Event-log-diagnostics, Windows diagnostics) and exploit guard (4), so this strategy can only get you so far (though it will decrease the number of active subscriptions by 8 (4+1+3) overall). I also have some suppressions I added to object manipulation because it was just too noisy.
@dstidham617 I think I have mine set to 180.
@dstidham617 I do not actually have the Refresh interval configured. Does anyone know what the default Refresh value is?
@SpencerLN I'm not certain, but the example I gave is straight from the GPO definition, so I'd presume 60 seconds.
@SpencerLN Just reporting back: simply combining the authentication subscriptions, while looking positive at first, did not resolve my problems. It seems like heavy intake periods are still killing it, even after I increased the batch items on object-manipulation to 50. I can still try to revise the exploit-guard subscriptions, since those are some heavy hitters too, but it's looking more and more like I have to move some subscriptions to another server.
Just letting everyone know, I undid my combined subscription, because I believe it was preventing events from being collected; I was seeing a lot of Kerberos and not much else, which suggested to me that only the last 'Query' in the XPath was actually being executed. I could be wrong about this and just misattributing one thing for another, but I wanted to share.
I'm going on vacation in a week, so since I'm still dealing with this problem, I wrote a PowerShell script and set it to run every 5 minutes with Task Scheduler:
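(The script itself was not captured above; here's a minimal sketch of the idea, assuming the goal is to restart wecsvc when the ForwardedEvents log has gone quiet. The 15-minute threshold is an assumption.)

# Hypothetical watchdog: restart wecsvc if no forwarded events arrived recently.
$cutoff = (Get-Date).AddMinutes(-15)
$latest = Get-WinEvent -LogName 'ForwardedEvents' -MaxEvents 1 -ErrorAction SilentlyContinue
if (-not $latest -or $latest.TimeCreated -lt $cutoff) {
    # -Force also restarts dependent services if wecsvc still shares its svchost
    Restart-Service -Name wecsvc -Force
}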
Thanks to all who've shared their experiences with this. I'll try to add a bit of my own.
Yes, I've also observed this, especially for collectors handling lots of workstations / member servers. One thing I've not seen mentioned yet (unless I missed it) is the option to tune WinRM, which wecsvc depends on. Take a look at WinRM Service Default Configuration Settings: MaxConnections defaults to 300 (or just 25 on older WinRM versions). My guess is that with a lot of WEF clients, this limit might get in the way. In my experience, our WEF server with lots of clients behaved poorly with this default, and even remote PowerShell (which also uses WinRM) was unresponsive. We never ran out of CPU or RAM, but did see the locked-thread issue. Increasing the MaxConnections limit allowed us to move on to the next problem (the system ran out of RAM). I'm not sure if each subscription requires a connection, or if clients are smart enough to use one connection for events from multiple subscriptions.
@novaksam & @SpencerLN, we're also using Winlogbeat and ran into a limitation there as well. We had a single Forwarded Events default channel collecting everything and found that, because of the way the Winlogbeat agent uses the Windows event API, it can't scale beyond a single thread. On our VMs, it couldn't process beyond ~2000 events per second despite adding more vCPU. In the Winlogbeat config, we did a nasty hack of splitting it into multiple inputs by event IDs to force extra threads (one per input). Indeed, Palantir's approach of using multiple channels instead of a single Forwarded Events channel might well be a good way to get more input threads out of Winlogbeat.
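(To check the value currently in effect on a collector:)

winrm get winrm/config/service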
@JPvRiel Thank you for the updated info. Can I ask what values you had success configuring MaxConnections to? Did you hit some sort of upper limit on how high you could set it before it began having a negative impact?
@SpencerLN, TL;DR: eventually I think we ran out of memory... With some very rough math/assumptions at play, there seems to be ~2MB of overhead per log-source computer for WinRM + wecsvc. But there might also be some threading lock-contention issue (TBD). I bumped WinRM up to 10000 max connections on servers with 2x subscriptions defined on each:
winrm set winrm/config/service '@{MaxConnections="10000"}'
Some success on one server, but not the other.
So the limitation looks to be the number of subscribed computers and memory usage, not necessarily how many events come in. I'm also guessing there might be some lock contention when trying to handle too many connections; I've noticed this before the VM becomes unresponsive (can't remote PowerShell or RDP). From metricbeat, I can see the resource usage climbing (charts not captured here). Given that core Windows services like WinRM and wecsvc run as shared-process services in svchost.exe, it's hard to isolate their resource usage. I did attempt to test and isolate wecsvc into its own process.
@bluedefxx, I checked, and as far as I can see, Windows eventing does re-use the same WinRM session for multiple subscriptions (despite 2 subscriptions being set up, I see only one connection for a given source computer), so splitting the config into multiple subscriptions doesn't seem to add to my connection-limit issue.
@JPvRiel thanks for this; I answer in the name of @bluedefxx. Thank you so much for this great analysis.
The odd thing with the MaxConnections setting is that collection works without changing it, but stops after a while (sometimes 2 days, sometimes 1 week...) despite the limit of 300 (or even 10000) being exceeded multiple times. This assumption comes from the fact that we have seen 80,000+ connections in the ESTABLISHED state (counting with netstat on the WEC server itself) for 4000+ workstations... So count on me to give your setting a try, but I don't understand how it could have an impact (your new error on RAM is, by the way, a good indicator of "something" changing)... I don't understand why Microsoft does not improve the documentation on the WinRM parameters, as I am sure many people are impacted... and it clearly looks like a bug.
@rewt1, thanks for validating/correcting some of my assumptions (perhaps my test with just 2 subscriptions for 1 source was too limited; e.g. it might have had events for only one of the subscriptions and wasn't actively sending on the other...).
Given the above, the approach taken by Palantir of splitting the collection into multiple subscriptions might be inefficient, in that it possibly overwhelms WinRM with too many connections. It'd be painful to refactor the project to combine subscriptions/channels, though (I was about to split my own and now won't).
I've used this to inspect WinRM connections:
# Count TCP connections on the WinRM HTTP listener port (5985), grouped by state
$WinRMPortConnectionStates = Get-NetTCPConnection -LocalPort 5985
$WinRMPortConnectionStates | Group-Object -Property State | Select-Object -Property Name, Count | Format-Table
If so, that implies far more concurrent connections than the configured maximum (roughly 20 per source computer, going by @rewt1's numbers), so this is likely a bug in the WinRM service not applying the limit as documented.
Hello guys. As far as I am concerned, the script from @novaksam solved my problem, thank you 👍 BUT: if you use this script, make sure to isolate the wecsvc service first. By default it is shared with others such as "DNS Client" or "Network Location Awareness", so if you restart all of these services at once you break the machine and crash the WinRM service. To isolate the wecsvc service:
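(The command was not captured above; per Microsoft's WEF guidance, moving the service into its own process looks like this:)

sc config wecsvc type= own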
Then restart the Windows Event Collector service. As @JPvRiel said, once you do that, the WEC service stops working. This is due to HTTP.SYS URL ACLs, which break when you isolate the wecsvc service from svchost. It is well explained in this KB: as the KB explains, you need to fix the ACLs with the following commands after having isolated the wecsvc service from svchost:
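(The commands were not captured above; per Microsoft's documentation, they re-grant the WinRM and Wecsvc service SIDs access to the WSMan URLs. Verify the SDDL strings against the KB before running.)

netsh http delete urlacl url=http://+:5985/wsman/
netsh http add urlacl url=http://+:5985/wsman/ sddl=D:(A;;GX;;;S-1-5-80-569256582-2953403351-2909559716-1301513147-412116970)(A;;GX;;;S-1-5-80-4059739203-877974739-1245631912-527174227-2996563517)
netsh http delete urlacl url=https://+:5986/wsman/
netsh http add urlacl url=https://+:5986/wsman/ sddl=D:(A;;GX;;;S-1-5-80-569256582-2953403351-2909559716-1301513147-412116970)(A;;GX;;;S-1-5-80-4059739203-877974739-1245631912-527174227-2996563517)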
After that, you need to restart both WinRM and Wecsvc, and you should have a working and stable WEC service 😄
I just stumbled across this best practices for EventLog forwarding article from Microsoft. It touches on a lot of the topics discussed in this thread. It gives a recommendation of no more than 4k clients per collector server, whereas other guidance has suggested it could support 10k (with the right amount of resources).
Hey folks. So I upgraded to Server 2022, and now the WEC service will gobble up memory after running for an extended period. I've updated my monitor script to account for this:
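(The updated script was not captured above; a minimal sketch of the memory check, assuming wecsvc has been isolated into its own process and using an assumed 8GB ceiling:)

# Hypothetical addition: restart wecsvc when its process exceeds a memory ceiling.
$svc = Get-CimInstance Win32_Service -Filter "Name='wecsvc'"
if ($svc -and $svc.State -eq 'Running') {
    $proc = Get-Process -Id $svc.ProcessId -ErrorAction SilentlyContinue
    if ($proc -and $proc.WorkingSet64 -gt 8GB) {
        Restart-Service -Name wecsvc -Force
    }
}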
Hello,
We are encountering a strange behavior on a Windows 2012 Event Collector.
This server has 8 vCPUs and 20GB of RAM; monitoring it does not show specific usage peaks.
NXLog is used to fetch logs and send them to a SIEM.
We are using part of the wef-subscriptions (~40 of them) on 4000+ workstations (customized to filter at the source).
For some (unknown) reason, the Event Collector randomly stops "working" after sometimes 2 days, 3 days, 7 days... By "stops working" I mean the service is still running but no events are coming into Forwarded Events...
Regarding deployed subscriptions, the only modifications we performed were:
25000
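(Which delivery element carried the 25000 above was not captured; for illustration only, a batching block in a subscription XML looks like the following, with example values:)

<Delivery Mode="Push">
  <Batching>
    <MaxItems>50</MaxItems>
    <MaxLatencyTime>25000</MaxLatencyTime>
  </Batching>
  <PushSettings>
    <Heartbeat Interval="3600000"/>
  </PushSettings>
</Delivery>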
We also tried to perform some of the customization recommended in "Windows™ Event Forwarding (WEF) into ArcSight at Scale".
We tried to correlate this issue with user activity, but there does not seem to be any link; the last time it stopped sending logs was at 6 in the morning...
The analysis we performed showed the following behavior:
Sometimes we are able to restart the service, sometimes we need to kill it before restarting.
I have multiple questions on this:
Your configs use MaxItems set to 1; do we agree that performance should be better with a little buffering (as we are doing by setting 25 or 50)?
A pro of multiple subscriptions is the capacity to manage ACLs separately, but it also creates 4000×40 registry keys to maintain state in the registry; could this have any impact?
Regarding having 40 dedicated subscriptions, couldn't this have an impact on the number of parallel network connections initiated by the sources? (I mean, opening 40 parallel TCP sockets instead of just 1, times 4000?)
This is a very interesting topic, as large-environment deployment, tuning and troubleshooting are not well (not to say not at all) documented by MS...
Any feedback, ideas, and answers on this issue would be more than appreciated and would, I assume, help the community!!!
Kind regards,