[INVESTIGATION] Unusual Network Spikes and Disk Activity on ezid-stg Environment #837
Comments
I'm having a hard time determining what could have caused this, since the logs, traffic, and other indicators don't show anything abnormal around the 9:00-9:30 timeframe when we saw the alert. I looked into the OpenSearch logs and reports, as well as journalctl for system daemon tasks. I don't know why we were seeing disk queues and, apparently, swap space usage on stg01 around that time (according to Librato). Traffic was quite low then, and there were a couple of blocked and non-blocked requests from bots, but that happens constantly. I also don't see any abnormal daemon processes in the journalctl logging. Perhaps something else was happening at the system level? I don't see any indication of abnormal app operation around that time. Or perhaps some kind of memory leak was using too much memory? 🤷‍♂️
Per team discussion, alerts and automated intervention functioned as expected. A health check occurs once every minute. In the event of an unhealthy host, it then conducts more frequent checks (every 10 seconds). If those checks fail three times, the host is flagged as unhealthy; it returns to healthy status once five consecutive checks pass. This sequence and resolution were observed for this incident. As long as we do not see a recurrence that degrades all hosts (rather than failing over via the high-availability implementation), no action is needed.
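The check sequence above (1-minute normal interval, 10-second fast checks, 3 consecutive failures to flag unhealthy, 5 consecutive successes to recover) can be sketched as a small state machine. This is an illustrative model of the described behavior, not the load balancer's actual implementation; all names here are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as described in the team discussion above.
NORMAL_INTERVAL_S = 60      # routine check: once per minute
FAST_INTERVAL_S = 10        # faster checks after a failure
UNHEALTHY_THRESHOLD = 3     # consecutive failures -> flagged unhealthy
HEALTHY_THRESHOLD = 5       # consecutive successes -> back to healthy

@dataclass
class HostHealth:
    healthy: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0

    def record(self, check_passed: bool) -> None:
        """Update state from one health-check result."""
        if check_passed:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if not self.healthy and self.consecutive_successes >= HEALTHY_THRESHOLD:
                self.healthy = True
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.healthy and self.consecutive_failures >= UNHEALTHY_THRESHOLD:
                self.healthy = False

    @property
    def check_interval_s(self) -> int:
        # Healthy hosts with no recent failures are checked once a minute;
        # otherwise checks run every 10 seconds.
        if self.healthy and self.consecutive_failures == 0:
            return NORMAL_INTERVAL_S
        return FAST_INTERVAL_S
```

Under this model, the observed incident corresponds to one host crossing the failure threshold, then recovering after five clean checks, while the other target kept serving traffic.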
Nagios alerts show a possible recurrence on ezid-stg on 2025/03/17 at 15:32:
Need to inspect logs to see whether a similar traffic spike occurred or was the trigger. The health check passed 10 minutes later, at 15:42 on 2025/03/17.
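One way to do the log inspection mentioned above is to bucket access-log entries per minute and look for a spike around 15:32. This is a hypothetical helper, not part of EZID; the regex assumes Apache/Nginx-style timestamps like `[17/Mar/2025:15:32:01 +0000]` and would need adjusting for the actual log format.

```python
import re
from collections import Counter

# Capture the timestamp down to the minute, e.g. "17/Mar/2025:15:32".
TS_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})")

def requests_per_minute(lines):
    """Count access-log requests per minute; returns a Counter keyed by minute."""
    counts = Counter()
    for line in lines:
        m = TS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

With a log file open, `requests_per_minute(f).most_common(5)` would surface the busiest minutes to compare against the 15:32 alert.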
System status on March 17 around 3:30pm:
We may want to add more restrictions on access. Requests in the past 7 days show a few spikes; the 3/17 3:30pm one was relatively low. Requests on March 17: most were from two IPs, on /contact and /search:
Sample searches from 34.28.58.52
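The per-IP breakdown above can be reproduced by tallying requests by (client IP, path). This sketch assumes Apache/Nginx combined log format; the function name and field positions are illustrative, not from the EZID codebase.

```python
import re
from collections import Counter

# Leading IP, then the request line: method and path.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def top_ip_paths(lines, n=10):
    """Return the n most frequent (client IP, request path) pairs."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts.most_common(n)
```

Running this over the March 17 access log should confirm whether the two IPs hitting /contact and /search account for the spike.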
Summary:
On 2024/02/18, ezid-stg experienced abnormal traffic with two distinct network spikes followed by disk I/O issues. Despite one target becoming unhealthy, the service remained available because of load balancer redundancy.
Timeline and Observations:
Investigation Needed:
Mitigation:
Consider moving ezid-stg behind our VPN, similar to our dev environment.