[INVESTIGATION] Unusual Network Spikes and Disk Activity on ezid-stg Environment #837

Open
adambuttrick opened this issue Feb 18, 2025 · 5 comments
Labels: incident (A summary of an incident - what happened and how we handled it), Infrastructure (VMs, network and other infrastructure activities), troubleshooting (Operational issues that require troubleshooting and future reference)

Comments

@adambuttrick

Summary:
On 2025/02/18, ezid-stg experienced abnormal traffic, with two distinct network spikes followed by disk I/O issues. Despite one target becoming unhealthy, the service remained available because of load balancer redundancy.

Timeline and Observations:

  • 7:25 AM: First significant network traffic spike
  • 8:05 AM: Second network traffic spike
  • 9:09 AM: Sharp increase in I/O wait times and disk usage
  • 9:13 AM: Nagios alert triggered: ezid-stg/ezid-stg_tg_UnhealthyHosts_5x12 CRITICAL

Investigation Needed:

  1. Analyze server logs during the 7:25-9:09 time span to identify traffic sources
  2. Determine the source and pattern of the network spikes, and whether the disk activity resulted directly from them
  3. Review if any internal processes were running on stage during the I/O spike
  4. Check if similar patterns have occurred previously at lower intensities

Mitigation:
Consider moving ezid-stg behind our VPN, similar to our dev environment.

@adambuttrick added the incident and Infrastructure labels Feb 18, 2025
@sfisher (Contributor) commented Feb 18, 2025

I'm having a hard time determining what could have caused this, since the logs, traffic, and other signals don't indicate anything abnormal around the 9:00-9:30 timeframe when we saw the alert. I looked into the OpenSearch logs and reports as well as journalctl for system daemon tasks. I don't know why we were seeing disk queues and, apparently, swap space usage on stg01 around that time (according to Librato). Traffic was quite low then, and there were a couple of blocked and non-blocked requests from bots, but that happens constantly. I also don't see any abnormal daemon processes in the journalctl logs.

Perhaps something else was happening at the system level? I don't see any indication of abnormal app operation around that time. Or perhaps some kind of memory leak was consuming too much memory? 🤷‍♂

[Attached: screenshots of system metrics]

@adambuttrick (Author) commented

Per team discussion, the alerts and automated intervention functioned as expected. A health check runs once every minute. When a host looks unhealthy, checks run more frequently (every 10 seconds); if three of those checks fail, the host is flagged as unhealthy, and it is marked healthy again once five consecutive checks pass. This sequence and resolution were observed for this incident. As long as we do not see a recurrence that degrades all hosts (rather than failing over via the high-availability setup), no action is needed.
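
If we ever want to double-check what the target group is actually configured to do, the health-check thresholds and the current state of each target can be pulled with boto3. A minimal sketch, assuming AWS credentials and region are already configured; the target group name is the one quoted in the Nagios alert later in this thread:

```python
# Sketch: confirm ALB target group health-check settings and current target health.
# Assumes AWS credentials/region are configured; the target group name is taken
# from the Nagios alert below and may need adjusting.
import boto3

elbv2 = boto3.client("elbv2")

tg = elbv2.describe_target_groups(Names=["uc3-ezidui-stg-tg"])["TargetGroups"][0]
print("Interval (s):        ", tg["HealthCheckIntervalSeconds"])
print("Healthy threshold:   ", tg["HealthyThresholdCount"])
print("Unhealthy threshold: ", tg["UnhealthyThresholdCount"])

# Current health state of each registered target.
health = elbv2.describe_target_health(TargetGroupArn=tg["TargetGroupArn"])
for desc in health["TargetHealthDescriptions"]:
    target = desc["Target"]["Id"]
    state = desc["TargetHealth"]["State"]            # healthy / unhealthy / draining ...
    reason = desc["TargetHealth"].get("Reason", "")  # e.g. Target.FailedHealthChecks
    print(f"{target}: {state} {reason}")
```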

@adambuttrick (Author) commented Mar 17, 2025

Nagios alerts show a possible recurrence on ezid-stg on 2025/03/17 at 15:32:

ezid-stg/ezid-stg_tg_UnhealthyHosts_5x12 is CRITICAL:
CRITICAL - Name=TargetGroup,Value=targetgroup/uc3-ezidui-stg-tg/8edf57e4e5c09191 Name=LoadBalancer,Value=app/uc3-ezidui-stg-alb/fbd5f47e4269fca8 UnHealthyHostCount (1 min Maximum): 1.000000000 Count - VALUE is wrong. It SHOULD BE inside the range {0 ... 0}

Need to inspect the logs to see whether a similar traffic spike occurred and triggered this. The health check passed 10 minutes later, at 15:42 on 2025/03/17.
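
One quick way to confirm the window in which the target was flagged is to pull the same CloudWatch metric the Nagios check alarms on. A minimal boto3 sketch, reusing the TargetGroup and LoadBalancer dimensions from the alert above; the time window and time zone are illustrative and would need adjusting:

```python
# Sketch: pull UnHealthyHostCount around the 2025/03/17 15:32 alert, using the
# dimensions quoted in the Nagios message above. Assumes AWS credentials/region
# are configured; the UTC window below is illustrative.
from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/uc3-ezidui-stg-tg/8edf57e4e5c09191"},
        {"Name": "LoadBalancer", "Value": "app/uc3-ezidui-stg-alb/fbd5f47e4269fca8"},
    ],
    StartTime=datetime(2025, 3, 17, 15, 0, tzinfo=timezone.utc),
    EndTime=datetime(2025, 3, 17, 16, 0, tzinfo=timezone.utc),
    Period=60,                  # 1-minute datapoints, matching the alarm
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```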

@adambuttrick reopened this Mar 17, 2025
@jsjiang (Contributor) commented Mar 18, 2025

System status on March 17 around 3:30pm:

  • ezidui-stg: CPU, memory, and network I/O usage were all very low
  • ezid-stg-rds: CPU usage was very low; only 1 read and 1 write activity
  • OpenSearch log (a query sketch for pulling these numbers follows this list):
    • there was a spike of 2,210 requests in 5 minutes around 3:25pm
    • the requests were mainly from two IP addresses:
      • 34.57.7.70: Google Datacenter in Iowa - /contact
      • 34.28.58.52: Google Datacenter in Iowa - /search
    • we had other spikes, such as 6K requests in 5 minutes around 9:30am on March 14, but those requests were mainly for identifiers, not /contact or /search
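
For reference, a sketch of the kind of OpenSearch query used to pull the top client IPs (and the paths they hit) for a 5-minute window. The endpoint URL, index pattern, and field names are assumptions about the log schema, not our actual configuration:

```python
# Sketch only: top client IPs and paths in a 5-minute window via an OpenSearch
# terms aggregation. Endpoint, index pattern, and field names ("@timestamp",
# "clientip", "request") are hypothetical; auth is omitted.
import requests

OPENSEARCH_URL = "https://opensearch.example.org:9200"  # hypothetical endpoint
INDEX = "ezid-stg-access-*"                             # hypothetical index pattern

query = {
    "size": 0,
    "query": {
        "range": {
            "@timestamp": {"gte": "2025-03-17T15:20:00", "lte": "2025-03-17T15:30:00"}
        }
    },
    "aggs": {
        "by_ip": {
            "terms": {"field": "clientip.keyword", "size": 10},
            "aggs": {"by_path": {"terms": {"field": "request.keyword", "size": 5}}},
        }
    },
}

resp = requests.post(f"{OPENSEARCH_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for ip_bucket in resp.json()["aggregations"]["by_ip"]["buckets"]:
    print(ip_bucket["key"], ip_bucket["doc_count"])
    for path_bucket in ip_bucket["by_path"]["buckets"]:
        print("   ", path_bucket["key"], path_bucket["doc_count"])
```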

We may want to add more restrictions on access to the /contact and /search endpoints.
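
As a rough illustration of what such a restriction could mean (this is not EZID code, and in practice it would more likely be enforced at the load balancer, WAF, or web server), a per-IP sliding-window limit on those two paths:

```python
# Illustration only: a per-IP sliding-window limiter for /contact and /search.
# Thresholds are made up; this just sketches the kind of restriction proposed above.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300            # 5-minute window, matching the spike granularity above
MAX_REQUESTS = 100              # illustrative threshold
LIMITED_PREFIXES = ("/contact", "/search")

_hits: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str, path: str, now: float | None = None) -> bool:
    """Return False if this IP has exceeded the limit on the protected endpoints."""
    if not path.startswith(LIMITED_PREFIXES):
        return True
    now = time.time() if now is None else now
    window = _hits[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()        # drop hits that fell out of the window
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```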

Requests in the past 7 days show a few spikes; the 3/17 3:30pm one was relatively low.

[Attached: screenshots]

Requests on March 17 - most requests were from two IPs on /contact and /search:

[Attached: screenshots]

@jsjiang (Contributor) commented Mar 18, 2025

Sample searches from 34.28.58.52


"user_agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.52",

"request": "GET https://35.162.220.79:443/search?publisher=http%3A%2F%2Fwww.pypi.org%2F&object_type=PhysicalObject&creator=test%40email.com&title=Hello%20World&pubyear_from=1982&keywords=FrAmE30&filtered=t&identifier=John8212&pubyear_to=test%40email.com&id_type=doi HTTP/1.1",

Decoded URL:

https://35.162.220.79:443/search?publisher=http://www.pypi.org/&object_type=PhysicalObject&[email protected]&title=Hello World&pubyear_from=1982&keywords=FrAmE30&filtered=t&identifier=John8212&[email protected]&id_type=doi

"request": "GET https://35.162.220.79:443/search?publisher=56&object_type=PhysicalObject&creator=test%40email.com&title=Hello%20World&pubyear_from=1982&keywords=%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22ISO-8859-1%22%3F%3E%3C%21DOCTYPE%20xxe_test%20%5B%20%3C%21ENTITY%20xxe_test%20SYSTEM%20%22http%3A%2F%2Fw3af.org%2Fxxe.txt%22%3E%20%5D%3E%3Ckeywords%3E%26xxe_test%3B%3C%2Fkeywords%3E&filtered=t&identifier=John8212&pubyear_to=test%40email.com&id_type=doi HTTP/1.1",

Decoded URL:

https://35.162.220.79:443/search?publisher=56&object_type=PhysicalObject&[email protected]&title=Hello World&pubyear_from=1982&keywords=<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE xxe_test [ <!ENTITY xxe_test SYSTEM "http://w3af.org/xxe.txt"> ]><keywords>&xxe_test;</keywords>&filtered=t&identifier=John8212&[email protected]&id_type=doi
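
For reference, the decoding above can be reproduced with the Python standard library; a short sketch using the second sampled request line:

```python
# Sketch: reproduce the decoding of the sampled request line with the standard
# library. The raw URL below is the XXE probe copied from the log sample above.
from urllib.parse import unquote, urlsplit, parse_qs

raw = ("https://35.162.220.79:443/search?publisher=56&object_type=PhysicalObject"
       "&creator=test%40email.com&title=Hello%20World&pubyear_from=1982"
       "&keywords=%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22ISO-8859-1%22%3F%3E"
       "%3C%21DOCTYPE%20xxe_test%20%5B%20%3C%21ENTITY%20xxe_test%20SYSTEM%20"
       "%22http%3A%2F%2Fw3af.org%2Fxxe.txt%22%3E%20%5D%3E%3Ckeywords%3E%26xxe_test%3B"
       "%3C%2Fkeywords%3E&filtered=t&identifier=John8212&pubyear_to=test%40email.com&id_type=doi")

print(unquote(raw))                     # full decoded URL, as quoted above

params = parse_qs(urlsplit(raw).query)  # per-parameter view makes the payload obvious
for name, values in params.items():
    print(name, "=", values[0])
```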

@marisastrong added the troubleshooting label Mar 18, 2025