[INVESTIGATION] Unusual Network Spikes and Disk Activity on ezid-stg Environment #837

Open
adambuttrick opened this issue Feb 18, 2025 · 5 comments
Labels: incident (A summary of an incident - what happened and how we handled it), Infrastructure (VMs, network and other infrastructure activities), troubleshooting (Operational issues that require troubleshooting and future reference)

Comments

@adambuttrick

Summary:
On 2025/02/18, ezid-stg experienced abnormal traffic, with two distinct network spikes followed by disk I/O issues. Despite one target becoming unhealthy, the service remained available because of load balancer redundancy.

Timeline and Observations:

  • 7:25 AM: First significant network traffic spike
  • 8:05 AM: Second network traffic spike
  • 9:09 AM: Sharp increase in I/O wait times and disk usage
  • 9:13 AM: Nagios alert triggered: ezid-stg/ezid-stg_tg_UnhealthyHosts_5x12 CRITICAL

Investigation Needed:

  1. Analyze server logs during the 7:25-9:09 time span to identify traffic sources
  2. Determine the source and pattern of the network spikes, and whether the disk activity resulted directly from them
  3. Review if any internal processes were running on stage during the I/O spike
  4. Check if similar patterns have occurred previously at lower intensities

Mitigation:
Consider moving ezid-stg behind our VPN, similar to our dev environment.

@adambuttrick added the incident and Infrastructure labels Feb 18, 2025
@sfisher (Contributor) commented Feb 18, 2025

I'm having a hard time determining what could have caused this, since the logs, traffic, and other signals don't indicate anything abnormal around the 9:00-9:30 timeframe when we saw the alert. I looked into the OpenSearch logs and reports as well as journalctl for system daemon tasks. I don't know why we were seeing disk queues and, apparently, swap space usage on stg01 around that time (according to Librato). Traffic was quite low then, and there were a couple of blocked and non-blocked requests from bots, but that happens constantly. I also don't see any abnormal daemon processes in the journalctl logs.

Perhaps something else was happening at the system level? I don't see any indication of abnormal app operation around that time. Or perhaps some kind of memory leak was consuming too much memory? 🤷‍♂

[Attached: screenshots of system metrics]

@adambuttrick (Author) commented

Per team discussion, the alerts and automated intervention functioned as expected. A health check runs once every minute. When a host looks unhealthy, checks run more frequently (every 10 seconds); if three of those checks fail, the host is flagged as unhealthy, and it is marked healthy again once five consecutive checks pass. This sequence and resolution were observed for this incident. As long as we do not see a recurrence that degrades all hosts (rather than failing over via the high-availability setup), no action is needed.
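
If we ever want to double-check what the target group is actually configured to do, the health-check thresholds and the current state of each target can be pulled with boto3. A minimal sketch, assuming AWS credentials and region are already configured; the target group name is the one quoted in the Nagios alert later in this thread:

```python
# Sketch: confirm ALB target group health-check settings and current target health.
# Assumes AWS credentials/region are configured; the target group name is taken
# from the Nagios alert below and may need adjusting.
import boto3

elbv2 = boto3.client("elbv2")

tg = elbv2.describe_target_groups(Names=["uc3-ezidui-stg-tg"])["TargetGroups"][0]
print("Interval (s):        ", tg["HealthCheckIntervalSeconds"])
print("Healthy threshold:   ", tg["HealthyThresholdCount"])
print("Unhealthy threshold: ", tg["UnhealthyThresholdCount"])

# Current health state of each registered target.
health = elbv2.describe_target_health(TargetGroupArn=tg["TargetGroupArn"])
for desc in health["TargetHealthDescriptions"]:
    target = desc["Target"]["Id"]
    state = desc["TargetHealth"]["State"]            # healthy / unhealthy / draining ...
    reason = desc["TargetHealth"].get("Reason", "")  # e.g. Target.FailedHealthChecks
    print(f"{target}: {state} {reason}")
```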

@adambuttrick (Author) commented Mar 17, 2025

Nagios alerts show a possible recurrence on ezid-stg on 2025/03/17 at 15:32:

ezid-stg/ezid-stg_tg_UnhealthyHosts_5x12 is CRITICAL:
CRITICAL - Name=TargetGroup,Value=targetgroup/uc3-ezidui-stg-tg/8edf57e4e5c09191 Name=LoadBalancer,Value=app/uc3-ezidui-stg-alb/fbd5f47e4269fca8 UnHealthyHostCount (1 min Maximum): 1.000000000 Count - VALUE is wrong. It SHOULD BE inside the range {0 ... 0}

Need to inspect the logs to see whether a similar traffic spike occurred and triggered this. The health check passed 10 minutes later, at 15:42 on 2025/03/17.
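
One quick way to confirm the window in which the target was flagged is to pull the same CloudWatch metric the Nagios check alarms on. A minimal boto3 sketch, reusing the TargetGroup and LoadBalancer dimensions from the alert above; the time window and time zone are illustrative and would need adjusting:

```python
# Sketch: pull UnHealthyHostCount around the 2025/03/17 15:32 alert, using the
# dimensions quoted in the Nagios message above. Assumes AWS credentials/region
# are configured; the UTC window below is illustrative.
from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/uc3-ezidui-stg-tg/8edf57e4e5c09191"},
        {"Name": "LoadBalancer", "Value": "app/uc3-ezidui-stg-alb/fbd5f47e4269fca8"},
    ],
    StartTime=datetime(2025, 3, 17, 15, 0, tzinfo=timezone.utc),
    EndTime=datetime(2025, 3, 17, 16, 0, tzinfo=timezone.utc),
    Period=60,                  # 1-minute datapoints, matching the alarm
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```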

@adambuttrick reopened this Mar 17, 2025
@jsjiang (Contributor) commented Mar 18, 2025

System status on March 17 around 3:30pm:

  • ezidui-stg: CPU, memory, and network I/O usage were all very low
  • ezid-stg-rds: CPU usage was very low; only 1 read and 1 write activity
  • OpenSearch log (a query sketch for pulling these numbers follows this list):
    • there was a spike of 2,210 requests in 5 minutes around 3:25pm
    • the requests were mainly from two IP addresses:
      • 34.57.7.70: Google Datacenter in Iowa - /contact
      • 34.28.58.52: Google Datacenter in Iowa - /search
    • we had other spikes, such as 6K requests in 5 minutes around 9:30am on March 14, but those requests were mainly for identifiers, not /contact or /search
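
For reference, a sketch of the kind of OpenSearch query used to pull the top client IPs (and the paths they hit) for a 5-minute window. The endpoint URL, index pattern, and field names are assumptions about the log schema, not our actual configuration:

```python
# Sketch only: top client IPs and paths in a 5-minute window via an OpenSearch
# terms aggregation. Endpoint, index pattern, and field names ("@timestamp",
# "clientip", "request") are hypothetical; auth is omitted.
import requests

OPENSEARCH_URL = "https://opensearch.example.org:9200"  # hypothetical endpoint
INDEX = "ezid-stg-access-*"                             # hypothetical index pattern

query = {
    "size": 0,
    "query": {
        "range": {
            "@timestamp": {"gte": "2025-03-17T15:20:00", "lte": "2025-03-17T15:30:00"}
        }
    },
    "aggs": {
        "by_ip": {
            "terms": {"field": "clientip.keyword", "size": 10},
            "aggs": {"by_path": {"terms": {"field": "request.keyword", "size": 5}}},
        }
    },
}

resp = requests.post(f"{OPENSEARCH_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for ip_bucket in resp.json()["aggregations"]["by_ip"]["buckets"]:
    print(ip_bucket["key"], ip_bucket["doc_count"])
    for path_bucket in ip_bucket["by_path"]["buckets"]:
        print("   ", path_bucket["key"], path_bucket["doc_count"])
```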

We may want to add more restrictions on access to the /contact and /search endpoints.
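
As a rough illustration of what such a restriction could mean (this is not EZID code, and in practice it would more likely be enforced at the load balancer, WAF, or web server), a per-IP sliding-window limit on those two paths:

```python
# Illustration only: a per-IP sliding-window limiter for /contact and /search.
# Thresholds are made up; this just sketches the kind of restriction proposed above.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300            # 5-minute window, matching the spike granularity above
MAX_REQUESTS = 100              # illustrative threshold
LIMITED_PREFIXES = ("/contact", "/search")

_hits: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str, path: str, now: float | None = None) -> bool:
    """Return False if this IP has exceeded the limit on the protected endpoints."""
    if not path.startswith(LIMITED_PREFIXES):
        return True
    now = time.time() if now is None else now
    window = _hits[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()        # drop hits that fell out of the window
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```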

Requests in the past 7 days show a few spikes; the 3/17 3:30pm one was relatively low.

[Attached: screenshots]

Requests on March 17 - most requests were from two IPs on /contact and /search:

[Attached: screenshots]

@jsjiang (Contributor) commented Mar 18, 2025

Sample searches from 34.28.58.52


"user_agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.52",

"request": "GET https://35.162.220.79:443/search?publisher=http%3A%2F%2Fwww.pypi.org%2F&object_type=PhysicalObject&creator=test%40email.com&title=Hello%20World&pubyear_from=1982&keywords=FrAmE30&filtered=t&identifier=John8212&pubyear_to=test%40email.com&id_type=doi HTTP/1.1",

Decoded URL:

https://35.162.220.79:443/search?publisher=http://www.pypi.org/&object_type=PhysicalObject&[email protected]&title=Hello World&pubyear_from=1982&keywords=FrAmE30&filtered=t&identifier=John8212&[email protected]&id_type=doi

"request": "GET https://35.162.220.79:443/search?publisher=56&object_type=PhysicalObject&creator=test%40email.com&title=Hello%20World&pubyear_from=1982&keywords=%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22ISO-8859-1%22%3F%3E%3C%21DOCTYPE%20xxe_test%20%5B%20%3C%21ENTITY%20xxe_test%20SYSTEM%20%22http%3A%2F%2Fw3af.org%2Fxxe.txt%22%3E%20%5D%3E%3Ckeywords%3E%26xxe_test%3B%3C%2Fkeywords%3E&filtered=t&identifier=John8212&pubyear_to=test%40email.com&id_type=doi HTTP/1.1",

Decoded URL:

https://35.162.220.79:443/search?publisher=56&object_type=PhysicalObject&[email protected]&title=Hello World&pubyear_from=1982&keywords=<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE xxe_test [ <!ENTITY xxe_test SYSTEM "http://w3af.org/xxe.txt"> ]><keywords>&xxe_test;</keywords>&filtered=t&identifier=John8212&[email protected]&id_type=doi
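
For reference, the decoding above can be reproduced with the Python standard library; a short sketch using the second sampled request line:

```python
# Sketch: reproduce the decoding of the sampled request line with the standard
# library. The raw URL below is the XXE probe copied from the log sample above.
from urllib.parse import unquote, urlsplit, parse_qs

raw = ("https://35.162.220.79:443/search?publisher=56&object_type=PhysicalObject"
       "&creator=test%40email.com&title=Hello%20World&pubyear_from=1982"
       "&keywords=%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22ISO-8859-1%22%3F%3E"
       "%3C%21DOCTYPE%20xxe_test%20%5B%20%3C%21ENTITY%20xxe_test%20SYSTEM%20"
       "%22http%3A%2F%2Fw3af.org%2Fxxe.txt%22%3E%20%5D%3E%3Ckeywords%3E%26xxe_test%3B"
       "%3C%2Fkeywords%3E&filtered=t&identifier=John8212&pubyear_to=test%40email.com&id_type=doi")

print(unquote(raw))                     # full decoded URL, as quoted above

params = parse_qs(urlsplit(raw).query)  # per-parameter view makes the payload obvious
for name, values in params.items():
    print(name, "=", values[0])
```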

@marisastrong added the troubleshooting label Mar 18, 2025