@TimDiekmann

🌟 What is the purpose of this PR?

Add RDS I/O performance monitoring CloudWatch alarms to detect potential database performance issues before they impact users.

🔍 What does this change?

  • Adds a CloudWatch alarm for high disk queue depth (DiskQueueDepth > 10) to detect I/O bottlenecks
  • Adds a CloudWatch alarm for high read IOPS (ReadIOPS > 500) to monitor read-intensive workloads
  • Adds a CloudWatch alarm for high write IOPS (WriteIOPS > 500) to monitor write-intensive workloads
  • Configures each alarm to fire only when 3 of the last 5 datapoints breach its threshold (to avoid flapping) and to notify PagerDuty through the existing SNS topic
  • Sets severity levels (CRITICAL for disk queue depth, WARNING for IOPS metrics)
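
As a rough illustration of the list above, the three I/O alarms could be declared from one shared definition. This is a minimal sketch, not the code merged in this PR: the resource and local names, both input variables, the 60-second period, the `Average` statistic, and the use of a `Severity` tag are all assumptions. Only the metric names, thresholds, severities, and the 3-of-5 evaluation come from the PR text.

```hcl
# Hypothetical inputs; the real module may wire these up differently.
variable "db_instance_identifier" {
  type        = string
  description = "Identifier of the RDS instance to monitor (assumed input)."
}

variable "alerts_sns_topic_arn" {
  type        = string
  description = "ARN of the existing SNS topic that forwards to PagerDuty (assumed input)."
}

locals {
  # Metric names, thresholds, and severities as listed in the PR description.
  rds_io_alarms = {
    disk_queue_depth = { metric = "DiskQueueDepth", threshold = 10, severity = "CRITICAL" }
    read_iops        = { metric = "ReadIOPS", threshold = 500, severity = "WARNING" }
    write_iops       = { metric = "WriteIOPS", threshold = 500, severity = "WARNING" }
  }
}

resource "aws_cloudwatch_metric_alarm" "rds_io" {
  for_each = local.rds_io_alarms

  alarm_name          = "rds-${each.key}-high"
  alarm_description   = "[${each.value.severity}] RDS ${each.value.metric} above ${each.value.threshold}"
  namespace           = "AWS/RDS"
  metric_name         = each.value.metric
  statistic           = "Average" # assumption: the PR text does not name a statistic
  comparison_operator = "GreaterThanThreshold"
  threshold           = each.value.threshold

  # 3 of the last 5 datapoints must breach before the alarm fires.
  period              = 60 # assumption: the PR text does not state a period
  evaluation_periods  = 5
  datapoints_to_alarm = 3

  dimensions = {
    DBInstanceIdentifier = var.db_instance_identifier
  }

  alarm_actions = [var.alerts_sns_topic_arn]

  # Assumption: severity is carried as a tag; the PR may encode it elsewhere.
  tags = {
    Severity = each.value.severity
  }
}
```

Driving the resource from a map keeps the evaluation and notification settings identical across the three metrics, so only the metric name, threshold, and severity vary per alarm.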

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

  • does not modify any publishable blocks or libraries, or modifications do not need publishing

📜 Does this require a change to the docs?

The changes in this PR:

  • are internal and do not require a docs change

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

  • do not affect the execution graph

🛡 What tests cover this?

  • Terraform plan validation will verify the syntax and structure of the new alarm resources

❓ How to test this?

  1. Apply the Terraform changes to a non-production environment
  2. Verify the alarms appear in the CloudWatch console
  3. Optionally generate high I/O load to test alarm triggering

@github-actions github-actions bot added area/infra Relates to version control, CI, CD or IaC (area) area/infra > terraform labels Nov 3, 2025
@graphite-app graphite-app bot requested a review from a team November 3, 2025 19:40

graphite-app bot commented Nov 3, 2025

Graphite Automations

"Request DevOps reviewers once CI passes" took an action on this PR • (11/03/25)

1 reviewer was added to this PR based on Tim Diekmann's automation.

@graphite-app graphite-app bot changed the base branch from t/sre-86-vanta-remediate-sql-database-freeable-memory-monitored-aws to graphite-base/8005 November 4, 2025 11:07
Adds DiskQueueDepth, ReadIOPS, and WriteIOPS alarms to satisfy Vanta
requirement for database I/O monitoring (SRE-79).

- DiskQueueDepth > 10 operations (CRITICAL)
- ReadIOPS > 500 IOPS (WARNING)
- WriteIOPS > 500 IOPS (WARNING)

All alarms fire only when 3 of the last 5 datapoints breach the threshold
(to avoid flapping) and send notifications to PagerDuty via the existing
SNS topic.

Adds DatabaseConnections alarm to monitor connection pool exhaustion.

- DatabaseConnections > 180 (80% of max 225) (CRITICAL)
- Evaluates 2 of 3 datapoints over 15 minutes

Connection exhaustion can block all database access, making this
a critical metric to monitor.
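
The connection alarm described in this commit message might look roughly like the following. This is a sketch under stated assumptions rather than the merged code: the resource name, both input variables, the `Average` statistic, and the `Severity` tag are hypothetical, and the 300-second period is inferred from "3 datapoints over 15 minutes".

```hcl
# Hypothetical inputs; the real module may wire these up differently.
variable "db_instance_identifier" {
  type        = string
  description = "Identifier of the RDS instance to monitor (assumed input)."
}

variable "alerts_sns_topic_arn" {
  type        = string
  description = "ARN of the existing SNS topic that forwards to PagerDuty (assumed input)."
}

resource "aws_cloudwatch_metric_alarm" "rds_database_connections" {
  alarm_name          = "rds-database-connections-high"
  alarm_description   = "[CRITICAL] RDS DatabaseConnections above 180 (80% of the 225 maximum)"
  namespace           = "AWS/RDS"
  metric_name         = "DatabaseConnections"
  statistic           = "Average" # assumption: the commit message does not name a statistic
  comparison_operator = "GreaterThanThreshold"

  # 180 is 80% of the stated maximum of 225 connections.
  threshold = 180

  # 2 of 3 datapoints at 5 minutes each covers the 15-minute window
  # (the period is inferred, not stated in the commit message).
  period              = 300
  evaluation_periods  = 3
  datapoints_to_alarm = 2

  dimensions = {
    DBInstanceIdentifier = var.db_instance_identifier
  }

  alarm_actions = [var.alerts_sns_topic_arn]

  # Assumption: severity is carried as a tag; the PR may encode it elsewhere.
  tags = {
    Severity = "CRITICAL"
  }
}
```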
@TimDiekmann TimDiekmann force-pushed the t/sre-87-vanta-remediate-database-io-monitored-aws branch from 489cd81 to f79246a Compare November 4, 2025 11:34
@graphite-app graphite-app bot changed the base branch from graphite-base/8005 to main November 4, 2025 11:35

graphite-app bot commented Nov 4, 2025

Merge activity

  • Nov 4, 11:35 AM UTC: Graphite rebased this pull request, because this pull request is set to merge when ready.

@TimDiekmann TimDiekmann added this pull request to the merge queue Nov 4, 2025
Merged via the queue into main with commit 2808319 Nov 4, 2025
50 of 64 checks passed
@TimDiekmann TimDiekmann deleted the t/sre-87-vanta-remediate-database-io-monitored-aws branch November 4, 2025 19:08

3 participants