Disable SNS notifications for alarm that was noisier than intended
DTADFIC, a queue of 100 items can sometimes not send any work to hadoop
abought committed Aug 14, 2024
1 parent 5a84913 commit a1a872b
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion modules/imputation-server/monitoring.tf
@@ -1,3 +1,8 @@
+// This alarm is a useful idea in theory, but it's noisy and doesn't operate on the timescales we need.
+// It also can't account for the dual-queue design of cloudgene (the "active" vs "queued" feature): if 15 jobs
+// are exporting, then even though there are 100 jobs in the queue, hadoop won't be sent work, and will signal "all clear".
+// No amount of alarm cleverness can compensate for a webapp that hides information from the system, which makes it hard to fix the alarm just from the AWS side.
+// We'll keep the alarm defined and tracking metrics, in case it aids future capacity planning. But it won't send alerts.
resource "aws_cloudwatch_metric_alarm" "cluster_needs_resources" {
# Warn if the system is unable to scale enough, after several hours of trying. Resolved by:
# a) add spot capacity (if we're blitzed with lots of jobs),
@@ -13,7 +18,7 @@ resource "aws_cloudwatch_metric_alarm" "cluster_needs_resources" {
datapoints_to_alarm = 24
evaluation_periods = 24

-actions_enabled = true
+actions_enabled = false # Don't send alerts; see the notes above this resource.

# Notify when the alarm changes state- for good or bad
alarm_actions = [var.alert_sns_arn]
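
For readers who want to see the end state in one place, here is a minimal sketch of an alarm that keeps evaluating its metric but sends no notifications. Only the resource name, `datapoints_to_alarm`, `evaluation_periods`, `actions_enabled`, and `alarm_actions` come from this commit. The metric, namespace, statistic, threshold, period, alarm name, and `var.emr_cluster_id` are assumptions chosen for illustration and may not match the real module.

```hcl
# Sketch only. Everything marked "assumed" or "hypothetical" is illustrative
# and is not taken from modules/imputation-server/monitoring.tf.

variable "alert_sns_arn" {
  type        = string
  description = "SNS topic that would receive alarm notifications"
}

variable "emr_cluster_id" {
  type        = string
  description = "Hypothetical: ID of the cluster whose metrics are watched"
}

resource "aws_cloudwatch_metric_alarm" "cluster_needs_resources" {
  alarm_name          = "cluster-needs-resources" # hypothetical name
  namespace           = "AWS/ElasticMapReduce"    # assumed namespace
  metric_name         = "ContainerPendingRatio"   # assumed metric
  statistic           = "Average"                 # assumed
  comparison_operator = "GreaterThanThreshold"    # assumed
  threshold           = 0.75                      # assumed
  period              = 300                       # assumed: 24 x 5 min = 2 hours

  datapoints_to_alarm = 24
  evaluation_periods  = 24

  dimensions = {
    JobFlowId = var.emr_cluster_id
  }

  # The alarm is still evaluated and its state history is still recorded,
  # but no actions run on state changes.
  actions_enabled = false

  # Left in place so that re-enabling alerts is a one-line change.
  alarm_actions = [var.alert_sns_arn]
}
```

Keeping `alarm_actions` populated while setting `actions_enabled = false` means the alarm history remains available for capacity planning, and alerts can be restored by flipping a single argument.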