[metrics]: Not able to go into a firing state when VPN tunnel is down for VPN-tunnel and ALB, CLB, & NLB #639

Open
rajualap opened this issue Jan 19, 2024 · 1 comment
Labels
metrics-configuration How to configure specific metrics for collection

Comments

@rajualap

rajualap commented Jan 19, 2024

Hi,

I have created a 'cloudwatch-exporter.yml' file to fetch metrics from CloudWatch for RDS, Lambda, VPN tunnels, ALB, CLB, and NLB. The RDS and Lambda metrics show up in Prometheus, and when there is a problem with RDS or Lambda, the alert rules go into the firing state and generate alerts. However, we are not receiving alerts for the VPN tunnel or for ALB, CLB, and NLB. Can you please help identify the reason? The 'cloudwatch-exporter.yml' file and the alert rules are below.

Please assist in resolving this issue.

cloudwatch-exporter.yml file:

region: ap-south-1
metrics:
  - aws_namespace: AWS/RDS
    aws_metric_name: BurstBalance
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/RDS
    aws_metric_name: FreeableMemory
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/RDS
    aws_metric_name: CPUUtilization
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/RDS
    aws_metric_name: DatabaseConnections
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]

  - aws_namespace: AWS/Lambda
    aws_metric_name: Duration
    aws_dimensions: [FunctionName]
    aws_statistics: [Average]

  - aws_namespace: AWS/Lambda
    aws_metric_name: Errors
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]

  - aws_namespace: AWS/Lambda
    aws_metric_name: Invocations
    aws_dimensions: [FunctionName]
    aws_statistics: [Sum]

  - aws_namespace: AWS/ElasticLoadBalancing
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [LoadBalancerName]
    aws_statistics: [Average]

  - aws_namespace: AWS/ElasticLoadBalancing
    aws_metric_name: RequestCount
    aws_dimensions: [LoadBalancerName]
    aws_statistics: [Sum]

  - aws_namespace: AWS/VPN
    aws_metric_name: TunnelState
    aws_dimensions: [VpnId]
    aws_statistics: [Average]
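
A note on the load balancer entries in this config: as far as I know, CloudWatch does not publish metrics under a namespace called AWS/ElasticLoadBalancing. Classic, Application, and Network Load Balancers each report under their own namespace, and ALB/NLB use a LoadBalancer dimension rather than LoadBalancerName. The fragment below is a minimal sketch of how those entries might look under that assumption; the choice of UnHealthyHostCount and of the [LoadBalancer, TargetGroup] dimension pair is illustrative and should be checked against what the CloudWatch console shows for the account.

```yaml
region: ap-south-1
metrics:
  # Classic Load Balancer (CLB)
  - aws_namespace: AWS/ELB
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [LoadBalancerName]
    aws_statistics: [Average]

  # Application Load Balancer (ALB); dimensions are an assumption to verify
  - aws_namespace: AWS/ApplicationELB
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [LoadBalancer, TargetGroup]
    aws_statistics: [Average]

  # Network Load Balancer (NLB); dimensions are an assumption to verify
  - aws_namespace: AWS/NetworkELB
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [LoadBalancer, TargetGroup]
    aws_statistics: [Average]
```

If the namespace does not match, the exporter has nothing to list, so no series reach Prometheus and no rule on them can fire.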

####################################

Prometheus VPN tunnel alert rules file:

groups:
  - name: VPNAlerts
    rules:
      # Alert if the average VPN tunnel state is less than 1 (indicating down) for 5 minutes
      - alert: VPNDownCritical
        expr: aws_vpn_tunnel_state_average < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Down Critical'
          description: 'At least one VPN tunnel is down.'

      # Alert if the average VPN tunnel state is less than 1 for 1 minute
      - alert: VPNDownWarning
        expr: aws_vpn_tunnel_state_average < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Down Warning'
          description: 'At least one VPN tunnel is down.'

      # Alert if there are changes in VPN tunnel state indicating flapping for 5 minutes
      - alert: VPNFlapping
        expr: changes(aws_vpn_tunnel_state_average[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          LABELS: '{{ $labels }}'
          VALUE: '{{ $value }}'
          summary: 'VPN Tunnel Flapping'
          description: 'At least one VPN tunnel is experiencing flapping.'

CloudWatch metrics:

[screenshot: image010]

@rajualap added the metrics-configuration (How to configure specific metrics for collection) label Jan 19, 2024
@rajualap changed the title from "[metrics]: short description here" to "[metrics]: Not able to go into a firing state when VPN tunnel is down for VPN-tunnel and ALB, CLB, & NLB" Jan 19, 2024
@matthiasr
Contributor

What does aws_vpn_tunnel_state_average look like in the /metrics endpoint? What does it look like in the Prometheus graph and table views?

It seems that you are using the default delay_seconds and set_timestamp. This means the samples carry the original CloudWatch timestamp and lag behind real time, so they are not visible to an instant query evaluated at "now", which is what your rules use. See the documentation for details.
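
To make the two settings concrete, here is a sketch of the VPN entry with both options written out at what I understand to be their documented defaults (delay_seconds: 600, set_timestamp: true). With these values the newest sample is roughly ten minutes old and keeps its CloudWatch timestamp, while a Prometheus instant query only looks back five minutes by default, so it finds nothing.

```yaml
region: ap-south-1
metrics:
  - aws_namespace: AWS/VPN
    aws_metric_name: TunnelState
    aws_dimensions: [VpnId]
    aws_statistics: [Average]
    # Assumed default: the newest data requested is 600 seconds old,
    # to avoid reading data that has not converged yet.
    delay_seconds: 600
    # Assumed default: the exported sample keeps the CloudWatch timestamp
    # instead of the scrape time.
    set_timestamp: true
```

Writing the defaults out does not change behaviour; it only shows where the lag comes from.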

Try min_over_time(aws_vpn_tunnel_state_average[15m]) < 1 and changes(aws_vpn_tunnel_state_average[30m]) > 1 to look back further.
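
Applied to the rules file from the original post, those two expressions might look like the sketch below. Only the expr lines differ from the posted rules; the for durations and annotations are carried over unchanged and may need tuning, since the range window already adds its own look-back.

```yaml
groups:
  - name: VPNAlerts
    rules:
      # Fires if the tunnel state dropped below 1 at any point in the last 15 minutes
      - alert: VPNDownCritical
        expr: min_over_time(aws_vpn_tunnel_state_average[15m]) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'VPN Tunnel Down Critical'
          description: 'At least one VPN tunnel is down.'

      # Fires if the tunnel state changed more than once in the last 30 minutes
      - alert: VPNFlapping
        expr: changes(aws_vpn_tunnel_state_average[30m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'VPN Tunnel Flapping'
          description: 'At least one VPN tunnel is experiencing flapping.'
```

The range selectors make each rule read samples that are older than the instant-query look-back window, which is why they can see the delayed CloudWatch data.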
