

Prometheus metrics disappear in HA setup when all Vault pods are sealed #990

Open
cascadia-sati opened this issue Jan 9, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@cascadia-sati

Describe the bug
I'm deploying an HA Vault setup in our Kubernetes cluster with three replicas. While working on monitoring for the seal status of the Vault pods, I noticed that the Prometheus metrics go away when all Vault pods are sealed, which makes it impossible to trigger an alert for this state.

This apparently happens because the vault ServiceMonitor selects the vault-active Service, which in turn selects the Vault pod carrying the vault-active: "true" label. When all Vault pods are sealed, they all carry vault-active: "false", so the Service has no matching endpoints and scrapes from the ServiceMonitor fail with a 503.
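For illustration, here is a simplified sketch of the selector chain as I understand it (not the exact manifests the chart renders; names and ports are abbreviated):

```yaml
# Sketch only: the ServiceMonitor matches the vault-active Service,
# and that Service selects pods by the vault-active label, which
# Vault's Kubernetes service registration flips based on active status.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vault
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vault
      vault-active: "true"        # only the active Service matches
  endpoints:
    - port: https
      path: /v1/sys/metrics
      params:
        format: ["prometheus"]
---
apiVersion: v1
kind: Service
metadata:
  name: vault-active
  labels:
    vault-active: "true"
spec:
  selector:
    app.kubernetes.io/name: vault
    vault-active: "true"          # no pod carries this while all are sealed
  ports:
    - name: https
      port: 8200
```

When every pod is sealed, the Service's pod selector matches nothing, which is why the scrape target disappears.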

To Reproduce
Simply configure Prometheus metrics, then seal all the Vault pods by restarting them.
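For reference, restarting the pods can be done like this (assuming the default StatefulSet name vault in namespace vault; the pods come back sealed because unsealing is a manual step):

```shell
# Restart all Vault pods; they start sealed until unsealed again
kubectl -n vault rollout restart statefulset/vault

# Confirm the seal state on one pod (reports "Sealed: true")
kubectl -n vault exec vault-0 -- vault status
```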

Expected behavior
We should be able to get metrics and monitor the seal state via the vault_core_unsealed metric even when all Vault pods are sealed.
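With metrics still reachable, an alert on this state could look roughly like the following PrometheusRule sketch (the rule and alert names are our own, not from the chart):

```yaml
# Sketch of a seal-state alert, assuming vault_core_unsealed is scraped
# from every pod: 0 means the instance is sealed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vault-seal-alerts   # hypothetical name
spec:
  groups:
    - name: vault
      rules:
        - alert: VaultSealed
          expr: vault_core_unsealed == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Vault instance {{ $labels.instance }} is sealed"
```

This is exactly the alert that cannot fire today, because the scrape target vanishes once all pods are sealed.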

We achieved this by removing vault-active: "true" from the ServiceMonitor's matchLabels field and adding a new unique label both there and to the vault Service object. This ensures the ServiceMonitor uses only the vault Service object, which routes to the Vault pods regardless of their active status.
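Concretely, the workaround looks roughly like this (the vault-monitoring label name is our own choice for illustration, not something defined by the chart):

```yaml
# Workaround sketch: a label of our own ties the ServiceMonitor to the
# plain "vault" Service, whose pod selector ignores seal/active state.
apiVersion: v1
kind: Service
metadata:
  name: vault
  labels:
    vault-monitoring: "true"      # hypothetical label we add
spec:
  selector:
    app.kubernetes.io/name: vault # selects all pods, sealed or not
  ports:
    - name: https
      port: 8200
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vault
spec:
  selector:
    matchLabels:
      vault-monitoring: "true"    # vault-active: "true" removed
  endpoints:
    - port: https
      path: /v1/sys/metrics
      params:
        format: ["prometheus"]
```

With this, Prometheus scrapes every pod individually, so vault_core_unsealed stays visible even when the whole cluster is sealed.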

Environment

  • Kubernetes version: v1.26.9-eks-a5df82a
    • Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): EKS
  • vault-helm version: 0.25.0

Chart values:

global:
  serverTelemetry:
    prometheusOperator: true
injector:
  enabled: false
server:
  ha:
    enabled: true
    replicas: 3
    # Enable HA for integrated storage
    raft:
      enabled: true
      setNodeId: true
      config: |
        ui = true

        listener "tcp" {
          tls_disable = 1
          address = "[::]:8200"
          cluster_address = "[::]:8201"

          # Enable unauthenticated metrics access for Prometheus Operator
          telemetry {
            unauthenticated_metrics_access = "true"
          }
        }

        telemetry {
          prometheus_retention_time = "30m"
          disable_hostname = true
        }

        storage "raft" {
          path = "/vault/data"
        }

        # For integrated raft storage and security
        # https://developer.hashicorp.com/vault/docs/configuration#disable_mlock
        disable_mlock = true

        service_registration "kubernetes" {}
  serverTelemetry:
    serviceMonitor:
      enabled: true
  dataStorage:
    enabled: true
    size: 5Gi
    storageClass: ebs-gp3
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: {{ template "vault.name" . }}
              app.kubernetes.io/instance: "{{ .Release.Name }}"
              component: server
          topologyKey: topology.kubernetes.io/zone