Add monitoring support for managed EKS WC #2832

Closed
13 tasks done
Tracked by #2817
TheoBrigitte opened this issue Sep 19, 2023 · 18 comments

Comments

@TheoBrigitte
Member

TheoBrigitte commented Sep 19, 2023

Create a managed EKS cluster from an EKS-based CAPA MC (grizzly and golem) using the following guide.

Add support for managed EKS WC in our monitoring and alerting infrastructure.

Checks

Related issues:

@TheoBrigitte TheoBrigitte changed the title Evaluate monitoring on EKS based CAPA WC Evaluate monitoring on managed EKS WC Sep 19, 2023
@QuentinBisson QuentinBisson self-assigned this Sep 19, 2023
@QuentinBisson

Started by upgrading the observability-bundle in default-apps-eks: giantswarm/default-apps-eks#8.

I now have a cluster running on grizzly.

@QuentinBisson

Monitoring:
🔴 Prometheus is currently unable to list or watch pods and endpoints:

ts=2023-09-20T10:52:22.949Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/namespaces/kube-system/pods?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:23.772Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/namespaces/kube-system/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:23.772Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/namespaces/kube-system/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.107Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Pod: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.107Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.395Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.395Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
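
For anyone reading these log lines: the host being resolved is the literal string "https", which suggests that a scheme was prepended to an address that already carried one. The sketch below shows one way to guard against that when building an API server URL. It is only an illustration under that assumption, not code taken from the operator; `normalize_api_server` and the example hostname are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_api_server(address: str, default_port: int = 443) -> str:
    """Return an https URL for the API server without doubling the scheme.

    Hypothetical helper: accepts "host", "host:port" or "https://host[:port]".
    """
    address = address.strip()
    # Only prepend a scheme when the address does not already carry one.
    if "://" not in address:
        address = "https://" + address
    parts = urlsplit(address)
    # urlsplit lowercases the hostname, which is harmless for DNS lookups.
    host = parts.hostname
    if not host:
        raise ValueError(f"no host in API server address: {address!r}")
    port = parts.port or default_port
    return f"https://{host}:{port}"


if __name__ == "__main__":
    # Placeholder hostname, not a real cluster endpoint.
    print(normalize_api_server("https://example.gr7.eu-west-2.eks.amazonaws.com"))
    print(normalize_api_server("example.gr7.eu-west-2.eks.amazonaws.com:443"))
```

Both calls produce the same URL, so callers do not need to know whether the stored endpoint already includes the scheme.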

🟡 Scraping targets:
The list of targets is currently empty, most likely because of the issue above.

🟢 Alerting:
image

🟡 Alert status in Prometheus
A few are in pending, to be investigated

🟢 Grafana:
image

🟢 Grafana cloud:
image

@QuentinBisson

QuentinBisson commented Sep 20, 2023

Status Update

Monitoring:
🟢 Prometheus fixed with giantswarm/prometheus-meta-operator#1375

🔴 Scraping targets:

🟡 Alerting
A few are in pending, to be investigated

@QuentinBisson

Related giantswarm/default-apps-eks#10

@QuentinBisson

Status Update

🟠 Scraping targets:
Prometheus on the MC cannot scrape cert-exporter, cert-manager-app and chart-operator because the API server proxy is either misconfigured on managed EKS or not working at all. Fixing this would require a deep investigation by @giantswarm/team-phoenix.
Since we are moving to service monitors anyway, it would be easier for everyone to move the failing apps to service monitors instead:

🔴 Prometheus-operator and kube-state-metrics are unstable
The pods are recreated every minute even though they are in a Running state. We need to dig deeper into this.

🔴 Alerting
list of alerts
The current list of alerts is long, but only 3 of them are relevant; most of the others are caused by the scraping errors above:

  • WorkloadClusterAPIServerAdmissionWebhookErrors: We need to investigate what is happening.
  • WorkloadClusterEtcdMetricsMissing: We do not have any etcd metrics because we cannot monitor etcd on managed clusters. We need to figure out how to skip this alert for EKS clusters.
  • PrometheusAgentShardsMissing: Not sure why it pages.
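
One possible way to skip the etcd alert on managed clusters is to gate it on the cluster's provider. The sketch below expresses that rule in Python purely for illustration; in practice the condition would more likely live in the alert rule itself. The `provider` label, its `eks` value and the function name are assumptions, not names confirmed in this thread.

```python
# Alerts that only make sense when we operate etcd ourselves.
ETCD_ALERTS = {"WorkloadClusterEtcdMetricsMissing"}

# Providers where the control plane (including etcd) is managed for us.
# The value "eks" is an assumed label value.
MANAGED_PROVIDERS = {"eks"}


def should_evaluate(alert_name: str, cluster_labels: dict) -> bool:
    """Return False for etcd alerts on clusters with a managed control plane."""
    provider = cluster_labels.get("provider", "")
    if alert_name in ETCD_ALERTS and provider in MANAGED_PROVIDERS:
        return False
    return True


if __name__ == "__main__":
    print(should_evaluate("WorkloadClusterEtcdMetricsMissing", {"provider": "eks"}))
    print(should_evaluate("WorkloadClusterEtcdMetricsMissing", {"provider": "capa"}))
```

Keeping the set of etcd alerts separate from the set of managed providers means other managed offerings could be added later without touching the rule logic.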

@QuentinBisson

QuentinBisson commented Sep 21, 2023

Status Update

🟢 Prometheus operator and kube-state-metrics keep getting evicted by VPA because the recommender cannot get metrics from metrics-server. Relevant PM: https://github.com/giantswarm/giantswarm/issues/28252

🔴 Alerting
Only WorkloadClusterEtcdMetricsMissing is left, so I will create an issue to handle that: #2843

@QuentinBisson

What to do about metrics that rely on etcd_kubernetes_resources_count?

@T-Kukawka
Contributor

What to do about metrics that rely on etcd_kubernetes_resources_count?

they will have to go away as we do not manage ETCD anymore in this scenario @QuentinBisson

@QuentinBisson

@T-Kukawka
Contributor

ah, I see :( Then we have to align with BigMac and Shield on how else this could be monitored, or whether it is even relevant.

@QuentinBisson

True, but if we can replace those 2 alerts with something else, we can probably remove the etcd-kubernetes-resources-count-exporter component altogether, because they seem to be its only use case.

@T-Kukawka
Contributor

True, I believe it should be removed for EKS at least (we still use it in CAPI and Vintage, especially for incident monitoring, e.g. when etcd is overflowing).

@QuentinBisson

oh sure :)

@QuentinBisson

QuentinBisson commented Sep 28, 2023

🟢 Scraping targets:

Once it makes it into default-apps-eks, we can release and have everything running.

🟠 Alerting is in testing:
image

🟢 Created issues to get rid of etcd

@QuentinBisson

QuentinBisson commented Oct 2, 2023

🟢 Alerts are green. The pending/firing alerts are related to either this PM https://github.com/giantswarm/giantswarm/issues/28252 or https://github.com/giantswarm/giantswarm/issues/27558
image

@QuentinBisson

@TheoBrigitte We're all done for now; this is blocked by the 2 issues linked in #2832 (comment).

I added them at the top as well

@QuentinBisson

@TheoBrigitte do we still need this issue, given that we ensured monitoring is working and issues exist for the other teams?

@QuentinBisson

All our issues are fixed and the rest has been handed over to the other teams. Closing.

@TheoBrigitte TheoBrigitte changed the title Evaluate monitoring on managed EKS WC Add monitoring support for managed EKS WC Nov 10, 2023