Add monitoring support for managed EKS WC #2832

TheoBrigitte · 2023-09-19T10:36:14Z

QuentinBisson · 2023-09-20T10:49:15Z

I started by upgrading the observability-bundle in default-apps-eks: giantswarm/default-apps-eks#8.

I now have a cluster running on grizzly.

QuentinBisson · 2023-09-20T12:27:11Z

Monitoring:
🔴 Prometheus is currently unable to list or watch pods and endpoints:

ts=2023-09-20T10:52:22.949Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/namespaces/kube-system/pods?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:23.772Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/namespaces/kube-system/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:23.772Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/namespaces/kube-system/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.107Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Pod: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.107Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.395Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
ts=2023-09-20T10:52:24.395Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2.eks.amazonaws.com:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp: lookup https on 172.31.0.10:53: no such host"
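
The failing requests above all target `https://https/...`, so the client tries to resolve a host literally named `https`. One plausible cause (an assumption, not confirmed in this thread) is that the EKS endpoint already carries a scheme and a second one gets prepended. The Python sketch below shows how the logged URL parses, plus a hypothetical helper `api_server_host` that accepts the endpoint with or without a scheme. It is an illustration only, not the code from prometheus-meta-operator, and the `EXAMPLE` hostname is made up.

```python
from urllib.parse import urlsplit

# URL copied from the Prometheus log lines above.
LOGGED_URL = (
    "https://https/F5AB99CD4973C48F8F80414C4693ED04.gr7.eu-west-2"
    ".eks.amazonaws.com:443/api/v1/pods?limit=500&resourceVersion=0"
)


def api_server_host(endpoint: str, default_port: int = 443) -> str:
    """Return "host:port" for an endpoint given with or without a scheme.

    Hypothetical helper for illustration only.
    """
    # Prefix "//" so urlsplit treats a bare "host:port" as a network location.
    parts = urlsplit(endpoint if "://" in endpoint else "//" + endpoint)
    if not parts.hostname:
        raise ValueError(f"no host found in {endpoint!r}")
    # urlsplit lower-cases the hostname; DNS names are case-insensitive.
    return f"{parts.hostname}:{parts.port or default_port}"


# The host part of the logged URL is the literal string "https", which is
# what the DNS lookup in the error message fails on.
logged_host = urlsplit(LOGGED_URL).hostname

if __name__ == "__main__":
    print(logged_host)
    print(api_server_host("https://EXAMPLE.gr7.eu-west-2.eks.amazonaws.com"))
    print(api_server_host("EXAMPLE.gr7.eu-west-2.eks.amazonaws.com:443"))
```

Normalising to `host:port` before adding a scheme means the caller never has to know which form the cluster object exposes.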

🟡 Scraping targets:
The list of targets is currently empty, most likely because of the issue above.

🟢 Alerting:

🟡 Alert status in Prometheus
A few alerts are pending and still need to be investigated.

🟢 Grafana:

🟢 Grafana cloud:

QuentinBisson · 2023-09-20T13:36:32Z

Status Update

Monitoring:
🟢 Prometheus fixed with giantswarm/prometheus-meta-operator#1375

🔴 Scraping targets:

WC Prom:
Agent:
- Fixing the net-exporter network policy: "Fix scraping of net-exporter on Managed EKS clusters" (net-exporter#299)

🟡 Alerting
A few alerts are pending and still need to be investigated.

QuentinBisson · 2023-09-20T13:57:08Z

Related: giantswarm/default-apps-eks#10

QuentinBisson · 2023-09-21T08:59:26Z

Status Update

🟠 Scraping targets:
Prometheus on the MC cannot scrape cert-exporter, cert-manager-app and chart-operator because the api server proxy is either not configured properly on Managed EKS or not working at all. Fixing this would require a deep investigation from @giantswarm/team-phoenix.
As we are striving to move to service monitors, it would definitely be easier for everyone if we moved the failing apps to use service monitors instead:

- cert-manager-app: https://github.com/giantswarm/giantswarm/issues/27557. This appears to be solved in 3.3.0, but cert-manager is stuck in pending upgrade on EKS managed clusters. Related PM: https://github.com/giantswarm/giantswarm/issues/28250
- cert-exporter-app: https://github.com/giantswarm/giantswarm/issues/27576
- chart-operator: https://github.com/giantswarm/giantswarm/issues/27558
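
For context, scraping a pod through the Kubernetes API server proxy means requesting its metrics at a path under the pod's `proxy` subresource instead of talking to the pod directly, which is why a broken proxy on Managed EKS affects these targets. A minimal sketch of that path shape follows. The helper name, pod name, and port are hypothetical, and whether Prometheus on the MC builds its targets exactly this way is an assumption.

```python
def pod_proxy_metrics_path(
    namespace: str, pod: str, port: int, metrics_path: str = "/metrics"
) -> str:
    """Build the API server proxy path used to reach a pod's metrics endpoint.

    The layout follows the Kubernetes pod "proxy" subresource:
    /api/v1/namespaces/<namespace>/pods/<pod>:<port>/proxy/<path>
    """
    if not metrics_path.startswith("/"):
        metrics_path = "/" + metrics_path
    return f"/api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy{metrics_path}"


if __name__ == "__main__":
    # Hypothetical pod name and port, used only to show the resulting path.
    print(pod_proxy_metrics_path("kube-system", "cert-exporter-abc12", 9005))
```

With a service monitor, the in-cluster Prometheus agent scrapes the service endpoints directly, so this proxy hop is not involved.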

🔴 Prometheus-operator and Kube-state-metrics are unstable
The pods are recreated every minute even though they are in a Running state. We need to dig deeper into this.

🔴 Alerting

The current list of alerts is long, but only 3 of them are relevant because most are caused by the scraping errors above:

- WorkloadClusterAPIServerAdmissionWebhookErrors: We need to investigate what is happening.
- WorkloadClusterEtcdMetricsMissing: We do not have any etcd metrics because we cannot monitor etcd on managed clusters. We need to figure out how to skip this alert for EKS clusters.
- PrometheusAgentShardsMissing: Not sure why it pages.
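
One way to think about skipping WorkloadClusterEtcdMetricsMissing on EKS is a per-provider exclusion. The sketch below is only an illustration in Python: the `provider` label, its `eks` value, and the `should_evaluate` helper are assumptions, and the real change would live in the alerting rules rather than in code like this.

```python
# Alerts that only make sense when etcd can be scraped by us.
ETCD_ONLY_ALERTS = frozenset({"WorkloadClusterEtcdMetricsMissing"})

# Providers with a managed control plane, where etcd is not reachable.
# The "eks" value is an assumed label value, not taken from the rules.
MANAGED_CONTROL_PLANE_PROVIDERS = frozenset({"eks"})


def should_evaluate(alert_name: str, cluster_labels: dict) -> bool:
    """Return False for etcd-only alerts on managed control planes."""
    provider = cluster_labels.get("provider", "")
    if alert_name in ETCD_ONLY_ALERTS and provider in MANAGED_CONTROL_PLANE_PROVIDERS:
        return False
    return True


if __name__ == "__main__":
    print(should_evaluate("WorkloadClusterEtcdMetricsMissing", {"provider": "eks"}))
    print(should_evaluate("WorkloadClusterEtcdMetricsMissing", {"provider": "capa"}))
```

Keeping the exclusion keyed on the provider rather than on individual cluster names avoids having to update it for every new EKS cluster.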

QuentinBisson · 2023-09-21T10:40:49Z

Status Update

🟢 Prometheus operator and kube-state-metrics keep getting evicted by VPA because the recommender is not able to get metrics from metrics server. Relevant PM: https://github.com/giantswarm/giantswarm/issues/28252

🔴 Alerting
Only WorkloadClusterEtcdMetricsMissing is left, so I will create an issue to handle that: #2843

QuentinBisson · 2023-09-28T11:47:24Z

What should we do about metrics that rely on etcd_kubernetes_resources_count?

T-Kukawka · 2023-09-28T11:48:28Z

> What should we do about metrics that rely on etcd_kubernetes_resources_count?

They will have to go away, as we no longer manage etcd in this scenario @QuentinBisson

QuentinBisson · 2023-09-28T11:49:54Z

I know @T-Kukawka but I'm not sure how those could be replaced.

https://github.com/search?q=repo%3Agiantswarm%2Fprometheus-rules%20etcd_kubernetes_resources_count&type=code

T-Kukawka · 2023-09-28T11:52:35Z

Ah, I see :( Yeah, then we have to align with BigMac and Shield on how else this could be monitored, or whether it is even relevant.

QuentinBisson · 2023-09-28T11:54:11Z

True, but if we can replace those 2 alerts with something else, we can probably remove the etcd-kubernetes-resources-count-exporter component altogether, because this seems to be its only use case.

T-Kukawka · 2023-09-28T11:55:27Z

True, I believe it should be removed for EKS at least (we still use it in CAPI and Vintage, especially for incident monitoring when etcd is overflowing).

QuentinBisson · 2023-09-28T11:56:52Z

oh sure :)

QuentinBisson · 2023-09-28T13:09:13Z

🟢 Scraping targets:

Only chart operator is missing (no service monitors) but it is being addressed by honeybadger this week: https://github.com/giantswarm/giantswarm/issues/27558.

Once it makes it into default-apps-eks, we can release and have everything running.

🟠 Alerting is in testing:

🟢 Created issues to get rid of etcd

QuentinBisson · 2023-10-02T07:54:37Z

🟢 Alerts are green. The pending/firing alerts are related to either this PM https://github.com/giantswarm/giantswarm/issues/28252 or https://github.com/giantswarm/giantswarm/issues/27558

QuentinBisson · 2023-10-02T07:56:19Z

@TheoBrigitte We're all done for now. This is blocked by the 2 issues linked in #2832 (comment).

I added them at the top as well

QuentinBisson · 2023-10-04T15:04:27Z

@TheoBrigitte do we still need this issue, given that we ensured monitoring is working and issues exist for the other teams?

QuentinBisson · 2023-10-10T13:43:02Z

All our issues are fixed and the rest is distributed to other teams. Closing.

TheoBrigitte mentioned this issue Sep 19, 2023

Test Atlas components on GS EKS based CAPA solution #2817

Closed

TheoBrigitte added team/atlas Team Atlas topic/alert topic/monitoring provider/eks Planning labels Sep 19, 2023

TheoBrigitte changed the title ~~Evaluate monitoring on EKS based CAPA WC~~ Evaluate monitoring on managed EKS WC Sep 19, 2023

QuentinBisson self-assigned this Sep 19, 2023

QuentinBisson mentioned this issue Sep 20, 2023

Fix API Server url for Managed EKS giantswarm/prometheus-meta-operator#1375

Merged

3 tasks

This was referenced Sep 21, 2023

Support per workload cluster alerting rules #2843

Closed

Add kube-proxy to the list of ignored targets giantswarm/prometheus-meta-operator#1379

Merged

QuentinBisson added the blocked label Sep 21, 2023

QuentinBisson closed this as completed Sep 28, 2023

QuentinBisson reopened this Sep 28, 2023

TheoBrigitte removed the Planning label Oct 3, 2023

QuentinBisson closed this as completed Oct 10, 2023

TheoBrigitte removed the blocked label Nov 10, 2023

TheoBrigitte changed the title ~~Evaluate monitoring on managed EKS WC~~ Add monitoring support for managed EKS WC Nov 10, 2023
