Ignore Prometheus metrics for service status when too many are flagged as down #127

nkinkade · 2023-03-21T17:12:38Z

Today, as far as I know, if something goes wrong with script_exporter and it reports that many, or even all, e2e tests to experiments are failing, Locate will remove those services from possible selection in client queries. This has happened to us before: script_exporter marked all ndt services as down when they were not, causing a global outage, since mlab-ns concluded that none of the platform was healthy. To avoid this, we implemented a safety net in mlab-ns: if monitoring says that more than 25% of the platform is down, mlab-ns ignores the monitoring data and continues to use its cached service status data:

https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L371
https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L498

Since problems with script_exporter or GCP networking have happened before, I believe we should implement a similar safety check in Locate: when script_exporter reports more than a certain percentage of the platform as down, Locate ignores the signal from script_exporter until the percentage reported as down drops back below the threshold. With Locate this should be even safer than with mlab-ns, since Locate still has the heartbeat signal to rely on, whereas mlab-ns had/has only the monitoring signals.

Perhaps Locate could use logic like: if script_exporter thinks that more than 25% of all services are down and heartbeat says they are healthy, then ignore the script_exporter monitoring data?
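
To make the idea concrete, here is a rough sketch of that check. It is not Locate's actual code: the function names, the dict-based inputs, and the reading of "heartbeat says they are healthy" as "count only services that monitoring marks down while heartbeat marks healthy" are all assumptions for illustration. The 25% default comes from the figure above.

```python
from typing import Dict

# Assumed default, taken from the 25% figure discussed in this issue.
DOWN_THRESHOLD = 0.25


def should_ignore_monitoring(
    monitoring_up: Dict[str, bool],
    heartbeat_healthy: Dict[str, bool],
    threshold: float = DOWN_THRESHOLD,
) -> bool:
    """Return True when the monitoring signal looks untrustworthy.

    Hypothetical helper: counts services that monitoring reports as down
    while heartbeat reports them healthy, and compares that share of all
    monitored services against the threshold.
    """
    if not monitoring_up:
        return False
    disputed = sum(
        1
        for service, up in monitoring_up.items()
        if not up and heartbeat_healthy.get(service, False)
    )
    return disputed / len(monitoring_up) > threshold


def effective_health(
    monitoring_up: Dict[str, bool],
    heartbeat_healthy: Dict[str, bool],
    threshold: float = DOWN_THRESHOLD,
) -> Dict[str, bool]:
    """Combine both signals, dropping monitoring when it is being ignored."""
    ignore = should_ignore_monitoring(monitoring_up, heartbeat_healthy, threshold)
    return {
        # A service always needs a healthy heartbeat; monitoring only
        # counts against it when the monitoring signal is trusted.
        service: heartbeat_healthy.get(service, False) and (ignore or up)
        for service, up in monitoring_up.items()
    }
```

With this shape, a single service failing its e2e test is still removed from selection, while a mass failure that heartbeat contradicts falls back to heartbeat alone. Whether the comparison should be strict, and whether the threshold should be configurable, are open questions.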

nkinkade assigned cristinaleonr Mar 21, 2023
