Today, as far as I know, if script_exporter has a problem and, for whatever reason, reports that many or even all e2e tests against experiments are failing, Locate will remove those services from the pool it selects from when answering client queries. This has happened to us before: script_exporter marked all ndt services as down when they were not actually down, causing a global outage because mlab-ns thought that none of the platform was healthy. To avoid this, we implemented a safety net in mlab-ns: if monitoring says that more than 25% of the platform is down, mlab-ns ignores the monitoring data and continues to use its cached service status data:
https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L371
https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L498
Since problems with script_exporter or GCP networking have happened before, I believe we should implement a similar safety check in Locate: when script_exporter reports more than a certain percentage of the platform as down, Locate ignores the signal from script_exporter until the percentage drops back below the threshold. With Locate this should be even safer than with mlab-ns, since Locate still has the heartbeat signal to rely on, whereas mlab-ns had/has only the monitoring signals.
Perhaps Locate could use logic like: if script_exporter thinks that more than 25% of all services are down and heartbeat says they are healthy, then ignore the script_exporter monitoring data?
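To make the proposal concrete, here is a rough sketch of that check. It is written in Python only for illustration and is not Locate's actual code: `ServiceStatus`, `monitoring_is_trustworthy`, `selectable_services` and `MAX_DOWN_FRACTION` are all hypothetical names. It also assumes one particular reading of the rule, namely that we count services that script_exporter reports as down while heartbeat reports them healthy, as a fraction of all services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceStatus:
    """Health signals for one service (hypothetical structure, not Locate's)."""
    name: str
    heartbeat_healthy: bool  # signal reported by the heartbeat
    monitoring_up: bool      # signal reported by script_exporter


# Assumed threshold, borrowed from the 25% figure used in mlab-ns.
MAX_DOWN_FRACTION = 0.25


def monitoring_is_trustworthy(services, max_down_fraction=MAX_DOWN_FRACTION):
    """Return False when monitoring contradicts heartbeat for too many services."""
    if not services:
        return True
    # Services that monitoring says are down while heartbeat says are healthy.
    suspicious = sum(
        1 for s in services if s.heartbeat_healthy and not s.monitoring_up
    )
    return suspicious / len(services) <= max_down_fraction


def selectable_services(services, max_down_fraction=MAX_DOWN_FRACTION):
    """Return the names of services that may be offered to clients."""
    trust = monitoring_is_trustworthy(services, max_down_fraction)
    # Heartbeat is always required; monitoring is only required when trusted.
    return [
        s.name
        for s in services
        if s.heartbeat_healthy and (s.monitoring_up or not trust)
    ]
```

With this logic, a single service failing its e2e test is still removed from selection as it is today, but a platform-wide "everything is down" report from script_exporter is ignored as long as heartbeat disagrees. Whether the threshold should be 25% or something else for Locate is an open question.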