Support automated site monitoring/health checks #343

Open

twinkarma opened this issue Apr 17, 2023 · 1 comment

twinkarma (Collaborator) commented Apr 17, 2023

Ian is planning to put in a service that monitors Teamware for liveness and readiness. Further discussion is needed on what we need to add to Teamware to support this.

  • The liveness test should not require any modification, as it simply checks whether index.html is served.
  • The readiness test should check that the database (and what else?) is working, and return a 5xx error if not.
    • It might be useful to provide a JSON response if there are any stats worth reporting; these could be logged as part of monitoring (see the sketch after this list).
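
A rough sketch of what such a readiness view could look like in the Django backend; the view name and stat fields below are invented for illustration, not an agreed design:

```python
# Sketch only: the view name and stat fields are illustrative placeholders.
import time

from django.db import connection
from django.db.utils import OperationalError
from django.http import JsonResponse


def readiness(request):
    """Return 200 with a few stats while the database is reachable, 503 otherwise."""
    started = time.monotonic()
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")  # cheap round-trip to the default database
    except OperationalError as exc:
        return JsonResponse({"status": "error", "database": str(exc)}, status=503)
    return JsonResponse(
        {
            "status": "ok",
            "database": "ok",
            # Example of a stat a monitoring system could log over time.
            "db_latency_ms": round((time.monotonic() - started) * 1000, 1),
        }
    )
```
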
ianroberts (Member) commented Apr 17, 2023

To clarify - when running under Kubernetes there are two types of health check probe configured in the helm chart for the Django backend container:

  • "liveness" probe which determines whether the container is running. Repeated failures of the liveness probe cause the container to be killed and restarted - the assumption is that these problems are terminal and we need to reset the container to a known good state
  • "readiness" probe which determines whether this container is able to serve incoming HTTP requests. Failures of the readiness probe cause the pod to be removed from the set of active endpoints but (importantly) do not cause the container to be restarted - the assumption is that these are transient problems that can be solved by just waiting

At present both probes check the same thing: whether a GET of the root path returns a successful response code (between 200 and 399). This makes sense for the liveness check, but for readiness we should add a separate endpoint that checks that the backend can query the database. If the database is down the backend cannot serve requests, but it should not be killed; it simply waits until the database comes back. To be compatible with the k8s probe system this should be a simple GET endpoint that indicates success with a 200 response code and failure with a 5xx.
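
A minimal sketch of such an endpoint, assuming a hypothetical /ready path and view name (neither is an agreed design):

```python
# Sketch only: the /ready path and the view name are placeholders.
from django.db import connection
from django.http import HttpResponse
from django.urls import path


def ready(request):
    """Readiness probe target: 200 if the DB answers a trivial query, 503 if not."""
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        # Database unreachable: a 503 takes the pod out of the endpoint set
        # without triggering a restart, matching the "just wait" semantics above.
        return HttpResponse(status=503)
    return HttpResponse("ok")


urlpatterns = [
    path("ready", ready),
]
```

The readinessProbe in the helm chart would then presumably point its httpGet at this path, while the liveness probe keeps hitting the root path.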
