Support automated site monitoring/health checks #343

Open

twinkarma opened this issue Apr 17, 2023 · 1 comment

twinkarma (Collaborator) commented Apr 17, 2023

Ian is planning to put in a service that monitors Teamware for liveness and readiness. Further discussion is needed on what we need to add to Teamware to support this.

  • The liveness test should not require any modification, as it simply checks whether index.html is served.
  • The readiness test should check that the database (and what else?) is working, and return a 5xx error if not.
    • It might be useful to provide a JSON response if there are any stats worth reporting; these could be logged as part of monitoring (see the sketch after this list).
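
A rough sketch of what such a readiness view could look like in the Django backend; the view name and stat fields below are invented for illustration, not an agreed design:

```python
# Sketch only: the view name and stat fields are illustrative placeholders.
import time

from django.db import connection
from django.db.utils import OperationalError
from django.http import JsonResponse


def readiness(request):
    """Return 200 with a few stats while the database is reachable, 503 otherwise."""
    started = time.monotonic()
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")  # cheap round-trip to the default database
    except OperationalError as exc:
        return JsonResponse({"status": "error", "database": str(exc)}, status=503)
    return JsonResponse(
        {
            "status": "ok",
            "database": "ok",
            # Example of a stat a monitoring system could log over time.
            "db_latency_ms": round((time.monotonic() - started) * 1000, 1),
        }
    )
```
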
ianroberts (Member) commented Apr 17, 2023

To clarify - when running under Kubernetes there are two types of health check probe configured in the helm chart for the Django backend container:

  • "liveness" probe which determines whether the container is running. Repeated failures of the liveness probe cause the container to be killed and restarted - the assumption is that these problems are terminal and we need to reset the container to a known good state
  • "readiness" probe which determines whether this container is able to serve incoming HTTP requests. Failures of the readiness probe cause the pod to be removed from the set of active endpoints but (importantly) do not cause the container to be restarted - the assumption is that these are transient problems that can be solved by just waiting

At present both probes check the same thing: whether a GET of the root path returns a successful response code (between 200 and 399). This makes sense for the liveness check, but for readiness we should add a separate endpoint that checks that the backend can query the database. If the database is down the backend cannot serve requests, but it should not be killed; it simply waits until the database comes back. To be compatible with the k8s probe system this should be a simple GET endpoint that indicates success with a 200 response code and failure with a 5xx.
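
A minimal sketch of such an endpoint, assuming a hypothetical /ready path and view name (neither is an agreed design):

```python
# Sketch only: the /ready path and the view name are placeholders.
from django.db import connection
from django.http import HttpResponse
from django.urls import path


def ready(request):
    """Readiness probe target: 200 if the DB answers a trivial query, 503 if not."""
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        # Database unreachable: a 503 takes the pod out of the endpoint set
        # without triggering a restart, matching the "just wait" semantics above.
        return HttpResponse(status=503)
    return HttpResponse("ok")


urlpatterns = [
    path("ready", ready),
]
```

The readinessProbe in the helm chart would then presumably point its httpGet at this path, while the liveness probe keeps hitting the root path.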
