Routine Status Checks #30

Open
softbobo opened this issue Nov 23, 2019 · 5 comments

@softbobo
Contributor

  • The bot should send the admins a message confirming that everything is up and running, say, every 12 hours.
  • The message could include more info, such as newly registered users, but we should focus on the routine check.
@obitech
Member

obitech commented Nov 27, 2019

I'm against this. It leads to "alert spam", and people will just ignore the messages. It's better to let the bot emit metrics, e.g. using Prometheus.

@softbobo
Contributor Author

Yeah, metrics are fine too, and they offer more insight into where we might improve the project. I second this.

@softbobo
Contributor Author

softbobo commented Dec 1, 2019

I think Prometheus might be a bit of overkill for our little project. Maybe let's first collect ideas about what data we want the bot to emit regularly? Right now I can think of:

  • total number of users
  • newly subscribed users
  • total number of interactions

Also, what would be a sensible interval for such messages?

@pma-ableton

I'd like to have the option to see a list of all subscribers sorted by subscription date.

@obitech
Member

obitech commented Dec 2, 2019

Let me explain my reasoning for Prometheus and against any type of push-based system in this case. Sorry for the wall of text ahead.

We should first think about what we need this information for:

  • Operational monitoring: what's the health of my application?
  • "Business" metrics: How many people are using it (etc.)?

In any case, we need to instrument the code with some sort of library to actually extract the metrics. This is known as white-box monitoring: we extract the metrics directly from the running application. This usually requires some sort of async loop running alongside the main logic loop, plus a new web endpoint like /metrics where the information gets exposed. This is the Prometheus model.
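
To illustrate the model, here is a minimal stdlib-only sketch of a /metrics endpoint served from a background thread next to the main bot loop. It is not the Prometheus client library, which would do this work for us; the metric names (`bot_users`, `bot_interactions_total`) and helper functions are hypothetical and only loosely based on the ideas collected above.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-process metrics; names are illustrative assumptions.
METRICS = {"bot_users": 0, "bot_interactions_total": 0}
METRICS_LOCK = threading.Lock()


def record_interaction(new_user=False):
    """Called from the bot's main logic whenever a user interacts."""
    with METRICS_LOCK:
        METRICS["bot_interactions_total"] += 1
        if new_user:
            METRICS["bot_users"] += 1


def render_metrics():
    """Render the metrics as 'name value' lines (Prometheus text format)."""
    with METRICS_LOCK:
        lines = [f"{name} {value}" for name, value in sorted(METRICS.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_metrics_server(port=0):
    """Serve /metrics from a daemon thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The bot would call `start_metrics_server()` once at startup and `record_interaction()` from its handlers. Nothing is pushed anywhere: whoever wants the numbers fetches /metrics when they need them.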

> I think Prometheus might be a bit of overkill for our little project.

I disagree:

  • Prometheus is very simple. It's just a single statically linked binary that we drop on the server next to the bot.
  • It doesn't require many resources; it can easily run on the host we have right now.
  • It has the biggest community and the most tooling available. It's easy to use the Prometheus client library to extract metrics from our application.
  • I have lots of experience with Prometheus; it's what I do at work every day.

Why I'm against a push-based system via messages to admins:

Alarm fatigue: sending those messages will essentially produce a lot of spam, and people will start ignoring them after a while. In 90% of those messages nothing will have changed. Imagine you wake up in the morning to 20 messages from the bot, which you now have to look through one by one, only to realise that nothing changed overnight. You do this for three nights and then you just ignore them. Then people will ask us to stop sending all those messages, and we will need a mechanism to turn them off again (but just for some people?), which requires more custom logic. This is a lot of complexity for very little return.

Also, imagine we're pushing every 30 minutes: how can we tell that the bot wasn't down from minute 10 to minute 25? And if we wake up in the morning expecting 10 messages but there are only 7, what do we do? Is this actionable information we can use?
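
To make the 30-minute example concrete, here is a small sketch that compares the two approaches over one hour. The 1-minute scrape interval is an assumption for illustration, not a recommendation, and both helper functions are hypothetical.

```python
def pushes_sent(push_interval, outage, horizon):
    """Minutes at which the bot is up and therefore sends its scheduled push."""
    start, end = outage
    return [t for t in range(0, horizon + 1, push_interval)
            if not (start <= t < end)]


def failed_scrapes(scrape_interval, outage, horizon):
    """Minutes at which a pull-based scrape finds the bot unreachable."""
    start, end = outage
    return [t for t in range(0, horizon + 1, scrape_interval)
            if start <= t < end]


outage = (10, 25)  # bot is down from minute 10 up to (not including) minute 25

# Push every 30 minutes: every scheduled message still arrives,
# so the admins cannot see the outage at all.
print(pushes_sent(30, outage, 60))

# Pull every minute (assumed interval): each scrape during the outage fails,
# so the gap is recorded on the monitoring side.
print(len(failed_scrapes(1, outage, 60)))
```

With a pull model the monitoring system notices the failed scrapes itself. Prometheus, for instance, records an `up` series per target that is 0 whenever a scrape fails, which is something we could alert on.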

This is what I would do:

  1. Think of the operational and business metrics we need to track. To be honest, we should schedule a meeting for that.
  2. Use the Prometheus Python Client to instrument the bot.
  3. Deploy Prometheus alongside it.
  4. Think of ways to visualise the metrics; there are several options.
  5. Maybe define alerts on the exposed metrics as well.

> I'd like to have the option to see a list of all subscribers sorted by subscription date.

This could be done with a simple database query. Alternatively we can think of a /stats command which gives admins a summary of what happened in the last 24h or so.
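
A rough sketch of both ideas, assuming an SQLite database with a hypothetical `subscribers` table that stores ISO-8601 UTC timestamps. The table, column, and function names are made up for illustration; the bot's actual storage may look quite different.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def create_schema(conn):
    # Hypothetical schema; subscribed_at holds ISO-8601 UTC strings,
    # which sort chronologically when compared as text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subscribers ("
        "username TEXT PRIMARY KEY, subscribed_at TEXT NOT NULL)"
    )


def list_subscribers(conn):
    """All subscribers, oldest subscription first."""
    return conn.execute(
        "SELECT username, subscribed_at FROM subscribers "
        "ORDER BY subscribed_at ASC"
    ).fetchall()


def stats_last_24h(conn, now):
    """Summary an admin-only /stats command could return."""
    since = (now - timedelta(hours=24)).isoformat()
    (total,) = conn.execute("SELECT COUNT(*) FROM subscribers").fetchone()
    (new,) = conn.execute(
        "SELECT COUNT(*) FROM subscribers WHERE subscribed_at >= ?", (since,)
    ).fetchone()
    return {"total_subscribers": total, "new_subscribers_24h": new}
```

A /stats handler would call `stats_last_24h(conn, datetime.now(timezone.utc))` and format the result as a reply, so admins get the numbers on request instead of on a schedule.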

obitech added this to the API milestone Dec 14, 2019