Routine Status Checks #30

softbobo · 2019-11-23T06:07:32Z

The bot should send a message to the admins at a regular interval, say every 12 hours, so they can check that everything is up and running.
The message could include more info, such as newly registered users, but we should focus on the routine check.

obitech · 2019-11-27T20:20:47Z

I'm against this. It leads to "alert spam", and people will just ignore the messages. It's better to let the bot emit metrics, e.g. with Prometheus.

softbobo · 2019-11-28T07:03:28Z

Yeah, metrics are fine too, and they offer more insight into where we might improve the project. I second this.

softbobo · 2019-12-01T09:49:34Z

I think Prometheus might be a bit of overkill for our little project. Maybe let's first collect ideas about what data we want the bot to emit regularly? Right now I can think of:

total number of users
users freshly subscribed
total number of interactions

Also, what would be a sensible interval for such messages?

pma-ableton · 2019-12-01T15:13:07Z

I'd like to have the option to see a list of all subscribers sorted by subscription date.

obitech · 2019-12-02T03:28:53Z

Let me explain my reasoning for Prometheus and against any type of push-based system in this case. Sorry for the wall of text ahead.

We should really first think about what we need this information for:

Operational monitoring: what's the health of my application?
"Business" metrics: How many people are using it (etc.)?

In any case, we need to instrument the code with some sort of library to actually extract the metrics. This is known as white-box monitoring: we extract the metrics directly from the running application. It usually requires some sort of async loop running alongside the main logic loop, plus a new web endpoint like /metrics where the information gets exposed. This is the Prometheus model.
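To make the model concrete, here is a rough standard-library sketch of a /metrics endpoint served from a background thread next to the bot's main loop. It is only an illustration of the idea, not the Prometheus client library, which would normally handle all of this. The metric names, `record_interaction`, and `start_metrics_server` are made up for the example, and a thread stands in for the async loop.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-process values that the bot updates while it runs.
METRICS = {"bot_users": 0, "bot_interactions_total": 0}
METRICS_LOCK = threading.Lock()


def record_interaction():
    """Called from the bot's main logic whenever a user interacts with it."""
    with METRICS_LOCK:
        METRICS["bot_interactions_total"] += 1


def render_metrics():
    """Render the values as plain 'name value' lines, one metric per line."""
    with METRICS_LOCK:
        return "".join(f"{name} {value}\n" for name, value in sorted(METRICS.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep scrape requests out of the bot's own log output


def start_metrics_server(port=0):
    """Serve /metrics from a daemon thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The bot would call `start_metrics_server()` once at startup and `record_interaction()` from its handlers; a monitoring system then pulls the current values from /metrics whenever it wants, instead of the bot pushing messages.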

"I think, prometheus might be a bit of an overkill for our little project."

I disagree:

Prometheus is very simple. It's just a single statically linked binary that we drop on the server next to the bot.
It doesn't require many resources; it can easily run on the host we have right now.
It has the biggest community and tooling available. It's easy to use the Prometheus client library to extract metrics from our application.
I have lots of experience using Prometheus; it's what I do at work every day.

Why I'm against a push-based system via messages to admins:

Alarm fatigue: sending those messages will essentially lead to a lot of spam, and people will start ignoring it after a while. Also, in 90% of those messages nothing will have changed. Imagine you wake up in the morning to 20 messages from the bot, which you now have to look through, only to realise that nothing changed overnight. You do this for three nights and then you just ignore them. Then people will ask us to stop sending all those messages, and we'll need a mechanism to turn them off again (but just for some people?), which will require more custom logic. This is a lot of complexity for very little return.

Also imagine we're pushing every 30 minutes: how can we tell that the bot hasn't been down from minute 10 to minute 25? And if we wake up in the morning expecting 10 messages but there are only 7, what do we do? Is this actionable information we can use?
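The blind spot can be shown with a few lines of arithmetic. This is a toy simulation only: the 30-minute push and the outage from minute 10 to minute 25 come from the example above, while the one-minute pull interval and the helper `checks_during_outage` are assumptions made up for the comparison.

```python
def checks_during_outage(check_minutes, outages):
    """Return the check times that fall inside any (start, end) outage window."""
    return [
        t for t in check_minutes
        if any(start <= t < end for start, end in outages)
    ]


# The bot is down from minute 10 to minute 25.
outages = [(10, 25)]

# Push model: the bot sends a message at minutes 0, 30 and 60.
push_times = list(range(0, 61, 30))

# Pull model (assumed interval): something external checks the bot every minute.
scrape_times = list(range(0, 61, 1))

missed_pushes = checks_during_outage(push_times, outages)
failed_scrapes = checks_during_outage(scrape_times, outages)

print("pushes that hit the outage:", missed_pushes)
print("scrapes that hit the outage:", len(failed_scrapes))
```

Every push still arrives on schedule, so the admins see nothing unusual, whereas the frequent pull-based checks fail for the whole window and make the downtime visible.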

This is what I would do:

Think of operational and business metrics we need to track. We should schedule a meeting for that to be honest.
Use the Prometheus Python Client to instrument the bot.
Deploy Prometheus alongside it.
Think of ways to visualise it; there are several options.
Maybe define alerts on the exposed metrics as well.

"I'd like to have the option to see a list of all subscribers sorted by subscription date."

This could be done with a simple database query. Alternatively, we could add a /stats command that gives admins a summary of what happened in the last 24h or so.
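As a rough idea of what such a query might look like, here is a sketch using sqlite3 with an in-memory database. The `subscribers` table, its columns, and the sample rows are all hypothetical placeholders; the bot's real schema and database may look quite different.

```python
import sqlite3

# Hypothetical schema and placeholder rows, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscribers (username TEXT PRIMARY KEY, subscribed_at TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?)",
    [
        ("alice", "2019-11-30T08:00:00Z"),
        ("bob", "2019-11-20T12:30:00Z"),
        ("carol", "2019-12-01T18:45:00Z"),
    ],
)


def subscribers_by_date(conn):
    """All subscribers, oldest subscription first."""
    return conn.execute(
        "SELECT username, subscribed_at FROM subscribers ORDER BY subscribed_at ASC"
    ).fetchall()


def new_subscribers_since(conn, cutoff):
    """Number of subscriptions at or after the cutoff timestamp."""
    # Timestamps in one fixed ISO 8601 format compare correctly as plain strings.
    row = conn.execute(
        "SELECT COUNT(*) FROM subscribers WHERE subscribed_at >= ?", (cutoff,)
    ).fetchone()
    return row[0]
```

A /stats command could call `new_subscribers_since` with a cutoff 24 hours in the past and send the result to the admin who asked for it, so the summary is only produced on request.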

obitech added this to the API milestone Dec 14, 2019
