
Maintain datadog monitors via code #1673

Open · boolafish opened this issue Jul 28, 2020 · 6 comments

@boolafish (Contributor) commented Jul 28, 2020

Why

We have a growing number of alarms/metrics/monitors in Datadog, and maintaining them is currently a messy manual process. There are also two possible ways to trigger an alarm: directly from application code, or via a Datadog monitor.

Some alarms are triggered directly in application code, and in most cases we currently emit both the metric and the alarm from the application. It would be great if all alarms lived in Datadog and the application only needed to emit metrics.

But even without that simplification, using code as infrastructure to maintain our currently messy monitors might already be beneficial enough.

Note

One potential tool is Terraform, which appears to support Datadog as well: https://www.terraform.io/docs/providers/datadog/r/monitor.html
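
For illustration, a monitor managed this way might look like the following. This is a minimal sketch based on the linked provider docs; the resource name, metric, scope, threshold, and @-handle are all made up, not taken from our actual setup:

```hcl
# Minimal sketch: a Datadog monitor defined as code via the Terraform
# datadog provider. Every name, metric, threshold, and @-handle below is
# illustrative only.
resource "datadog_monitor" "watcher_high_cpu" {
  name    = "[watcher] High CPU usage"
  type    = "metric alert"
  message = "Average CPU usage has been above 80% for 5 minutes. @pagerduty"

  # Datadog metric query: average system.cpu.user per host over the last 5 minutes.
  query = "avg(last_5m):avg:system.cpu.user{service:watcher} by {host} > 80"

  tags = ["managed-by:terraform"]
}
```

With this approach, reviewing a monitor change becomes a normal pull request instead of a manual edit in the Datadog UI.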

@InoMurko (Contributor)

Just a note: you're using the word alarm, which has two definitions:

1. An application alarm that allows the application to be reactive.
2. A metric alarm.

@boolafish (Contributor, Author) commented Jul 28, 2020

My intention is the metric alarm, i.e. the kind that notifies an engineer to react. I think Datadog uses the word monitor; I'm just used to calling it an alarm 😅

@unnawut (Contributor) commented Aug 6, 2020

Just to expand on Ino's comment a bit: the in-app alarms right now are mostly the ones that signal the service is not in good shape to accept new transactions, e.g. :boot_in_progress, :ethereum_connection_error, :ethereum_stalled_sync, :invalid_fee_source, :main_supervisor_halted. When these are raised, new transactions are rejected immediately, so I think they'll have to remain in the application. The rest, which need a human's help, are already in Datadog.

The only one that seems out of place is :system_memory_too_high, which is being removed in #1678 (comment)

@boolafish (Contributor, Author)

If we want to simplify and standardize everything, we could send those :boot_in_progress-style events to Datadog as metrics as well. In the app, the original logic reacting to them would stay, but instead of raising the alarm directly it would emit a metric. Then we'd have a single maintained list of monitors/alarms where we can see everything we have.

But... it might not bring much benefit, since that code is already in place and working fine; it's just somewhat of a burden to see what all of our monitors are.

@unnawut (Contributor) commented Aug 7, 2020

Ah I see. I think we can send an increment metric for each of the events above; then we can start to see their frequency patterns. The alarms would also be treated more properly by setting an alarm threshold of > 0 for each metric. Right now they're reported as, sort of, discontinuous events.

But I concur that it's not high priority given the other things we have at hand at the moment.
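
For illustration, a hedged Terraform sketch of that pattern, assuming the application emits one increment metric per event (the app.alarm.* metric namespace and the @-handle are made up):

```hcl
# One "> 0" monitor per in-app alarm event, assuming the application emits an
# increment metric named after each event. All metric names are illustrative.
locals {
  alarm_events = [
    "boot_in_progress",
    "ethereum_connection_error",
    "ethereum_stalled_sync",
    "invalid_fee_source",
    "main_supervisor_halted",
  ]
}

resource "datadog_monitor" "in_app_alarm" {
  for_each = toset(local.alarm_events)

  name    = "In-app alarm raised: ${each.key}"
  type    = "metric alert"
  message = "The application raised ${each.key} within the last 5 minutes. @pagerduty"

  # Fires whenever the event counter is non-zero in a 5-minute window.
  query = "sum(last_5m):sum:app.alarm.${each.key}{*}.as_count() > 0"
}
```

Defining them with for_each over a single list would also give us the "one place to see all monitors" mentioned above.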

@boolafish (Contributor, Author) commented Aug 7, 2020

lol, my personal feeling is that nobody will take this up at all hahaha, unless the gold team board gets cleaner 😅

But can we pick this up in new services, like chch v2, as the standard approach for new services?! @InoMurko @achiurizo
