Adjust AlertManagerAPI to avoid using multiple spaces in various attributes of alert #12237
base: master
Conversation
Force-pushed "…ibutes of alert and add UUID label for better traceability of WM alerts" from 23e268b to a621a28
Thanks Valentin, looks good to me!
Was this development also supposed to ensure that the right audience is receiving these alerts?
To me, that is the most important development that we needed. Otherwise we keep creating alerts that are meaningless to the group receiving them (or we have to bridge these alerts with Ops)
res = self.mgr.getdata(self.alertManagerUrl, params=params, headers=self.headers, verb='POST')
self.logger.info("ALERT: %s, status=%s", params, res)
This might not be a good idea, as some alerts might be very large (e.g., with a list of block names).
In addition, there is no way to control the log level of this class, so I am inclined to keep it out.
Alan, this is a valid concern, but to make things traceable (i.e., to match an alert with the WM logs) we need something present in the WM logs. I'm changing this to log the alert UUID, which will be sufficient for such a cross-check.
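For illustration, a minimal sketch of that idea, assuming the alert payload carries a `labels` dictionary to which a `uuid` entry can be attached; the payload structure, key names, and log format here are assumptions rather than the final change:

```python
import uuid

# Attach a unique identifier so the alert can be cross-checked against WM logs
# (assumed 'labels' structure in the alert payload).
alertUuid = str(uuid.uuid4())
params.setdefault('labels', {})['uuid'] = alertUuid

res = self.mgr.getdata(self.alertManagerUrl, params=params,
                       headers=self.headers, verb='POST')
# Log only the UUID and the AlertManager response, not the full alert payload,
# which can be very large (e.g., with a list of block names).
self.logger.info("ALERT uuid=%s, status=%s", alertUuid, res)
```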
At the WM meeting I was asked to look up and summarize all alerts we send in the WM codebase. Here is the summary. Alerts are sent from the following places:
@anpicci, @amaltaro please evaluate the aforementioned alerts and let me know the specific destination routes where they should be delivered.
Given that we are trying to cover all alerts, not only those routed through Prometheus, let me add two more for our discussion:
Alan, your post about WMAgent component restart and proxy expiration is unclear to me. According to grep, I don't see any sendAlert call in these cases, nor do I see any usage of amtool. To clarify: are you saying that we should have alerts in these instances, or that we do have alerts? If the latter, please let me know where those are triggered and how we send them.
It is an email alert, not routed through Prometheus, hence not using the WMCore library and/or amtool (also because amtool would not work off-site). Reference: https://github.com/dmwm/CMSKubernetes/blob/master/docker/pypi/wmagent/init.sh#L340-L342, and one of them is embedded in the script (component-restart).
Here is what I just added to the documentation in https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/73:

WM alerts
At the moment we have three categories of alerts:
And there is one more Grafana alert, set up by the SI team, which I don't know how to find. But here it is:
Thanks, I found its link in the dashboard, and updated the wmcore docs PR and my summary above.
I also clarified with the CMS Monitoring group about alerts on Grafana. Here is what I found:
Once we clarify which alert will be routed to which destination, we may create new notifications for alerts in Grafana or work with the CMSMonitoring/MONIT teams to add such routes.
Fixes #11911
Status
ready
Description
This PR provides the following set of changes:
- introduce a `normalize_spaces` function in AlertManagerAPI and use it for all alert attributes (see the sketch below)
- `ALERT` prefix when we send it over to the Prometheus/AM URL

Is it backward compatible (if not, which system it affects?)
YES
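For reference, a minimal sketch of what such a whitespace-normalization helper could look like; this is an assumption about its behavior, and the actual `normalize_spaces` implementation in AlertManagerAPI may differ:

```python
import re

def normalize_spaces(value):
    """Collapse any run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", str(value)).strip()

# Example: multi-space / multi-line alert attributes become single-spaced strings
print(normalize_spaces("Workflow   failed\n  in   agent"))  # -> "Workflow failed in agent"
```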
Related PRs
I modified the AlertManager configuration, where I performed the following changes:
- `wmagent-slack` receiver route to `dmwm-admins`
External dependencies / deployment changes