Adjust AlertManagerAPI to avoid using multiple spaces in various attributes of alert #12237

vkuznet · 2025-01-21T16:35:29Z

Fixes #11911

Status

ready

Description

This PR provides the following set of changes:

Add a new normalize_spaces function to AlertManagerAPI and use it for all alert attributes
- this collapses any run of multiple consecutive space characters in an attribute value
Add a UUID attribute to each alert to make it traceable in WM logs
Add a logger printout of the alert (including its UUID) with an ALERT prefix when we send it to the Prometheus/AM URL
- this message will then show up in the WM logs, making the alert traceable via its UUID
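
For illustration, the normalization and UUID tagging described above could look roughly like the sketch below. This is not the code from the PR: the `build_alert` helper, the label and annotation names, and the choice to collapse only literal space characters (rather than all whitespace) are assumptions made for the example.

```python
import re
import uuid


def normalize_spaces(text):
    """Collapse runs of two or more spaces into one and trim both ends."""
    return re.sub(r" {2,}", " ", text).strip()


def build_alert(alert_name, severity, summary, description):
    """Hypothetical helper: build an alert payload with normalized text
    attributes and a uuid label that can later be searched for in logs."""
    return {
        "labels": {
            "alertname": normalize_spaces(alert_name),
            "severity": normalize_spaces(severity),
            "uuid": str(uuid.uuid4()),  # assumed label name
        },
        "annotations": {
            "summary": normalize_spaces(summary),
            "description": normalize_spaces(description),
        },
    }
```

With this sketch, `normalize_spaces("data   tier  not found")` returns `"data tier not found"`, and every payload carries a fresh UUID string.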

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

I modified the AlertManager configuration with the following changes:

https://gitlab.cern.ch/cmsmonitoring/cmsmon-configs/-/merge_requests/69
- remove personal emails from some routes and use either an e-group email or another route instead, e.g. change the wmagent-slack receiver route to dmwm-admins
https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/73
- provides additional documentation

External dependencies / deployment changes

dmwm-bot · 2025-01-21T16:46:21Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 6 comments to review
Pycodestyle check: succeeded
- 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/295/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2025-01-21T19:38:32Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 changes in unstable tests
Python3 Pylint check: succeeded
- 5 warnings
- 60 comments to review
Pycodestyle check: succeeded
- 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/296/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2025-01-21T19:42:29Z

Jenkins results:

Python3 Unit tests: succeeded
- 3 changes in unstable tests
Python3 Pylint check: succeeded
- 5 warnings
- 60 comments to review
Pycodestyle check: succeeded
- 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/297/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2025-01-22T15:21:49Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 5 warnings
- 60 comments to review
Pycodestyle check: succeeded
- 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/298/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2025-01-22T15:46:41Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 5 warnings
- 60 comments to review
Pycodestyle check: succeeded
- 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/299/artifact/artifacts/PullRequestReport.html

mapellidario

Thanks Valentin, looks good to me!

amaltaro

Was this development also supposed to ensure that the right audience is receiving these alerts?
To me, that is the most important development that we needed. Otherwise we keep creating alerts that are meaningless to the group receiving them (or we have to bridge these alerts with Ops)

src/python/WMCore/Services/AlertManager/AlertManagerAPI.py

amaltaro · 2025-01-23T20:00:36Z

        res = self.mgr.getdata(self.alertManagerUrl, params=params, headers=self.headers, verb='POST')
+        self.logger.info("ALERT: %s, status=%s", params, res)


This might not be a good idea, as some alerts might be very large (e.g., with a list of block names).
In addition, there is no way to control the log level of this class, so I am inclined to keep it out.

vkuznet · 2025-01-27

Alan, this is a valid concern, but to make things traceable (i.e. to match an alert with the WM logs) we need something present in the WM logs. I'm changing this to log only the alert UUID, which will be sufficient for such a cross-check.
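
A minimal sketch of the UUID-only logging discussed in this thread, under stated assumptions: `send_alert` and the `post` callable are hypothetical stand-ins for the real method and for `self.mgr.getdata`, and the alert is assumed to be a single dict with a `labels` mapping. The point is only that the log line carries the UUID and the response status, never the full (possibly very large) payload.

```python
import logging
import uuid


def send_alert(params, post, logger):
    """Post an alert and log only its uuid and the response status.

    params: alert payload (assumed to be a dict with a "labels" mapping)
    post:   hypothetical callable that sends the payload and returns a status
    logger: standard logging.Logger
    """
    labels = params.setdefault("labels", {})
    # Reuse an existing uuid label if present, otherwise create one.
    alert_id = labels.setdefault("uuid", str(uuid.uuid4()))
    res = post(params)
    # The payload itself is deliberately kept out of the log message.
    logger.info("ALERT: uuid=%s, status=%s", alert_id, res)
    return alert_id
```

The returned UUID can then be used to search the WM logs for the matching `ALERT:` line.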

vkuznet · 2025-01-28T12:25:50Z

At the WM meeting I was asked to look up and summarize all alerts we send in the WM codebase. Here is the summary. Alerts are sent from the following places:

MSOutput
- campaign not found
- data tier not found
- generic error
MSRuleCleaner
- status advanced expired
MSTransferor
- PU misconfiguration error
- UnknownTransferError
- TransferCouchDBError
- LargeInputData

@anpicci, @amaltaro please evaluate the aforementioned alerts and let me know specific destination routes where they should be delivered.
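
To keep track of which of these alerts still lack a destination, the inventory above could be held as plain data, as in the sketch below. The alert names are copied from the summary; the `unrouted` helper and the shape of the `routes` mapping are hypothetical, and no route assignment is implied since the destinations are still under discussion.

```python
# Alert inventory, copied from the summary above.
WM_ALERTS = {
    "MSOutput": ["campaign not found", "data tier not found", "generic error"],
    "MSRuleCleaner": ["status advanced expired"],
    "MSTransferor": [
        "PU misconfiguration error",
        "UnknownTransferError",
        "TransferCouchDBError",
        "LargeInputData",
    ],
}


def unrouted(alerts, routes):
    """Return (source, alert) pairs with no destination route assigned.

    routes: hypothetical mapping of (source, alert) -> route name.
    """
    return [
        (source, name)
        for source, names in alerts.items()
        for name in names
        if (source, name) not in routes
    ]
```

Calling `unrouted(WM_ALERTS, {})` lists every alert, and the list shrinks as routes are agreed on and added to the mapping.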

amaltaro · 2025-01-28T14:07:23Z

Given that we are trying to cover all alerts, not only those routed through Prometheus, let me add two more for our discussion:

WMAgent
- Component restart
- Proxy expiration warning

vkuznet · 2025-01-28T14:18:19Z

Alan, your post about WMAgent (component restart and proxy expiration) is unclear to me. According to grep, I see neither a sendAlert call nor any usage of amtool in these cases. To clarify: are you saying that we should have alerts in these instances, or that we already have them? If the latter, please let me know where they are triggered and how we send them.

amaltaro · 2025-01-28T14:37:13Z

It is an email alert, not routed through Prometheus, hence it does not use the WMCore library and/or amtool (also because amtool would not work off-site). The reference is: https://github.com/dmwm/CMSKubernetes/blob/master/docker/pypi/wmagent/init.sh#L340-L342 and one of them is embedded in the script (component-restart)

vkuznet · 2025-02-03T17:46:25Z

Here is what I just added to the documentation in https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/73

WM alerts

At the moment we have three categories of alerts:

alerts defined by WMCore code base
WMAgent alerts covering component restart and proxy expiration warnings, see here
and alerts defined in CMS WM central services and WMAgent dashboard

amaltaro · 2025-02-03T18:43:50Z

And there is one more Grafana alert, set up by the SI team, which I don't know how to find. But here it is:

Running Jobs per Schedd - actually, schedds that entered CurbMatchmaking

vkuznet · 2025-02-03T18:51:02Z

Thanks, I found its link in the dashboard and updated both the wmcore-docs PR and my summary above.

vkuznet · 2025-02-04T18:28:59Z

I also clarified with the CMS Monitoring group how alerts in Grafana work. Here is what I found:

alerts in Grafana are managed via a separate instance of AlertManager which runs on the MONIT side,
therefore if we want to add a specific route for any alert in a Grafana dashboard we have two options:
- either contact the MONIT team and ask them to adjust the Grafana AM to add a new route, or
- specify this route directly in the Grafana plot; there is a button to create a new alert rule (which I think can be used to add a new notification, though I didn't try it)

Once we clarify which alert will be routed to which destination, we can create new notifications for alerts in Grafana or work with the CMSMonitoring/MONIT teams to add such routes.

Adjust AlertManagerAPI to avoid using multiple spaces in various attributes of alert and add UUID label for better traceability of WM alerts

a621a28

vkuznet force-pushed the fix-issue-11911 branch from 23e268b to a621a28 on January 22, 2025 15:33

vkuznet requested review from amaltaro, todor-ivanov and mapellidario January 22, 2025 15:54

mapellidario approved these changes Jan 22, 2025

todor-ivanov approved these changes Jan 23, 2025

amaltaro requested changes Jan 23, 2025
