Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust AlertManagerAPI to avoid using multiple spaces in various attibutes of alert #12237

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Jan 21, 2025

Fixes #11911

Status

ready

Description

This PR provides the following set of changes:

  • Add new normalize_spaces function in AlertManagerAPI and use it for all alert attributes
    • this will strip out all empty spaces with more then 2 space characters
  • Add UUID attribute to alert to make it traceable in WM logs
  • Add logger printout of alert (including UUID) with ALERT prefix when we send it over to Prometheus/AM URL
    • this message will then show in any WM log and become traceable via UUID

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

I modified AlertManager configuration where I performed the following changes:

External dependencies / deployment changes

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 6 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/295/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 5 warnings
    • 60 comments to review
  • Pycodestyle check: succeeded
    • 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/296/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 5 warnings
    • 60 comments to review
  • Pycodestyle check: succeeded
    • 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/297/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 5 warnings
    • 60 comments to review
  • Pycodestyle check: succeeded
    • 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/298/artifact/artifacts/PullRequestReport.html

…ibutes of alert and add UUID label for better traceability of WM alerts
@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 5 warnings
    • 60 comments to review
  • Pycodestyle check: succeeded
    • 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/299/artifact/artifacts/PullRequestReport.html

Copy link
Member

@mapellidario mapellidario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks valentin, looks good to me!

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this development also supposed to ensure that the right audience is receiving these alerts?
To me, that is the most important development that we needed. Otherwise we keep creating alerts that are meaningless to the group receiving them (or we have to bridge these alerts with Ops)

res = self.mgr.getdata(self.alertManagerUrl, params=params, headers=self.headers, verb='POST')
self.logger.info("ALERT: %s, status=%s", params, res)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be a good idea, as some alerts might be very large (e.g., with a list of block names).
In addition, there is no way to control the log level of this class, so I am inclined to keep it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alan, this is valid concern, but to make things traceable (i.e. match alert with WM logs we need something to present in WM logs). I'm changing this to alert UUID which will be sufficient to make such cross-check.

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 28, 2025

On WM meeting I've been asked to look up and summarize all alerts we send in WM codebase. Here is the summary. Alerts send from the following places:

  • MSOutput
    • campaign not found
    • data tier not found
    • generic error
  • MSRuleCleaner
    • status advanced expired
  • MSTransferor
    • PU misconfiguration error
    • UnknownTransferError
    • TransferCouchDBError
    • LargeInputData

@anpicci, @amaltaro please evaluate the aforementioned alerts and let me know specific destination routes where they should be delivered.

@amaltaro
Copy link
Contributor

amaltaro commented Jan 28, 2025

Given that we are trying to cover all alerts, not only those routed through Prometheus, let me add two more for our discussion:

  • WMAgent
    • Component restart
    • Proxy expiration warning

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 28, 2025

Alan, it is unclear to me your post about WMAgent: component restart, and proxy expiration. According to grep I don't see any sendAlert call in these cases and neither I see usage of amtools. I want to clarify are you saying that we should have alerts in these instances or are you saying that we do have alerts. If later, please let me know where those are triggered and how we send them.

@amaltaro
Copy link
Contributor

It is an email alert, not routed through Prometheus - hence not using WMCore library and/or amtool (also because amtool would not work off-site). Reference is: https://github.com/dmwm/CMSKubernetes/blob/master/docker/pypi/wmagent/init.sh#L340-L342 and one of them is embedded in the script (component-restart)

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 3, 2025

Here is what I just added to https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/73 documentation

WM alerts

At the moment we have three categories of alerts:

@amaltaro
Copy link
Contributor

amaltaro commented Feb 3, 2025

And there is one more Grafana alert - set up by the SI team - which I don't know how to find it. But here it is:

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 3, 2025

And there is one more Grafana alert - set up by the SI team - which I don't know how to find it. But here it is:

Thanks, I found its link in dashboard, and updated wmcore docs PR and in my summary above

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 4, 2025

I also clarified with CMS Monitoring group about alerts on Grafana. Here is what I found:

  • alerts in Grafana are managed via separate instance of AlertManager which runs on MONIT side,
  • therefore if we want to add specific route for any alert in Grafana dashboard we have two options:
    • either contact MONIT team and ask to adjust Grafana AM to add new route, or
    • specify this route directly in Grafana plot, there is a button to create new alert rule (which I think can be used to add new notification, I didn't try it though)

Once we clarify which alert will be route to which destinations we may create new notification for alerts in grafana or work with CMSMonitoring/MONIT teams to add such routes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve alert content and adjust their routes
5 participants