Set team on Alert Group based on route #3459

mwheeler-ep · 2023-11-29T20:50:29Z

What would you like to see?

An integration might be used by multiple teams, with alerts then routed to the correct team using the routes/channel filters. The team for the alert group, however, seems to be set by the integration's team setting, and there is no way to change the team during routing.

Product Area

Alert Flow & Configuration

Anything else to add?

Alternatively this might be something that could be added to an escalation step?

github-actions · 2023-11-29T20:50:49Z

The current version of Grafana OnCall, at the time this issue was opened, is v1.3.64. If your issue pertains to an older version of Grafana OnCall, please be sure to list it in the PR description. Thank you 😄!

m4r1u2 · 2024-04-09T07:40:43Z

We also like this feature!
As part of the routing rules at the integration level, it should be possible to route an alert to a team (optional) and an escalation chain.

donghoon-lee-mrt · 2024-07-23T01:23:30Z

We really want this feature as well!

uschenk · 2024-08-09T19:20:35Z

We really need this feature. We cover 12 teams in one integration and the Alert Groups are all not assigned to a team.

The only workaround, sort of, is assigning labels to the alert groups.
But insights/reports cannot display alert groups grouped by labels either.

Virenta · 2024-10-03T10:20:06Z

We are in the same boat: we manage a single integration but route to multiple teams.

Implement it please!

mwheeler-ep · 2024-10-18T03:44:47Z

Grafana team, it might be possible that I could get some work done for this ticket.

There are two approaches (not mutually exclusive, so both could be implemented if we wanted).

Inherit the team from the escalation chain

* Escalation chains already have a team assigned, so when an alert group is routed to an escalation chain we could assign that team to the alert group.
* Existing Grafana deploys would be impacted by this change and might see unexpected changes?

Escalation step

* Add a step that allows changing the team as part of the escalation chain.
* Requires users to add a step to every escalation chain (not a problem for us because terraform goes brrrrr... but might be annoying for people who don't IaC oncall).

I'd love some advice as to which you'd prefer implemented first, or any other changes, so that if we do get some time to work on this it's ready to go.

mwheeler-ep · 2024-10-20T23:31:37Z

Thinking about the inherit-from-escalation-chain approach, I think it could be implemented in a way that doesn't break existing default behaviour by making it an option on the "trigger escalation chain".

@iskhakov @matiasb - thoughts?
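To make the opt-in idea above concrete, here is a minimal sketch in plain Python. It is not OnCall code: `Route`, `EscalationChain`, `inherit_team_from_chain` and `resolve_team` are all hypothetical names invented for illustration. The assumption is that the flag defaults to off, so existing setups keep the integration's team.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalationChain:
    name: str
    team: Optional[str] = None  # a chain may have no team assigned


@dataclass
class Route:
    # A route may have no escalation chain at all.
    escalation_chain: Optional[EscalationChain] = None
    # Hypothetical opt-in flag; off by default so current behaviour is unchanged.
    inherit_team_from_chain: bool = False


def resolve_team(integration_team: Optional[str], route: Route) -> Optional[str]:
    """Return the team an alert group would end up with after routing."""
    chain = route.escalation_chain
    if route.inherit_team_from_chain and chain is not None and chain.team is not None:
        return chain.team
    # Flag off, no chain, or chain without a team: keep the integration's team.
    return integration_team


db_chain = EscalationChain(name="db-oncall", team="database")
print(resolve_team("platform", Route(db_chain)))  # flag off
print(resolve_team("platform", Route(db_chain, inherit_team_from_chain=True)))
```

With the flag off the integration's team is kept, and a route without an escalation chain also falls through to the integration's team.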

matiasb · 2024-10-22T14:27:35Z

Hi!

Thanks for filing the issue, the details and research around this.

Changing the team assigned to an alert group may not be that simple. Right now there is no team stored with an alert group; it is taken from (and filtered via) the integration that received it. In that sense, it seems it should be possible to switch this to use the escalation chain's team instead (but, as you noted, that would mean a behavior change that could have unexpected results, and besides, not every alert group will necessarily have an escalation chain).

A different alternative I can think of is to make it possible to associate multiple teams to an alert group, so you can have the team from the integration, but also inherit the team from the escalation chain (if different) and even add teams arbitrarily in the future (we would need to discuss this with the team).

OTOH, and to understand the use case (and the possible solutions), is this mostly for tracking metrics / filtering alert groups? (Right now that's what you get by setting up teams.) What other scenarios are you trying to solve? Thanks!

uschenk · 2024-10-22T16:00:02Z

Besides using the escalation chain, other options for assigning a team to an Alert Group could be:

* within the integration itself, as an option on each route;
* inside the integration, using a script in the template processing to assign a team.

> ...alternative I can think of is to make it possible to associate multiple teams to an alert group...

Besides reporting/filtering, there is also a permission piece to it. Every record (integrations and alert groups) that has no team assigned is visible and accessible to everybody who has basic permissions. Restricting access to the users assigned to a specific team requires a team on these records.

I am actually annoyed about integrations labeled as "No Team" as well. These integrations are accessible by ALL users with basic integration permissions.

mwheeler-ep · 2024-10-28T22:27:18Z

Filtering and tracking metrics is the key outcome we are looking for. Hadn't considered the permissions side of things as it doesn't apply to our use case but certainly seems like something that needs resolving as well.

mwheeler-ep · 2024-11-18T02:40:16Z

@matiasb I've been working through / thinking about the implementation of this again and wondering roughly where to start. My thoughts, based on your comments, are as follows:

1. Since alertgroups currently aren't assigned a team - rather a team is grabbed from the integration for the alertgroup - the first change to make would be adding a team property to alertgroups. To begin with this would use the current method of using the integration to assign the team. We should make this property a list of teams* rather than a single team to support possible future multi-team alertgroups, however this doesn't need to be implemented straight away.
2. From here we could implement multiple solutions - the one I proposed with the option on the escalation chain, or as an escalation step. The first version of this feature could be to replace the team, rather than append to the list.

* I believe the metrics exporter exports a single team property, so for multi-team support a change here would need to be considered.

Would it make sense to start working on point 1 while point 2 is being figured out?

Btw I joined the community slack under @Michaela Wheeler username, though I'm not sure our work hours overlap. Happy to discuss there if it helps :)

matiasb · 2024-11-18T14:42:18Z

> @matiasb I've been working through / thinking about the implementation of this again and wondering roughly where to start. My thoughts, based on your comments, are as follows:
>
> Since alertgroups currently aren't assigned a team - rather a team is grabbed from the integration for the alertgroup - the first change to make would be adding a team property to alertgroups. To begin with this would use the current method of using the integration to assign the team. We should make this property a list of teams* rather than a single team to support possible future multi-team alertgroups, however this doesn't need to be implemented straight away.

Makes sense 👍 I would still define an M2M model between alert group and team, even if we restrict it to one team per alert group at first.

> From here we could implement multiple solutions - the one I proposed with the option on the escalation chain, or as an escalation step. The first version of this feature could be to replace the team, rather than append to the list.

Right, with the model decided and defined we should be able to work on any of these possible paths (or variations of them).

> I believe the metrics exporter exports a single team property, so for multi-team support a change here would need to be considered.

That's a good point, and this should probably be part of the work in point 1 (i.e. changing the model will require us to update how we calculate the metrics; also, the team filter for insights is currently based on the team from the integration, so some extra work may be needed to keep things consistent).

> Would it make sense to start working on point 1 while point 2 is being figured out?

Sounds good!

> Btw I joined the community slack under @Michaela Wheeler username, though I'm not sure our work hours overlap. Happy to discuss there if it helps :)

I see :-) feel free to ping me there to talk through any details (I'm @matiasb there too)

Thanks for pushing this forward!

uschenk · 2024-11-18T17:55:03Z

I am not familiar with your code or OnCall internals, but adding support for multiple teams per Alert Group sounds to me like it needs to be reviewed with the current security model in mind, because users are assigned to teams. What if AlertGroup permissions at the team level contradict each other (one team vs. the other team vs. the user level)?

matiasb · 2024-11-18T18:22:56Z

> I am not familiar with your code or OnCall internals, but adding support for multiple teams per Alert Group sounds to me like it needs to be reviewed with the current security model in mind, because users are assigned to teams. What if AlertGroup permissions at the team level contradict each other (one team vs. the other team vs. the user level)?

Yeah, in any case I wouldn't enable multiple teams support at first but I would try to keep that in mind.

About permissions, giving it a quick thought, I think as long as one of the teams allows you to access the alert group, you should be able to see it (right now there are no perms per team, only whether users outside the team may access things or not; per-user perms work at a global level too, i.e. you cannot restrict access to an integration or schedule).

mwheeler-ep · 2024-11-21T06:03:05Z

Update on this so far:

* Have development env running again
* Have updated the model to be m2m team for alertgroup
* Assigning team as per the integration (like what currently happens)
* API view for filtering on the team is working

Todo / things to consider:

* Adding to the migration to update old alert groups with team based on integration? Alternatively we can leave the old code there, however I don't think that's a good solution and it will likely cause more issues than it solves.
* Prometheus exporter - will have to have a think about this - for now, if we want to, we could just grab the first value. I'll see what options there are.
* Permissions - I'm not too familiar with team based permissions - would the intention here be adding a team based permission check that doesn't already exist? That may change current behaviour for users if I'm understanding correctly? @matiasb can you explain what needs to be done here? If I retain available_teams_lookup_arg functionality will it be ok?

mwheeler-ep · 2024-11-22T01:10:24Z

Was able to get a little bit more of this done.
Updated the UI to support multiple teams in alert groups - even though initially this feature won't be used.

At the moment I've implemented my initial idea: an "update alert group" toggle for routes / escalation chains.

The public API has been updated to have a teams field. I've left the original team_id in place and documented the difference in the API docs.

mwheeler-ep · 2024-11-22T01:12:29Z

One thing I noticed is that running `pnpm generate-types` generates a lot of changes and results in a frontend that no longer builds. So far I have manually added in changes to work around this issue. Not sure if this is acceptable / workable?

matiasb · 2024-11-25T12:28:50Z

> Update on this so far:
>
> * Have development env running again
> * Have updated the model to be m2m team for alertgroup
> * Assigning team as per the integration (like what currently happens)
> * API view for filtering on the team is working

Nice!

> Todo / things to consider:
>
> * Adding to the migration to update old alert groups with team based on integration? Alternatively we can leave the old code there, however I don't think that's a good solution and it will likely cause more issues than it solves.

Backfilling existing alert groups might not be that simple for setups with millions of alert groups, so not requiring a migration, or making it online somehow, could be better.

> * Prometheus exporter - will have to have a think about this - for now, if we want to, we could just grab the first value. I'll see what options there are.

👍

> * Permissions - I'm not too familiar with team based permissions - would the intention here be adding a team based permission check that doesn't already exist? That may change current behaviour for users if I'm understanding correctly? [@matiasb](https://github.com/matiasb) can you explain what needs to be done here? If I retain `available_teams_lookup_arg` functionality will it be ok?

Right now each team can decide if their resources are only visible to team members or to anyone in the organization (via Settings -> Team and Access Settings). I think that if an alert group belongs to a team that allows anyone to get access, then it should be possible to view that alert group's details (otherwise, you should be required to be a member of one of the associated teams). What do you think? In any case, handling multiple teams can be left for a future iteration too.
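The access rule suggested here can be written down as a small predicate. This is only a sketch of the rule as described in the thread, not of OnCall's actual permission code: `Team`, `User`, `is_public` and `can_view_alert_group` are hypothetical names, and treating an alert group with no teams as visible to everyone is an assumption based on the "No Team" behaviour mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Team:
    name: str
    # Assumed to mirror the "visible to anyone in the organization" team setting.
    is_public: bool = True


@dataclass
class User:
    name: str
    teams: Set[Team] = field(default_factory=set)


def can_view_alert_group(user: User, alert_group_teams: Set[Team]) -> bool:
    """Allow access if any associated team is public or the user belongs to it."""
    if not alert_group_teams:
        # Assumption: records without a team stay visible to everyone.
        return True
    return any(team.is_public or team in user.teams for team in alert_group_teams)


database = Team("database", is_public=False)
web = Team("web", is_public=True)
outsider = User("bob")
print(can_view_alert_group(outsider, {database}))       # private team only
print(can_view_alert_group(outsider, {database, web}))  # one public team is enough
```

Under this rule, adding a public team to an alert group widens access, which is the multi-team interaction raised earlier in the thread.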

matiasb · 2024-11-25T12:30:31Z

> One thing I noticed is that performing pnpm generate-types generates a lot of changes and results in an unbuilding frontend. So far I have manually added in changes to workaround this issue. Not sure if this is acceptable / workable?

There are some changes in progress related to how the frontend bits are managed. I guess any issues around this should be workable later if needed.

mwheeler-ep · 2024-11-29T04:52:32Z

Thanks for the updates, this is really useful! Progress has been a bit slower than I hoped due to this being a lower priority task and a few important things popping up.

Progress report

* First cut at the exporter seems to be working
* Backfilling issue sorted
* TODO double check permissions side of things
* TODO a bunch of tests

Backfilling

Understood - I've reworked my approach to fall back to the integration's team when there's otherwise no value. This has made things easier as I don't need to write a migration for it 😅
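As a rough illustration of the fallback described above (not the code from the PR), the lookup could look like the following. `effective_teams` and its parameters are hypothetical names; teams are plain strings here to keep the sketch self-contained.

```python
from typing import List, Optional


def effective_teams(explicit_teams: List[str], integration_team: Optional[str]) -> List[str]:
    """Teams to report for an alert group.

    Explicitly assigned teams win; otherwise fall back to the integration's
    team, so alert groups created before the change need no backfill.
    """
    if explicit_teams:
        return list(explicit_teams)
    return [integration_team] if integration_team is not None else []


print(effective_teams([], "platform"))            # old alert group, no explicit team
print(effective_teams(["database"], "platform"))  # team set during routing
```

One consequence of this shape is that an empty list cannot mean "explicitly no team": it always resolves to the integration's team when one exists.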

Insights stuff

So I think I have the Prometheus / insights stuff mostly working.

The implementation I have right now uses a tuple with integration id + team id.
This is probably fine where an alertgroup is involved, as I can iterate over the team list for the alert group:

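# For each integration, collect the distinct team ids attached to its alert groups
for integration in integrations:
    alert_group_teams = integration.alert_groups.values_list('teams', flat=True).distinct()
    for alert_group_team_id in alert_group_teams:
        ...  # body omitted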

But many of the helper.py related functions don't have an alertgroup to work with; at the moment I'm doing `for team in organization.teams.all():` there.

For example:

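for integration_id, service_data in states_diff.items():
    # No alert group is available here, so every team in the organization is visited
    for team in organization.teams.all():
        ...  # body omitted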

I'm not sure if this is a performance concern I should worry about or not. I'm presuming most orgs have tens of teams, with larger orgs maybe having hundreds? But I don't really have the intel to know.

I could rework this to have another layer of cache which stores what teams the exporter needs to care about but I'm worried about the complexity that might bring.

matiasb · 2024-12-02T18:38:42Z

> Thanks for the updates, this is really useful! Progress has been a bit slower than I hoped due to this being a lower priority task and few important things popping up.
>
> Progress report
>
> * First cut at the exporter seems to be working
> * Backfilling issue sorted
> * TODO double check permissions side of things
> * TODO a bunch of tests

Nice! 👍

> Backfilling
>
> Understood - I've reworked my approach to fallback to integration when there's otherwise no value. This has made things easier as I don't need to write a migration for it 😅

Curious, the idea is that the integration team is lost if another team is added? How complex does the alert group team filtering logic get handling both cases?

> Insights stuff
>
> So I think I have the promethus / insights stuff mostly working.

If an alert group has more than one team associated, it will be counted for each team, right?

> The implementation I have right now uses a tuple with integration id + team id. This is probably fine where an alertgroup is involved I can iterate over the team list for the alert group
>
>     for integration in integrations:
>         alert_group_teams = integration.alert_groups.values_list('teams', flat=True).distinct()
>         for alert_group_team_id in alert_group_teams:
>
> but then with many of the helper.py related functions we don't have an alertgroup to work with at the moment I'm doing `for team in organization.teams.all():`
>
> For example:
>
>     for integration_id, service_data in states_diff.items():
>         for team in organization.teams.all():
>
> I'm not sure if this is a performance concern I should worry about or not. I'm presuming most orgs have tens of teams with larger orgs many having hundreds? But I don't really have the intel to know.
>
> I could rework this to have another layer of cache which stores what teams the exporter needs to care about but I'm worried about the complexity that might bring.

In the above case, it may make sense to query all the teams in the org once, outside the loop and reuse that? Using some prefetch in the query may also help. But I could be missing context, do you have a WIP PR or branch to reference?
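The "query once, outside the loop" suggestion can be sketched without Django. In the sketch below, `CountingTeamSource` is a hypothetical stand-in for `organization.teams` that counts how often it is evaluated, and `metric_keys` is an invented helper, not a function from the exporter.

```python
from typing import Dict, List, Tuple


class CountingTeamSource:
    """Stand-in for a team query; counts how often it is evaluated."""

    def __init__(self, team_ids: List[int]) -> None:
        self._team_ids = list(team_ids)
        self.calls = 0

    def all(self) -> List[int]:
        self.calls += 1
        return list(self._team_ids)


def metric_keys(states_diff: Dict[int, dict], teams: CountingTeamSource) -> List[Tuple[int, int]]:
    """Build (integration_id, team_id) keys, fetching the teams only once."""
    team_ids = teams.all()  # evaluated once, before the loop
    keys = []
    for integration_id in states_diff:
        for team_id in team_ids:
            keys.append((integration_id, team_id))
    return keys


teams = CountingTeamSource([10, 20])
keys = metric_keys({1: {}, 2: {}, 3: {}}, teams)
print(len(keys), teams.calls)  # prints: 6 1
```

Calling `teams.all()` inside the outer loop instead would evaluate the source once per integration; with a real queryset that would typically mean one database query per integration unless the teams were prefetched.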

Thanks!

mwheeler-ep · 2024-12-02T22:42:44Z

Very much WIP - #5320 - I'm not even sure if things like exporter / metric calls for updating teams make sense with this approach, but I haven't had time to get my head around it yet. And I haven't reviewed any of these changes.

> Curious, the idea is that the integration team is lost if another team is added? How complex does the alert group team filtering logic get handling both cases?

I've placed the check in the api serialiser so the filtering logic doesn't really change. The limitation I can see with this approach is that right now you can't configure it to "remove" or I guess unset the team. We could have this show both teams (the integration + the alert group) easily as well, but I didn't feel like there was much of a use case for that?

> If an alert group has more than a team associated, it will be counted for each team, right?

It should do! I haven't tested and confirmed it and I'm not very confident with my changes in this space yet.
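For reference, counting an alert group once per associated team could look like the sketch below. It uses plain tuples and a `Counter`; `count_by_integration_and_team` is a hypothetical helper, and using `None` as the key for alert groups without a team is an assumption made for the example.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

AlertGroup = Tuple[int, List[int]]  # (integration_id, team_ids)


def count_by_integration_and_team(
    alert_groups: Iterable[AlertGroup],
) -> "Counter[Tuple[int, Optional[int]]]":
    """Count alert groups per (integration_id, team_id) key."""
    counts: "Counter[Tuple[int, Optional[int]]]" = Counter()
    for integration_id, team_ids in alert_groups:
        # An alert group with several teams contributes to each of them once;
        # one without teams is counted under the key (integration_id, None).
        for team_id in set(team_ids) or {None}:
            counts[(integration_id, team_id)] += 1
    return counts


groups = [(1, [10, 20]), (1, [10]), (2, [])]
counts = count_by_integration_and_team(groups)
print(counts[(1, 10)], counts[(1, 20)], counts[(2, None)])
```

Note that the per-team counts then sum to more than the number of alert groups (four counts for three groups in this example), which matters for any dashboard that totals across teams.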

mwheeler-ep · 2024-12-03T05:29:27Z

> Curious, the idea is that the integration team is lost if another team is added? How complex does the alert group team filtering logic get handling both cases?

Just realised that I will need to add to the filtering logic in the view API. Will have a think about this

mwheeler-ep · 2025-01-06T02:06:29Z

I've updated the PR and put some comments around it. Would love some comments on how to move it to the next stage, or if anyone else would like to pick it up and run with it?

It works locally but I haven't used it in anger yet nor at scale. I also can't run a lot of the automated tooling due to not having a Docker licence / ability to run tilt.

mwheeler-ep added the "feature request" label Nov 29, 2023

github-actions bot added the part:alert flow & configuration label Nov 29, 2023

iskhakov added the needs triage label May 21, 2024
