Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set team on Alert Group based on route #3459

Open
mwheeler-ep opened this issue Nov 29, 2023 · 24 comments
Open

Set team on Alert Group based on route #3459

mwheeler-ep opened this issue Nov 29, 2023 · 24 comments

Comments

@mwheeler-ep
Copy link

What would you like to see!

An integration might be used by multiple teams and then routed to the correct team using the routes/channel filters. The team for the alert group however seems to be set by the integration team setting and there is no way to change the team during routing.

Product Area

Alert Flow & Configuration

Anything else to add?

Alternatively this might be something that could be added to an escalation step?

Copy link
Contributor

The current version of Grafana OnCall, at the time this issue was opened, is v1.3.64. If your issue pertains to an older version of Grafana OnCall, please be sure to list it in the PR description. Thank you 😄!

@m4r1u2
Copy link
Contributor

m4r1u2 commented Apr 9, 2024

We also like this feature!
As part of the routing rules at the integration level, route a alert to a team (optional) and a Escalation chain.

@donghoon-lee-mrt
Copy link

We really want this feature as well!

@uschenk
Copy link

uschenk commented Aug 9, 2024

We really need this feature. We cover 12 teams in one integration and the Alert Groups are all not assigned to a team.

Only workaround sort of is assigning Labels to the alert groups.
But insight/reports can not display alert groups grouped by labels either.

@Virenta
Copy link

Virenta commented Oct 3, 2024

We are in the same boat, we manage single integration, but routing to multiple teams.

Implement it please!

@mwheeler-ep
Copy link
Author

mwheeler-ep commented Oct 18, 2024

Grafana team, it might be possible that I could get some work done for this ticket.

There's two approaches (though not mutually exclusive, so both could be implemented if we wanted).

Inherit the team from the escalation chain

  • Escalation chains already assigned a team, so when an alert group is routed to an escalation chain we could assign that team
  • Existing Grafana deploys will be impacted by this change and might have unexpected changes?

Escalation step

  • Add a step that allows changing the team as part of the escalation chain
  • Requires users to add a step to every escalation chain (not a problem for us because terraform goes brrrrr... but might be annoying for people who don't IaC oncall)

I'd love some advice as to which you'd prefer implemented first or any other changes so that if we do get some time to work on this its ready to go.

@mwheeler-ep
Copy link
Author

Thinking about the inherit from escalation chain approach and I think it could be implemented in a way that doesn't break existing default behaviour by making it an option on the "trigger escalation chain"
Image

@iskhakov @matiasb - thoughts?

@matiasb
Copy link
Contributor

matiasb commented Oct 22, 2024

Hi!

Thanks for filing the issue, the details and research around this.

Changing the team assigned to an alert group may not be that simple. Right now there is no team stored along an alert group, but it is got from (and filtered via) the integration it was received. In that sense, it seems it should be possible to switch this to use escalation chain team instead (but as you noted that would mean a behavior change that could have unexpected results, besides not every alert group will have an escalation chain necessarily).

A different alternative I can think of is to make it possible to associate multiple teams to an alert group, so you can have the team from the integration, but also inherit the team from the escalation chain (if different) and even add teams arbitrarily in the future (we would need to discuss this with the team).

OTOH, and to understand the use case (and the possible solutions), is this mostly for tracking metrics / filtering alert groups? (right now that's what you get by setting up teams) what other scenarios you are trying to solve? Thanks!

@uschenk
Copy link

uschenk commented Oct 22, 2024

Besides using the escalation chain, another option to assign the team name to an Alert Group could be within

  • the integration itself as an option of each routing.
  • inside the integration step using a script in the template processing to assign a team
...to associate multiple teams to an alert group
...alternative I can think of is to make it possible to associate multiple teams to an alert group...

Besides reporting/filtering, there is also a permission piece to it. Every record (integration record and alert groups that have no team assigned are visible and accessible by everybody who has basic permissions. Restricting access to the users assigned specific team requires a team name for these records.

I am actually also annoyed about integrations labeled as "No Team" as well. These integrations are accessible by ALL users with basic integration permissions.

@mwheeler-ep
Copy link
Author

Filtering and tracking metrics is the key outcome we are looking for. Hadn't considered the permissions side of things as it doesn't apply to our use case but certainly seems like something that needs resolving as well.

@mwheeler-ep
Copy link
Author

mwheeler-ep commented Nov 18, 2024

@matiasb I've been working through / thinking about implementation of this again and wondering a rough where to start. My thoughts based on your comments are as follows

  1. Since alertgroups currently aren't assigned a team - rather a team is grabbed from the integration for the alertgroup - the first change to make would be adding a team property to alertgroups. To begin with this would use the current method of using the integration to assign the team. We should make this property a list of teams* rather than single team to support possible future multi team alertgroups, however this doesn't need to be implemented straight away.
  2. From here we could implement multiple solutions - the one I proposed with the option on the escalation chain, or as an escalation step. The first version of this feature could be to replace the team, rather than append to the list.

* I believe the metrics exporter exports a single team property, so for multi team support a change here would need to be considered.

Would it make sense to start working on point 1 while point 2 is being figured out?

Btw I joined the community slack under @Michaela Wheeler username, though I'm not sure our work hours overlap. Happy to discuss there if it helps :)

@matiasb
Copy link
Contributor

matiasb commented Nov 18, 2024

@matiasb I've been working through / thinking about implementation of this again and wondering a rough where to start. My thoughts based on your comments are as follows

  1. Since alertgroups currently aren't assigned a team - rather a team is grabbed from the integration for the alertgroup - the first change to make would be adding a team property to alertgroups. To begin with this would use the current method of using the integration to assign the team. We should make this property a list of teams* rather than single team to support possible future multi team alertgroups, however this doesn't need to be implemented straight away.

Makes sense 👍 I would still define a M2M model between alert group and team, even if we restrict one team per alert group at first.

  1. From here we could implement multiple solutions - the one I proposed with the option on the escalation chain, or as an escalation step. The first version of this feature could be to replace the team, rather than append to the list.

Right, having the model decided and defined we should be able to work on any of these (or variations) of this possible paths.

  • I believe the metrics exporter exports a single team property, so for multi team support a change here would need to be considered.

That's a good point, and this should probably be part of the work in point 1 (ie. changing the model will require us to update how we calculate the metrics; also the team filter for insights is now based on the team from the integration, so some extra work may be needed to keep things consistent).

Would it make sense to start working on point 1 while point 2 is being figured out?

Sounds good!

Btw I joined the community slack under @Michaela Wheeler username, though I'm not sure our work hours overlap. Happy to discuss there if it helps :)

I see :-) feel free to ping me there to talk any details (I'm @matiasb there too)

Thanks for pushing this forward!

@uschenk
Copy link

uschenk commented Nov 18, 2024

I am not familiar with your code nor OnCall internals, but adding support for multiple teams per Alert Group sounds to me
like it needs to reviewed with the current security model in mind, because users are assigned to teams. What if AlertGroup permissions on team level contradict one team vs the other team vs. the user level?

@matiasb
Copy link
Contributor

matiasb commented Nov 18, 2024

I am not familiar with your code nor OnCall internals, but adding support for multiple teams per Alert Group sounds to me like it needs to reviewed with the current security model in mind, because users are assigned to teams. What if AlertGroup permissions on team level contradict one team vs the other team vs. the user level?

Yeah, in any case I wouldn't enable multiple teams support at first but I would try to keep that in mind.

About permissions, giving it a quick thought, I think as long as one of the teams allow you to access the alert group, you should be able to see it (right now there is no perms per team, only allow users outside the team to access things or not; per-user perms work at a global level too, ie. you cannot restrict access to an integration or schedule).

@mwheeler-ep
Copy link
Author

Update on this so far:

  • Have development env running again
  • Have updated the model to be m2m team for alertgroup
  • Assigning team as per the integration (like what currently happens)
  • API view for filtering on the team is working

Todo / things to consider:

  • Adding to the migration to update old alert groups with team based on integration ? Alternatively can leave the old code there, however I don't think thats a good solution and will likely cause more issues than it solves.
  • Prometheus exporter - will have to have a think about this - for now if we want to we could just grab the first value. I'll see what options there are.
  • Permissions - I'm not too familiar with team based permissions - would the intention here be adding a team based permission check that doesn't already exist? That may change current behaviour for users if I'm understanding correctly? @matiasb can you explain what needs to be done here? If I retain available_teams_lookup_arg functionality will it be ok?

@mwheeler-ep
Copy link
Author

Was able to get a little bit more of this done.
Updated the UI to support multiple teams in alert groups - even though initially this feature won't be used
Image
At the moment I've implemented my initial - update alert group toggle for routes / escalation chains.
Image

Public api has been updated to have teams field. I've left the original team_id but documented it in the api docs the difference.

@mwheeler-ep
Copy link
Author

One thing I noticed is that performing pnpm generate-types generates a lot of changes and results in an unbuilding frontend. So far I have manually added in changes to workaround this issue. Not sure if this is acceptable / workable?

@matiasb
Copy link
Contributor

matiasb commented Nov 25, 2024

Update on this so far:

* Have development env running again

* Have updated the model to be m2m team for alertgroup

* Assigning team as per the integration (like what currently happens)

* API view for filtering on the team is working

Nice!

Todo / things to consider:

* Adding to the migration to update old alert groups with team based on integration ? Alternatively can leave the old code there, however I don't think thats a good solution and will likely cause more issues than it solves.

Backfilling existing alert groups could be not that simple for setups where you have millions of alert groups, so not requiring a migration or making it online somehow could be better.

* Prometheus exporter - will have to have a think about this - for now if we want to we could just grab the first value. I'll see what options there are.

👍

* Permissions - I'm not too familiar with team based permissions - would the intention here be adding a team based permission check that doesn't already exist? That may change current behaviour for users if I'm understanding correctly? [@matiasb](https://github.com/matiasb) can you explain what needs to be done here? If I retain `available_teams_lookup_arg` functionality will it be ok?

Right now each team can decide if their resources are only visible to team members or anyone in the organization (via Settings -> Team and Access Settings). I think that if an alert group belongs to a team that allows anyone to get access, then it should be possible to check that alert group details (otherwise, you should be required to be member of any of the associated teams). What do you think? In any case, handling multiple teams can be left for a future iteration too.

@matiasb
Copy link
Contributor

matiasb commented Nov 25, 2024

One thing I noticed is that performing pnpm generate-types generates a lot of changes and results in an unbuilding frontend. So far I have manually added in changes to workaround this issue. Not sure if this is acceptable / workable?

There are some changes in progress related to how the frontend bits are managed. I guess any issues around this should be workable later if needed.

@mwheeler-ep
Copy link
Author

Thanks for the updates, this is really useful! Progress has been a bit slower than I hoped due to this being a lower priority task and few important things popping up.

Progress report

  • First cut at the exporter seems to be working
  • Backfilling issue sorted
  • TODO double check permissions side of things
  • TODO a bunch of tests

Backfilling

Understood - I've reworked my approach to fallback to integration when there's otherwise no value. This has made things easier as I don't need to write a migration for it 😅

Insights stuff

So I think I have the promethus / insights stuff mostly working.

Image
Image

The implementation I have right now uses a tuple with integration id + team id
This is probably fine where an alertgroup is involved I can iterate over the team list for the alert group

for integration in integrations:
    alert_group_teams = integration.alert_groups.values_list('teams', flat=True).distinct()
    for alert_group_team_id in alert_group_teams:

but then with many of the helper.py related functions we don't have an alertgroup to work with at the moment I'm doing for team in organization.teams.all():

For example:

for integration_id, service_data in states_diff.items():
    for team in organization.teams.all():

I'm not sure if this is a performance concern I should worry about or not. I'm presuming most orgs have tens of teams with larger orgs many having hundreds? But I don't really have the intel to know.

I could rework this to have another layer of cache which stores what teams the exporter needs to care about but I'm worried about the complexity that might bring.

@matiasb
Copy link
Contributor

matiasb commented Dec 2, 2024

Thanks for the updates, this is really useful! Progress has been a bit slower than I hoped due to this being a lower priority task and few important things popping up.

Progress report

* First cut at the exporter seems to be working

* Backfilling issue sorted

* TODO double check permissions side of things

* TODO a bunch of tests

Nice! 👍

Backfilling

Understood - I've reworked my approach to fallback to integration when there's otherwise no value. This has made things easier as I don't need to write a migration for it 😅

Curious, the idea is that the integration team is lost if another team is added? How complex does the alert group team filtering logic get handling both cases?

Insights stuff

So I think I have the promethus / insights stuff mostly working.

Image Image

If an alert group has more than a team associated, it will be counted for each team, right?

The implementation I have right now uses a tuple with integration id + team id This is probably fine where an alertgroup is involved I can iterate over the team list for the alert group

for integration in integrations:
    alert_group_teams = integration.alert_groups.values_list('teams', flat=True).distinct()
    for alert_group_team_id in alert_group_teams:

but then with many of the helper.py related functions we don't have an alertgroup to work with at the moment I'm doing for team in organization.teams.all():

For example:

for integration_id, service_data in states_diff.items():
    for team in organization.teams.all():

I'm not sure if this is a performance concern I should worry about or not. I'm presuming most orgs have tens of teams with larger orgs many having hundreds? But I don't really have the intel to know.

I could rework this to have another layer of cache which stores what teams the exporter needs to care about but I'm worried about the complexity that might bring.

In the above case, it may make sense to query all the teams in the org once, outside the loop and reuse that? Using some prefetch in the query may also help. But I could be missing context, do you have a WIP PR or branch to reference?

Thanks!

@mwheeler-ep
Copy link
Author

mwheeler-ep commented Dec 2, 2024

Very much WIP - #5320 - I'm not even sure if things like exporter / metric calls for updating teams make sense with this approach but haven't had time to get my head around yet. And I haven't reviewed any of these changes.

Curious, the idea is that the integration team is lost if another team is added? How complex does the alert group team filtering logic get handling both cases?

I've placed the check in the api serialiser so the filtering logic doesn't really change. The limitation I can see with this approach is that right now you can't configure it to "remove" or I guess unset the team. We could have this show both teams (the integration + the alert group) easily as well, but I didn't feel like there was much of a use case for that?

If an alert group has more than a team associated, it will be counted for each team, right?

It should do! I haven't tested and confirmed it and I'm not very confident with my changes in this space yet.

@mwheeler-ep
Copy link
Author

Curious, the idea is that the integration team is lost if another team is added? How complex does the alert group team filtering logic get handling both cases?

Just realised that I will need to add to the filtering logic in the view API. Will have a think about this

@mwheeler-ep
Copy link
Author

I've updated the PR and put some comments around it. Would love some comments on how to move it to the next stage, or if anyone else would like to pick it up and run with it?

It works locally but I haven't used it in anger yet nor at scale. I also can't run a lot of the automated tooling due to not having a Docker licence / ability to run tilt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants