# chore(perf-issues): Add the docs for new experiments #91589

Changes to `src/sentry/utils/performance_issues/README.md` (54 additions & 7 deletions):

For context, the Issue Platform operates off of the `GroupType` subclasses (from [grouptype.py](../../issues/grouptype.py)). The [issue platform docs](https://develop.sentry.dev/backend/issue-platform/#releasing-your-issue-type) provide a way to control the rollout of new `GroupType`s via the `released` property. Keep in mind, **this rollout is entirely separate from the `PerformanceDetector`**! The checks within `_detect_performance_problem` may skip running the detector, or skip creating problems entirely, all before they ever reach the Issue Platform.
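
A minimal sketch of the shape of a `GroupType` entry and its `released` gate. The real dataclass in grouptype.py carries more fields (category, noise config, etc.), and every name and id below is illustrative, not an actual Sentry group type:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupType:
    type_id: int
    slug: str
    description: str
    released: bool = False  # gates the Issue Platform rollout


# A hypothetical new performance group type; the Issue Platform keeps it
# gated until `released` is explicitly flipped to True.
MY_NEW_GROUP = GroupType(
    type_id=1099,  # hypothetical id
    slug="performance_my_new_problem",
    description="My New Problem",
)
```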

### Detector Assumptions

The current performance detectors make some assumptions about the spans that are passed along to `visit_span()` when they are traversing the transaction event and scanning for problems. It's possible some detectors may assemble their own data structure internally, but generally performance detectors assume:

- The list of spans is flat and sequential, ordered the same way they are presented in the trace view.
- The spans are scoped to a single transaction. Detectors do not persist data between separate events or handle streaming; everything is in memory.
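
A sketch of the input shape those assumptions describe: one in-memory transaction event, with a flat list of spans in trace-view order, where hierarchy is expressed only through `parent_span_id` references. The keys below mirror the transaction event schema but are illustrative, not exhaustive:

```python
# One fully in-memory event; detectors never see more than this at once.
event = {
    "event_id": "a" * 32,
    "spans": [
        # Flat list, trace-view order; nesting exists only via parent_span_id.
        {"span_id": "span1", "parent_span_id": None, "op": "http.server",
         "start_timestamp": 0.0, "timestamp": 1.2},
        {"span_id": "span2", "parent_span_id": "span1", "op": "db",
         "description": "SELECT * FROM users", "start_timestamp": 0.1, "timestamp": 0.3},
        {"span_id": "span3", "parent_span_id": "span1", "op": "db",
         "description": "SELECT * FROM users", "start_timestamp": 0.3, "timestamp": 0.5},
    ],
}

# A detector's visit_span() is handed each of these exactly once, in order:
visited_ops = [span["op"] for span in event["spans"]]
```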

## Adding a Detector

There are quite a few places which need to be updated when adding a new performance detector:
- [ ] Add a key to `DetectorType` in [base.py](./base.py)
- [ ] Register a `performance.issues.<detector_name>.problem-creation` option in [defaults.py](../../options/defaults.py)
- [ ] Add an entry to `DETECTOR_TYPE_ISSUE_CREATION_TO_SYSTEM_OPTION` in [base.py](./base.py) with that new option
- [ ] Create a new `GroupType` in [grouptype.py](../../issues/grouptype.py)
- [ ] Update `get_detection_settings()` (in [performance_detection.py](./performance_detection.py))
- [ ] Add a key for your `DetectorType`, with a value of an empty dictionary
- [ ] If your value is not customizable, add it to the dictionary
- [ ] Implement `visit_span()` and `on_complete()`, adding any identified `PerformanceProblem`s to `self.stored_problems` as you go.
- [ ] Leverage the [Writing Detectors docs](https://develop.sentry.dev/backend/issue-platform/writing-detectors/) which can help guide your detector's design.
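
The `visit_span()`/`on_complete()` contract from the checklist can be sketched with a toy duplicate-query detector. This stand-in is not the real `PerformanceDetector` base class from base.py; the class name, problem payload, and fingerprint scheme here are all illustrative:

```python
from collections import defaultdict


class DuplicateQueryDetector:
    """Toy detector: flags a problem when one db description repeats 3+ times."""

    def __init__(self) -> None:
        self.stored_problems = {}       # fingerprint -> problem payload
        self._seen = defaultdict(list)  # description -> offending span ids

    def visit_span(self, span: dict) -> None:
        # Called once per span, in trace order, for a single transaction.
        if span.get("op") != "db":
            return
        self._seen[span.get("description", "")].append(span["span_id"])

    def on_complete(self) -> None:
        # Called after the final span; assemble PerformanceProblem-like payloads.
        for description, span_ids in self._seen.items():
            if len(span_ids) >= 3:
                fingerprint = f"1-duplicate-query-{hash(description)}"
                self.stored_problems[fingerprint] = {
                    "desc": description,
                    "offender_span_ids": span_ids,
                }


detector = DuplicateQueryDetector()
spans = [{"span_id": str(i), "op": "db", "description": "SELECT 1"} for i in range(3)]
for span in spans:
    detector.visit_span(span)
detector.on_complete()
```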

## Running Experiments

Since performance detectors have such high throughput (they operate on all ingested spans), changes to things like logical flow or fingerprinting can have huge unintended consequences. We do the best we can to write tests, but it's not realistic to cover every edge case we'll see in production, so we have to be strategic about making these sorts of edits.

Currently, we have a loose system called `experiments`, which can be found in the [/detectors/experiments](./detectors/experiments/) and [tests/.../performance_issues/experiments](../../../../tests/sentry/utils/performance_issues/experiments/) directories. Experiments are useful when we want to make a risky change to a live detector but double-check the output before releasing it. You can run a new experiment by following these steps:

### Set Up the Experiment

- [ ] Create a new `GroupType` in [grouptype.py](../../issues/grouptype.py). By convention, the `type_id` uses 19xx (e.g. 1002 -> 1902), and the `slug` and `description` include 'experimental'. **Ensure `released` is set to `False`.**
- [ ] Add a key to `DetectorType` in [base.py](./base.py).
- [ ] Update `get_detection_settings()` (in [performance_detection.py](./performance_detection.py)) with the new experimental `DetectorType` key. The value can stay as a copy of the existing detector.
- [ ] Copy/paste the existing detector into [/detectors/experiments](./detectors/experiments/)
- [ ] Rename the detector class to include `Experimental`
- [ ] Swap usage of the existing `DetectorType` and `GroupType` to the experimental ones
- [ ] If you want to immediately start running the detector on all production data, override `is_detection_allowed_for_system`; otherwise, you'll have to meter the detector rollout with an entry in `DETECTOR_TYPE_ISSUE_CREATION_TO_SYSTEM_OPTION` in [base.py](./base.py)
- [ ] **Ensure fingerprints do not collide with the existing detector**
  - A lot of detectors use `GroupType` for fingerprinting, but if yours does not, make a temporary change manually. If you skip this, then once you start ingesting production data, the groups created will be identical to the ones a GA would create, but without running `post_process`. That means once you GA, the issue platform will see that the groups _already exist_ and won't fire notifications, trigger alerts, assign correctly, or run any of the other steps from `post_process`.
- [ ] Add the experimental `PerformanceDetector` to `DETECTOR_CLASSES` in [performance_detection.py](./performance_detection.py)
- [ ] Next, copy/paste the detector test into [tests/.../performance_issues/experiments](../../../../tests/sentry/utils/performance_issues/experiments/)
- [ ] Again, swap the `DetectorType` and `GroupType` to the experimental ones
- [ ] Correct any failing tests in the new experiment file
- [ ] Correct any failing tests in the existing file. You may be able to just use the [`@exclude_experimental_detectors()`](../../testutils/performance_issues/experiments.py) decorator for a quick edit.
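
The class-copy steps above can be sketched roughly as follows. Real detectors subclass `PerformanceDetector` and use real `DetectorType`/`GroupType` values; every name below (the toy detector, the string stand-ins for the types, the fingerprint helper) is an assumption for illustration:

```python
class NPlusOneDetector:                     # stands in for an existing, released detector
    type = "n_plus_one"                     # stand-in for a DetectorType member
    group_type = "performance_n_plus_one"   # stand-in for a GroupType

    @classmethod
    def fingerprint(cls, problem_key: str) -> str:
        return f"1-{cls.group_type}-{problem_key}"

    @classmethod
    def is_detection_allowed_for_system(cls) -> bool:
        return False  # metered via the system option mapping


class ExperimentalNPlusOneDetector(NPlusOneDetector):
    # Swap both identifiers to the experimental ones...
    type = "experiment_n_plus_one"
    group_type = "performance_n_plus_one_experimental"

    @classmethod
    def is_detection_allowed_for_system(cls) -> bool:
        # Run on all production data immediately, skipping the option check.
        return True


# Because the fingerprint incorporates group_type, experimental groups can
# never collide with the released detector's groups:
released = NPlusOneDetector.fingerprint("abc")
experimental = ExperimentalNPlusOneDetector.fingerprint("abc")
```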

### Making Changes

Once all of the above is merged, we can begin making functional changes. You can try out new detection schemes, strategies, or fingerprinting without any impact on the user experience. You'll also be able to add tests and iterate on the changes as much as you want without affecting the existing detector.

One note, though: keep an eye on the metrics for `run_detector_on_data.<detector-name>` to ensure the experiment doesn't unreasonably increase runtime compared to the existing detector.
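
As a rough local stand-in for that metric, you can get a first-order signal by timing both detectors over the same spans. The classes below are placeholders that only honor the `visit_span()`/`on_complete()` contract, not real Sentry detectors:

```python
import time


def time_detector(detector, spans) -> float:
    """Run one detector over a span list and return elapsed wall time."""
    start = time.perf_counter()
    for span in spans:
        detector.visit_span(span)
    detector.on_complete()
    return time.perf_counter() - start


class OriginalDetector:
    def visit_span(self, span) -> None:
        pass

    def on_complete(self) -> None:
        pass


class ExperimentalDetector(OriginalDetector):
    def visit_span(self, span) -> None:
        # An extra lookup or string operation per span adds up at ingest volume.
        _ = span.get("description", "").lower()


spans = [{"op": "db", "description": "SELECT 1"} for _ in range(10_000)]
baseline = time_detector(OriginalDetector(), spans)
candidate = time_detector(ExperimentalDetector(), spans)
ratio = candidate / baseline if baseline else float("inf")
```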

To start ingesting user data, you can use the issue platform [release flags](https://develop.sentry.dev/backend/issue-platform/#releasing-your-issue-type):

```sh
sentry createissueflag --slug=<experiment-grouptype-slug> --owner=<your-team>
```

You'll only want to GA the 'ingest' flag, but the 'post-process' and 'ui' flags can help you get an end-to-end experience prior to adding your changes to the existing detector.

Once the change has been deemed sufficiently tested, reverse the setup:

- [ ] Copy/paste the experiment into [/detectors/](./detectors/), replacing the original
- [ ] Remove `Experimental` from the detector class name
- [ ] Swap usage of the experimental `DetectorType` and `GroupType` to the existing ones
- [ ] Remove the override for `is_detection_allowed_for_system` if applicable
- [ ] **Revert any manual changes to the fingerprinting**
- [ ] Next, copy/paste the experiment test into [tests/.../performance_issues/](../../../../tests/sentry/utils/performance_issues/), replacing the original
- [ ] Again, swap the experimental `DetectorType` and `GroupType` to the existing ones
- [ ] Correct any failing tests in the file

Once all that is good to go, ensure you clean up the experiment files and any options/flags you may have set while iterating.