
✨ Performance Alerting #2081

Open: wants to merge 1 commit into main from metrics-alerting

Conversation

@dtfranz (Contributor) commented Jul 8, 2025

Description

Introduces an early-warning set of Prometheus alerts intended to catch performance issues at an early stage of development.

As the e2e tests run, the installed Prometheus instance scrapes metrics from catalogd and operator-controller and fires alerts based on the rules introduced in this PR. Since these tests run on GitHub-hosted runners, which do not have consistent performance, the alerts must be based on platform-independent metrics and are therefore limited. Any other ideas for metrics to check on this PR are appreciated!

Once the e2e tests finish, Prometheus is queried for active alerts. Any alerts found in the pending state result in a warning on the e2e workflow; any alerts in the firing state produce an error. These errors do not (at the moment) fail the run, but they are visible in the workflow details.

For instance:

Prometheus Alert Pending
operator-controller-memory-growth: operator-controller pod memory usage growing at a high rate for 5 minutes: 72.86kB/sec
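
In outline, the post-run check queries the Prometheus HTTP API for active alerts and turns each one into a workflow annotation. The snippet below is a minimal sketch of that idea, not the actual script in this PR; the localhost:9090 address (assuming a port-forward to the in-cluster Prometheus), the alertcheck.go name, and the exact message format are illustrative assumptions.

// alertcheck.go (hypothetical name): query Prometheus for active alerts after
// the e2e run and emit GitHub Actions annotations. Field names follow the
// Prometheus HTTP API response for /api/v1/alerts.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type alertsResponse struct {
	Data struct {
		Alerts []struct {
			Labels      map[string]string `json:"labels"`
			Annotations map[string]string `json:"annotations"`
			State       string            `json:"state"` // "pending" or "firing"
		} `json:"alerts"`
	} `json:"data"`
}

func main() {
	// Assumes the in-cluster Prometheus is reachable locally (e.g. via kubectl port-forward).
	resp, err := http.Get("http://localhost:9090/api/v1/alerts")
	if err != nil {
		fmt.Printf("::error::failed to query Prometheus: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var ar alertsResponse
	if err := json.NewDecoder(resp.Body).Decode(&ar); err != nil {
		fmt.Printf("::error::failed to decode alerts response: %v\n", err)
		os.Exit(1)
	}

	for _, a := range ar.Data.Alerts {
		msg := fmt.Sprintf("%s: %s", a.Labels["alertname"], a.Annotations["description"])
		switch a.State {
		case "firing":
			fmt.Printf("::error::%s\n", msg) // shows as an error in the workflow; the run itself still passes
		case "pending":
			fmt.Printf("::warning::%s\n", msg) // shows as a warning in the workflow
		}
	}
}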

I am not making this a required check until we have a pretty good idea of an approximate baseline.

Potential Enhancements:

  • Additional alerts, if any
  • Fine-tune the alerts and fail runs when they fire
  • Remove YAML from the script and organize it into an additional kustomization component (Done)
  • Output metrics as a mermaid XY plot in the workflow summary

Closes #1904
Closes #1905

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@dtfranz dtfranz requested a review from a team as a code owner July 8, 2025 14:50
@openshift-ci openshift-ci bot requested review from perdasilva and trgeiger July 8, 2025 14:50
@dtfranz dtfranz force-pushed the metrics-alerting branch 2 times, most recently from 97a268f to cb81424 on July 8, 2025 14:53
netlify bot commented Jul 8, 2025

Deploy Preview for olmv1 ready!

🔨 Latest commit: 7cd03f1
🔍 Latest deploy log: https://app.netlify.com/projects/olmv1/deploys/686f74ebf4b3650008306928
😎 Deploy Preview: https://deploy-preview-2081--olmv1.netlify.app

@dtfranz dtfranz force-pushed the metrics-alerting branch from cb81424 to bb8a597 on July 9, 2025 02:41
codecov bot commented Jul 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.37%. Comparing base (1333f7b) to head (7cd03f1).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2081      +/-   ##
==========================================
+ Coverage   73.35%   73.37%   +0.01%     
==========================================
  Files          77       77              
  Lines        7056     7076      +20     
==========================================
+ Hits         5176     5192      +16     
- Misses       1540     1543       +3     
- Partials      340      341       +1     
Flag Coverage Δ
e2e 44.90% <ø> (-0.08%) ⬇️
experimental-e2e 51.11% <ø> (-0.24%) ⬇️
unit 58.38% <ø> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
@dtfranz dtfranz force-pushed the metrics-alerting branch from bb8a597 to 3339f47 on July 9, 2025 07:11
Introduces an early-warning series of prometheus alerts to attempt to catch issues with performance at an early stage in development.

Signed-off-by: Daniel Franz <[email protected]>
@dtfranz dtfranz force-pushed the metrics-alerting branch from 3339f47 to 7cd03f1 on July 10, 2025 08:08
@trgeiger (Contributor) commented:

I think it looks good but does it really close #1905? I would think that issue is more specific to a future iteration of this feature where we do make the job fail if it hits certain thresholds.

@trgeiger (Contributor) commented:

One other thing is I don't have context for the thresholds you chose in the alerts. I see you mention that you don't want this to be required until we have a firmer idea of good baselines--are the current ones based on some previous work or did you just pick some decent-seeming thresholds for all the checks based on your experience? Do we need to queue up additional work to fine-tune these?

@dtfranz (Contributor, Author) commented Jul 11, 2025

Thanks for taking a look, @trgeiger!

For your first point, I agree that the issue definitely indicates that we should fail the CI, but I'm hesitant to do that at the moment without larger group buy-in. I'm happy to keep the issue open and close it after we turn on CI blocking, or close it with this PR and track a follow-up issue. As long as it's tracked, I'm happy either way.

On your second point, these values are based on my experience running the workflow many times over and checking the metrics. Up to this point, nobody (to my knowledge) has run a more thorough study of v1 performance, not counting @jianzhangbjz and his work on the downstream version of this. These changes will enable us to quickly get a better understanding and make any necessary adjustments.

@jianzhangbjz commented:

Yeah, I haven’t collected the data for the OLMv1 performance baseline yet. I’m planning to reuse https://github.com/cloud-bulldozer/orion to help identify performance issues. The current metrics are being discussed on Slack: link, and progress is being tracked here: https://issues.redhat.com/browse/OCPQE-28161

@trgeiger (Contributor) commented:

Cool, that's exactly what I wanted to know re: thresholds. And as for the issue tracking, either solution works--I just wanted to make sure the next iteration was tracked, as you stated. Keeping that issue or opening a new one both sound good to me.

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jul 11, 2025
openshift-ci bot commented Jul 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: trgeiger
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

annotations:
description: "container {{ $labels.container }} of pod {{ $labels.pod }} experienced OOM event(s); count={{ $value }}"
- alert: operator-controller-memory-growth
expr: deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:]) > 50_000
@camilamacedo86 (Contributor) commented Jul 11, 2025

@dtfranz so we are manually defining the thresholds here?
Could we document how this works in https://github.com/operator-framework/operator-controller/blob/main/docs/contribute/developer.md? WDYT?
Not a blocker for this one, for sure.

$(KUSTOMIZE) build config/prometheus | CATALOGD_SERVICE_CERT=$(shell kubectl get certificate -n olmv1-system catalogd-service-cert -o jsonpath={.spec.secretName}) envsubst '$$CATALOGD_SERVICE_CERT' | kubectl apply -f -
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator --timeout=60s
kubectl wait --for=create pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=60s
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=120s
@camilamacedo86 (Contributor) commented Jul 11, 2025

Wouldn't it be better to centralise the Prometheus installation and related configurations in the hack directory? It might help keep things more organised and easier to understand.

@camilamacedo86 (Contributor) left a comment

I'm generally okay with the approach here, and we can continue improving it step by step through follow-ups (I added just one nit). Otherwise, LGTM.

Honestly, I prefer this incremental method — it also makes it easier for others to contribute along the way. I think it would be nice if we could get a review from @tmshort as well.

Labels
lgtm Indicates that a PR is ready to be merged.
Development

Successfully merging this pull request may close these issues.

  • Select detection criteria for CI failure of e2e metrics job
  • Create/Modify upstream CI job
5 participants