-
Notifications
You must be signed in to change notification settings - Fork 66
✨ Performance Alerting #2081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Performance Alerting #2081
Conversation
97a268f
to
cb81424
Compare
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2081 +/- ##
==========================================
+ Coverage 73.35% 73.45% +0.09%
==========================================
Files 77 78 +1
Lines 7056 7240 +184
==========================================
+ Hits 5176 5318 +142
- Misses 1540 1569 +29
- Partials 340 353 +13
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3339f47
to
7cd03f1
Compare
I think it looks good but does it really close #1905? I would think that issue is more specific to a future iteration of this feature where we do make the job fail if it hits certain thresholds. |
One other thing is I don't have context for the thresholds you chose in the alerts. I see you mention that you don't want this to be required until we have a firmer idea of good baselines--are the current ones based on some previous work or did you just pick some decent-seeming thresholds for all the checks based on your experience? Do we need to queue up additional work to fine-tune these? |
Thanks for taking a look @trgeiger ! For your first point, I agree that the issue definitely indicates that we should fail the CI but I'm hesitant to do that at the moment without larger group buy-in. I'm happy to keep the issue open and close it after we turn on ci blocking, or close it with this PR and track a follow-up issue. As long as it's tracked I'm happy either way. On your second point, these values are based on my experience running the workflow many times over and checking the metrics. Up till this point, nobody (to my knowledge) has run a more thorough study of v1 performance, not counting @jianzhangbjz and his work on the downstream version of this. These changes will enable to us to quickly get a better understanding and make any necessary adjustments. |
Yeah, I haven’t collected the data for the OLMv1 performance baseline yet. I’m planning to reuse https://github.com/cloud-bulldozer/orion to help identify performance issues. The current metrics are being discussed on Slack: link, and progress is being tracked here: https://issues.redhat.com/browse/OCPQE-28161 |
Cool, that's exactly what I wanted to know re: thresholds. And as for the issue tracking, either solution works--I just wanted to make sure the next iteration was tracked, as you stated. Keeping that issue or opening a new one both sound good to me. |
annotations: | ||
description: "container {{ $labels.container }} of pod {{ $labels.pod }} experienced OOM event(s); count={{ $value }}" | ||
- alert: operator-controller-memory-growth | ||
expr: deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:]) > 50_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dtfranz so we are manually defining the trashholders here?
Could we doc how it works in the https://github.com/operator-framework/operator-controller/blob/main/docs/contribute/developer.md ? WDYT?
Not a blocker for this one for sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about adding notes to explain these specific queries? I'm happy to do that somewhere if so, but if you mean explaining how to adjust/create rules I'd prefer to link to the official prometheus docs.
$(KUSTOMIZE) build config/prometheus | CATALOGD_SERVICE_CERT=$(shell kubectl get certificate -n olmv1-system catalogd-service-cert -o jsonpath={.spec.secretName}) envsubst '$$CATALOGD_SERVICE_CERT' | kubectl apply -f - | ||
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator --timeout=60s | ||
kubectl wait --for=create pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=60s | ||
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=120s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to centralise the Prometheus installation and related configurations in the hack directory? It might help keep things more organised and easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I would prefer it to be part of the existing e2e manifests, since this is something we are planning to do for our e2e's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script was getting way too big, and only had one or two operations that justified it as a script. The prometheus yaml is way more readable and maintainable in a kustomization manifest, IMO.
@tmshort I initially wanted to add these manifests to the e2e collection as you mentioned, but the catalogd certificate generated by certmanager is named catalogd-service-cert-v1.3.0-68-g7cd03f1-dirty
(at least, for me), which doesn't give me confidence that I can just hard-code it and hope it keeps working. Unless you know what exactly that name comes from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured out the secret name stuff but ran into other issues; see here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change it for something like:
Makefile
.PHONY: prometheus
PROMETHEUS_NAMESPACE := olmv1-system
PROMETHEUS_VERSION := v0.83.0
prometheus:
./hack/promotheus/deploy-prometheus.sh $(PROMETHEUS_NAMESPACE) $(PROMETHEUS_VERSION)
#!/bin/bash
set -euo pipefail
PROMETHEUS_NAMESPACE="$1"
PROMETHEUS_VERSION="$2"
KUSTOMIZE="${KUSTOMIZE:-kustomize}"
VERSION="${VERSION:-latest}"
TMPDIR="$(mktemp -d)"
trap 'echo "Cleaning up $TMPDIR"; rm -rf "$TMPDIR"' EXIT
echo "Downloading Prometheus resources..."
curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/${PROMETHEUS_VERSION}/kustomization.yaml" > "${TMPDIR}/kustomization.yaml"
curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/${PROMETHEUS_VERSION}/bundle.yaml" > "${TMPDIR}/bundle.yaml"
echo "Patching namespace to ${PROMETHEUS_NAMESPACE}..."
(cd "$TMPDIR" && $KUSTOMIZE edit set namespace "$PROMETHEUS_NAMESPACE")
echo "Applying Prometheus base..."
kubectl create -k "$TMPDIR"
echo "Waiting for Prometheus Operator pods to become ready..."
kubectl wait --for=condition=Ready pods -n "$PROMETHEUS_NAMESPACE" -l app.kubernetes.io/name=prometheus-operator
echo "Applying overlay config..."
$KUSTOMIZE build config/overlays/prometheus | sed "s/cert-git-version/cert-${VERSION}/g" | kubectl apply -f -
echo "Rechecking readiness..."
kubectl wait --for=condition=Ready pods -n "$PROMETHEUS_NAMESPACE" -l app.kubernetes.io/name=prometheus-operator --timeout=60s
kubectl wait --for=create pods -n "$PROMETHEUS_NAMESPACE" prometheus-prometheus-0 --timeout=60s
kubectl wait --for=condition=Ready pods -n "$PROMETHEUS_NAMESPACE" prometheus-prometheus-0 --timeout=120s
echo "Prometheus deployment completed successfully."
There is too much in the Makefile, which is starting to become hard for us to understand and maintain.
IHMO the logic to install and check would better fit in a shell
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally okay with the approach here, and we can continue improving it step by step through follow-ups. (I added just one nit). Otherwise, LGTM
Honestly, I prefer this incremental method — it also makes it easier for others to contribute along the way. I think it would be nice if we could get a review from @tmshort as well.
/hold |
$(KUSTOMIZE) build config/prometheus | CATALOGD_SERVICE_CERT=$(shell kubectl get certificate -n olmv1-system catalogd-service-cert -o jsonpath={.spec.secretName}) envsubst '$$CATALOGD_SERVICE_CERT' | kubectl apply -f - | ||
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator --timeout=60s | ||
kubectl wait --for=create pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=60s | ||
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=120s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I would prefer it to be part of the existing e2e manifests, since this is something we are planning to do for our e2e's.
Makefile
Outdated
curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/$(PROMETHEUS_VERSION)/bundle.yaml" > "$(TMPDIR)/bundle.yaml"; \ | ||
(cd $(TMPDIR) && $(KUSTOMIZE) edit set namespace $(PROMETHEUS_NAMESPACE)) && kubectl create -k "$(TMPDIR)" | ||
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator | ||
$(KUSTOMIZE) build config/prometheus | CATALOGD_SERVICE_CERT=$(shell kubectl get certificate -n olmv1-system catalogd-service-cert -o jsonpath={.spec.secretName}) envsubst '$$CATALOGD_SERVICE_CERT' | kubectl apply -f - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this secret ought to be fixed, so you shouldn't have to extract it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This secret is generated by certmanager at runtime, after installation, so it can't be predetermined (unless you know of a way). EDIT: I've noticed the name does seem to stay as catalogd-service-cert-v1.3.0-68-g7cd03f1-dirty
for me, but without knowing how exactly that name is generated I'd rather not set it as such here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVM, I found it, I'll try and work that in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so it's possible to do the same sed "s/cert-git-version/cert-$(VERSION)/g"
for these manifests, but the next issue we face putting the manifests in config/components/e2e
is that it's not possible to put the prometheus CRDs and CRs into the same manifest that gets installed by install.tpl.sh
, otherwise you get no matches for kind "Prometheus"
errors. The prometheus-operator install and metrics yaml need to happen at different steps. That probably means adding additional content to the install.tpl.sh
script.
Unless you know a better way, I'm thinking I could add this to install.tpl.sh
:
if [[ -n "$prometheus_version" ]]; then
trap 'echo "Cleaning up $(TMPDIR)"; rm -rf "$(TMPDIR)"' EXIT; \
curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/$(PROMETHEUS_VERSION)/kustomization.yaml" > "$(TMPDIR)/kustomization.yaml"; \
curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/$(PROMETHEUS_VERSION)/bundle.yaml" > "$(TMPDIR)/bundle.yaml"; \
(cd $(TMPDIR) && $(KUSTOMIZE) edit set namespace $(PROMETHEUS_NAMESPACE)) && kubectl create -k "$(TMPDIR)"
kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator
fi
...then have the Makefile set the variable in test-e2e
here:
test-e2e: PROMETHEUS_VERSION := v0.83.0
I kind of hesitate to add anything to the install script that's for test only, though 🫤
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That name is based on $(VERSION)
as you point out, and it's for rotation purposes, and I think it's not something we ought to be doing, but it is what it is.
I agree, the install script should not have test stuff in it. It will cause problems for users, as they don't necessarily have access to the kustomize configuration files.
7cd03f1
to
a611452
Compare
a611452
to
a1bc7c2
Compare
At this point, due to the $(VERISON) naming issue with certs, I think that putting this as part of the e2e manifest is not realistic now. Perhaps if we remove that from the name of the certificate, we can re-look at that. My only hold objection is to move |
I'm a little bit confused by your suggestion here. Are you asking that I create a separate overlay for prometheus which includes nothing except for the prometheus manifests? If so, I'm not sure I see the utility of that. It would also break the convention of that folder, since the rest of those folders are meant to be applied on top of the standard manifests to create the monolithic manifest files in Or, are you suggesting that I create an additional overlay which adds the prometheus CRDs and CRs on top of the standard manifests, and puts them all into the big manifest file? If so, we will still run into the problem I mentioned here, since the CRDs and CRs cannot be applied from the same file in the same step. It also causes the installation to run into this issue, since Unless I'm still misunderstanding something, or unless you can think of solutions to the above, would you mind dropping your hold? |
I'm asking you to move the directory So, it's not a new overlay, it's just moving the existing one. The reason is that the root |
/unhold |
a1bc7c2
to
cd29968
Compare
Introduces an early-warning series of prometheus alerts to attempt to catch issues with performance at an early stage in development. Signed-off-by: Daniel Franz <[email protected]>
cd29968
to
9a97ad8
Compare
Overlay containing manifest files which enable prometheus scraping of the catalogd and operator-controller pods. Used during e2e runs to measure performance over the lifetime of the test. | ||
|
||
These manifests will not end up in the `manifests/` folder, as they must be applied in two distinct steps to avoid issues with applying prometheus CRDs and CRs simultaneously. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add here something like:
Note that the values defined for the performance checks are fixed and defined in : config/overlays/prometheus/prometheus_rule.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just trying to have it write down in some place to make easier understand it if we need afterwords
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we spoke this, and to make things easier, I’m OK with going incrementally here. We can keep the scope smaller for now and make improvements in a follow-up to this PR.
-
I think we need a shell script to deploy and verify the Prometheus installation.
I shared an example here to better explain the idea:
https://github.com/operator-framework/operator-controller/pull/2081/files#r2209548237 -
Just a small suggestion: maybe add a note in the docs (or somewhere else) to explain that the rule values are fixed and where they come from.
Reference: https://github.com/operator-framework/operator-controller/pull/2081/files#r2209551143
Let’s just make sure we don’t forget to follow up later. 😉
Also, based on @tmshort’s comment
(#2081 (comment)),
we’ll probably need another follow-up to improve the manifests too.
Since @tmshort unhold it seems fine we move forward with it now.
We should ensure that only e2e installs has Prometheus.
However, now we are generting the manifests: https://github.com/operator-framework/operator-controller/tree/main/manifests and I cannot see any required changes there. So, it seems fine 👍
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: camilamacedo86, trgeiger The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ebc7986
into
operator-framework:main
Description
Introduces an early-warning series of prometheus alerts to attempt to catch issues with performance at an early stage in development.
As the e2e tests run, the installed prometheus instance is scraping metrics from catalogd and operator-controller, and will fire alerts based on rules introduced in this PR. Since we're running these tests on the github runners which do not have consistent performance, our alerts must be based on platform-independent metrics and are therefore limited. Any other ideas for metrics to check on this PR are appreciated!
Once the e2e tests finish, prometheus is queried for active alerts. Any alerts found in
pending
state will result in a warning being set on the e2e workflow. Any alerts infiring
state will give an error. These errors do not (at the moment) fail the run, but are visible when the workflow details are viewed.For instance:
I am not making this a required check until we have a pretty good idea of an approximate baseline.
Potential Enhancements:
Remove yaml from script and organize into an additional kustomization componentDoneCloses #1904
Closes #1905
Reviewer Checklist