Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 mitigate upgrade-e2e test flakiness #1627

Merged
merged 2 commits into from
Jan 17, 2025

Conversation

perdasilva
Copy link
Contributor

@perdasilva perdasilva commented Jan 15, 2025

Description

  • Mitigates the upgrade-e2e test flakiness caused by 🐛 catalog status is not updated during re-unpack after pod reset #1626 by waiting for the catalog's last unpacked to be after the catalogd deployment pod's creation time
  • Updates the upgrade-e2e test to also wait for the catalogd deployment to be up
  • Factors out the artifact collection code into a utils package and adds artifact collection to upgrade-e2e

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2025
Copy link

netlify bot commented Jan 15, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit a1a9777
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/678a2e03b446e60008b7b419
😎 Deploy Preview https://deploy-preview-1627--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@perdasilva perdasilva changed the title 🐛 Mitigate upgrde2e 🐛 mitigate upgrade-e2e test flakiness Jan 15, 2025
Copy link

codecov bot commented Jan 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.72%. Comparing base (c8380d2) to head (a1a9777).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1627      +/-   ##
==========================================
+ Coverage   66.70%   66.72%   +0.02%     
==========================================
  Files          57       57              
  Lines        4595     4595              
==========================================
+ Hits         3065     3066       +1     
+ Misses       1304     1303       -1     
  Partials      226      226              
Flag Coverage Δ
e2e 53.16% <ø> (-0.09%) ⬇️
unit 53.68% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@perdasilva perdasilva marked this pull request as ready for review January 15, 2025 16:50
@perdasilva perdasilva requested a review from a team as a code owner January 15, 2025 16:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2025
@LalatenduMohanty
Copy link
Member

We have #1550 for this issue as well

Copy link
Contributor

@azych azych left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

test/upgrade-e2e/post_upgrade_test.go Outdated Show resolved Hide resolved
camilamacedo86
camilamacedo86 previously approved these changes Jan 16, 2025
Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with the changes
Thank you a lot for this contribution
+1 in the nit raised by @azych
we might want to address it follow and it across the whole project

@perdasilva could you please add in the description that this PR closes: #1550

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2025
@LalatenduMohanty
Copy link
Member

@perdasilva How can I test this PR locally on my laptop? I have local kind setup created usng make run.

@LalatenduMohanty
Copy link
Member

I deleted the kind setup and $ make test-upgrade-e2e should test this PR.

@perdasilva
Copy link
Contributor Author

perdasilva commented Jan 16, 2025

@perdasilva How can I test this PR locally on my laptop? I have local kind setup created usng make run.

make test-upgrade-e2e will exercise the upgrade test

if you want to step through it, it also has the recipe for it:

.PHONY: test-upgrade-e2e
test-upgrade-e2e: KIND_CLUSTER_NAME := operator-controller-upgrade-e2e
test-upgrade-e2e: export TEST_CLUSTER_CATALOG_NAME := test-catalog
test-upgrade-e2e: export TEST_CLUSTER_EXTENSION_NAME := test-package
test-upgrade-e2e: kind-cluster run-latest-release image-registry pre-upgrade-setup docker-build kind-load kind-deploy post-upgrade-checks kind-clean #HELP Run upgrade e2e tests on a local kind cluster

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2025
@LalatenduMohanty
Copy link
Member

I could reproduce the issue without your PR, however when I tested your PR it failed with

=== RUN   TestClusterExtensionAfterOLMUpgrade
    post_upgrade_test.go:31: Starting checks after OLM upgrade
    post_upgrade_test.go:36: Wait for catalogd deployment to be ready
    post_upgrade_test.go:121: Checking that the deployment is updated
    post_upgrade_test.go:137: Waiting for only one Pod to remain
    post_upgrade_test.go:40: Wait for operator-controller deployment to be ready
    post_upgrade_test.go:121: Checking that the deployment is updated
    post_upgrade_test.go:137: Waiting for only one Pod to remain
    post_upgrade_test.go:43: Reading logs to make sure that ClusterExtension was reconciled by operator-controller before we update it
    post_upgrade_test.go:55: Checking that the ClusterCatalog is unpacked
    post_upgrade_test.go:56: 
        	Error Trace:	/home/lmohanty/code/github.com/OLM-Upstream/operator-controller/test/upgrade-e2e/post_upgrade_test.go:62
        	            				/usr/local/go/src/runtime/asm_amd64.s:1700
        	Error:      	Expected nil, but got: &v1.Condition{Type:"Serving", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.January, 16, 7, 42, 20, 0, time.Local), Reason:"Available", Message:"Serving desired content from resolved source"}
    post_upgrade_test.go:56: 
        	Error Trace:	/home/lmohanty/code/github.com/OLM-Upstream/operator-controller/test/upgrade-e2e/post_upgrade_test.go:72
        	            				/usr/local/go/src/runtime/asm_amd64.s:1700
        	Error:      	Expected nil, but got: &v1.Condition{Type:"Progressing", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.January, 16, 7, 42, 20, 0, time.Local), Reason:"Succeeded", Message:"Successfully unpacked and stored content from resolved source"}
    post_upgrade_test.go:56: 
        	Error Trace:	/home/lmohanty/code/github.com/OLM-Upstream/operator-controller/test/upgrade-e2e/post_upgrade_test.go:56
        	Error:      	Condition never satisfied
        	Test:       	TestClusterExtensionAfterOLMUpgrade
--- FAIL: TestClusterExtensionAfterOLMUpgrade (66.42s)
FAIL
FAIL	github.com/operator-framework/operator-controller/test/upgrade-e2e	66.450s
FAIL
make: *** [Makefile:228: post-upgrade-checks] Error 1

@perdasilva
Copy link
Contributor Author

I could reproduce the issue without your PR, however when I tested your PR it failed with

=== RUN   TestClusterExtensionAfterOLMUpgrade
    post_upgrade_test.go:31: Starting checks after OLM upgrade
    post_upgrade_test.go:36: Wait for catalogd deployment to be ready
    post_upgrade_test.go:121: Checking that the deployment is updated
    post_upgrade_test.go:137: Waiting for only one Pod to remain
    post_upgrade_test.go:40: Wait for operator-controller deployment to be ready
    post_upgrade_test.go:121: Checking that the deployment is updated
    post_upgrade_test.go:137: Waiting for only one Pod to remain
    post_upgrade_test.go:43: Reading logs to make sure that ClusterExtension was reconciled by operator-controller before we update it
    post_upgrade_test.go:55: Checking that the ClusterCatalog is unpacked
    post_upgrade_test.go:56: 
        	Error Trace:	/home/lmohanty/code/github.com/OLM-Upstream/operator-controller/test/upgrade-e2e/post_upgrade_test.go:62
        	            				/usr/local/go/src/runtime/asm_amd64.s:1700
        	Error:      	Expected nil, but got: &v1.Condition{Type:"Serving", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.January, 16, 7, 42, 20, 0, time.Local), Reason:"Available", Message:"Serving desired content from resolved source"}
    post_upgrade_test.go:56: 
        	Error Trace:	/home/lmohanty/code/github.com/OLM-Upstream/operator-controller/test/upgrade-e2e/post_upgrade_test.go:72
        	            				/usr/local/go/src/runtime/asm_amd64.s:1700
        	Error:      	Expected nil, but got: &v1.Condition{Type:"Progressing", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.January, 16, 7, 42, 20, 0, time.Local), Reason:"Succeeded", Message:"Successfully unpacked and stored content from resolved source"}
    post_upgrade_test.go:56: 
        	Error Trace:	/home/lmohanty/code/github.com/OLM-Upstream/operator-controller/test/upgrade-e2e/post_upgrade_test.go:56
        	Error:      	Condition never satisfied
        	Test:       	TestClusterExtensionAfterOLMUpgrade
--- FAIL: TestClusterExtensionAfterOLMUpgrade (66.42s)
FAIL
FAIL	github.com/operator-framework/operator-controller/test/upgrade-e2e	66.450s
FAIL
make: *** [Makefile:228: post-upgrade-checks] Error 1

yeah - seems to be due to some recent changes addressing some reviewer comments. Maybe we're finding out why !assert.NotNil was used...?

@@ -101,16 +101,47 @@ func TestClusterExtensionAfterOLMUpgrade(t *testing.T) {
require.EventuallyWithT(t, func(ct *assert.CollectT) {
assert.NoError(ct, c.Get(ctx, types.NamespacedName{Name: testClusterExtensionName}, &clusterExtension))
cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeInstalled)
if !assert.NotNil(ct, cond) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will never return false - either it returns true or fails

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I agree.

@perdasilva
Copy link
Contributor Author

@azych I've changed it to assert.NotNil looking at the impl. it seem we either get true or a failure. So, I don't think the if made sense to begin with...?

@camilamacedo86
Copy link
Contributor

Hi @perdasilva,

I liked the other improvements you made. However, regarding the issue, I wanted to clarify something:

It seems like we are now allocating 1 minute for the load and 1 minute for the update. Wouldn’t that effectively result in a total increase of 2 minutes overall? Or did I misunderstand something?

Looking forward to your thoughts!

@perdasilva
Copy link
Contributor Author

perdasilva commented Jan 16, 2025

Hi @perdasilva,

I liked the other improvements you made. However, regarding the issue, I wanted to clarify something:

It seems like we are now allocating 1 minute for the load and 1 minute for the update. Wouldn’t that effectively result in a total increase of 2 minutes overall? Or did I misunderstand something?

Looking forward to your thoughts!

@camilamacedo86 Do you mean because we now wait for the catalogd deployment?
Aside from that, which I think is warranted after the mono-repo changes, I don't think there's any change in the time allocation itself. What this change does is ensure that the catalog has been unpacked and is serving before we wait for the update. I think this is still has the same intention as the original test. It's just that the guard for unpack would pass really quickly, since the conditions on the catalog don't change from the installation step. Does that make sense?

Anecdotally, looking at a few upgrade-e2e with and without this PR, there doesn't seem to be any meaningful change in runtimes

Copy link
Member

@LalatenduMohanty LalatenduMohanty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2025
@LalatenduMohanty
Copy link
Member

/hold for the #1627 (comment) . Feel free to remove the hold.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2025
@azych
Copy link
Contributor

azych commented Jan 16, 2025

@azych I've changed it to assert.NotNil looking at the impl. it seem we either get true or a failure. So, I don't think the if made sense to begin with...?

👍 I think your current change makes sense looking at the intention

@camilamacedo86
Copy link
Contributor

HI @LalatenduMohanty

Can we unhold this one?
Can we get it merged?

It is
/lgtm for me !!!

@LalatenduMohanty
Copy link
Member

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2025
@LalatenduMohanty LalatenduMohanty added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@camilamacedo86 camilamacedo86 added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@camilamacedo86 camilamacedo86 added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@camilamacedo86 camilamacedo86 added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@perdasilva perdasilva added this pull request to the merge queue Jan 17, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 17, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2025
Per Goncalves da Silva added 2 commits January 17, 2025 11:16
Signed-off-by: Per Goncalves da Silva <[email protected]>
Signed-off-by: Per Goncalves da Silva <[email protected]>
@camilamacedo86
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2025
@camilamacedo86 camilamacedo86 added this pull request to the merge queue Jan 17, 2025
Merged via the queue into operator-framework:main with commit 7446060 Jan 17, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants