Investigate Flakiness in Post-Upgrade Test Timing #1550
Comments
I looked into this a little bit locally, since I don't have permissions to run jobs here. This is what I found:
Also, interesting point about resource sharing. Looking briefly at the docs, they mention that each GitHub-hosted runner is a separate VM, but I'm not sure what type of runners we use.
@azych is spot on. The issue is that the cache gets nuked and there's no feedback in the catalog status to say it's being rebuilt and that the content is therefore unavailable. So none of the guards checking catalog status actually hold the upgrade check back until the catalog is live again. I've created a ticket to track this (#1626) and added a mitigation to remove the flakiness (#1627).
The post-upgrade E2E test at post_upgrade_test.go#L101-L112 appears to be experiencing flakiness due to timing issues. Currently, the test has a 1-minute timeout, which should theoretically be sufficient for the operations it validates. However, under certain conditions, it seems the test may fail due to delays in reconciliation or related operations.
Expected Behavior
The test should complete successfully within the allocated 1-minute timeout.
Observed Behavior
The test sometimes fails, likely due to timing issues during the post-upgrade process.
Analysis
Without additional context, it appears that the 1-minute timeout should be sufficient to cover the following steps:
helm upgrade
without waiting for object health/readiness).

A potential root cause is hitting exponential backoff on cache repopulation, where an extra reconciliation attempt increases the backoff time, possibly exceeding the 1-minute threshold.
Steps to Reproduce
Suggested Next Steps
More info: #1548