Investigate Flakiness in Post-Upgrade Test Timing #1550

Open · camilamacedo86 opened this issue Jan 7, 2025 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

camilamacedo86 (Contributor) commented Jan 7, 2025

The post-upgrade E2E test at post_upgrade_test.go#L101-L112 appears to be flaky due to timing. The test currently has a 1-minute timeout, which should in theory be sufficient for the operations it validates, yet under certain conditions it fails, likely because of delays in reconciliation or related operations.

Expected Behavior

The test should complete successfully within the allocated 1-minute timeout.

Observed Behavior

The test sometimes fails, likely due to timing issues during the post-upgrade process.

Analysis

Without additional context, it appears that the 1-minute timeout should be sufficient to cover the following steps:

  1. Catalog cache repopulation (potentially async, initiated after OLM upgrade).
  2. **Reconciliation of the cluster extension**, which may involve exponential backoff while the catalog cache repopulates.
  3. Upgrade process:
    • Resolving and running the upgrade after cache repopulation.
    • Dry-running the upgrade to generate the desired manifest.
    • Running preflight checks.
    • If the checks pass, the actual upgrade will be run (similar to a helm upgrade without waiting for object health/readiness).

A potential root cause is hitting exponential backoff while the cache repopulates: each additional failed reconciliation attempt increases the backoff delay, so the accumulated wait can exceed the 1-minute threshold. The sketch below illustrates how quickly that can add up.
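
For illustration only (not project code), here is a rough standalone sketch of how reconcile backoff can accumulate, assuming the default client-go workqueue exponential failure rate limiter (5ms base delay, doubling per failure, capped at 1000s) and assuming every reconcile fails while the catalog cache is repopulating; whether operator-controller uses exactly these defaults is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Rough illustration only: cumulative reconcile backoff under an exponential
// rate limiter with a 5ms base delay that doubles per failure (capped at
// 1000s), assuming every reconcile fails while the cache rebuilds.
func main() {
	const (
		baseDelay = 5 * time.Millisecond
		maxDelay  = 1000 * time.Second
	)
	var total time.Duration
	for failures := 1; failures <= 15; failures++ {
		delay := baseDelay << uint(failures-1)
		if delay > maxDelay {
			delay = maxDelay
		}
		total += delay
		fmt.Printf("failure %2d: next delay %-12v cumulative %v\n", failures, delay, total)
	}
}
```

Under these assumptions, by roughly the 13th or 14th consecutive failure a single backoff delay is already in the 20-40 second range and the cumulative wait passes one minute, which would be enough to trip the current timeout on its own.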

Steps to Reproduce

  1. Push a PR to trigger the CI test run.
  2. Before that run finishes, push another change so that more than one GitHub Actions workflow is running at the same time. Under that extra load the test often fails, which makes the flake observable.

Suggested Next Steps

  • Investigate whether exponential backoff during cache repopulation is causing the delay.
  • Add logging to identify where the test spends its time and which step(s) contribute most to the timeout (see the sketch after this list).
  • Consider increasing the timeout to accommodate edge cases, or optimizing the underlying operations.
  • Determine whether additional retries or adjustments to the reconciliation backoff logic are needed.
  • Check whether GitHub Actions shares resources across the VMs it creates for concurrent runs (unconfirmed).
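
A minimal sketch of the logging and timeout suggestions above, using Ginkgo/Gomega. This is not the actual test code: the package name, assertion body, and durations are placeholders.

```go
package e2e

import (
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = It("passes the post-upgrade checks", func() {
	start := time.Now()
	// Widen the window beyond 1 minute and poll every second so a slow
	// cache repopulation plus backoff does not trip the test.
	Eventually(func(g Gomega) {
		// The existing post-upgrade assertions would go here, e.g.
		// checking the ClusterExtension's status conditions.
	}).WithTimeout(2 * time.Minute).WithPolling(time.Second).Should(Succeed())
	// Log the elapsed time so CI runs show how close each run is to the limit.
	GinkgoWriter.Printf("post-upgrade checks passed after %v\n", time.Since(start))
})
```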

More info: #1548

@camilamacedo86 camilamacedo86 added kind/documentation Categorizes issue or PR as related to documentation. kind/bug Categorizes issue or PR as related to a bug. and removed kind/documentation Categorizes issue or PR as related to documentation. labels Jan 7, 2025
azych (Contributor) commented Jan 15, 2025

I looked into this a little bit locally, since I don't have permissions to run jobs here.

This is what I found:

  • cache repopulation takes roughly 20-25 seconds from the first attempt until it actually succeeds
  • reconciliation itself finishes in under 1 second
  • the entire test case took around 45-50 seconds, so I can imagine it cuts things pretty close in CI under additional load
  • apart from the backoff itself, cache repopulation seems to contribute the majority of the time

Also, interesting point about resource sharing: looking briefly at the docs, they mention that each GitHub-hosted runner is a separate VM, but I'm not sure which type of runners we use.

perdasilva (Contributor) commented
@azych is spot on. The issue is that the cache gets wiped and there is no feedback in the catalog status to say it is being rebuilt and that its content is therefore unavailable. As a result, the guards that check catalog status don't hold the upgrade check back until the catalog is serving again. I've created #1626 to track this and #1627 with a mitigation to remove the flakiness.
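
For illustration only, and not the actual change in #1627: a minimal sketch of the kind of guard described above. It assumes the ClusterCatalog exposes a `Serving` condition and that the catalogd import path and helper signature below match the test's setup; all of that is assumption.

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	catalogd "github.com/operator-framework/catalogd/api/core/v1alpha1"
)

// waitForCatalogServing blocks until the named ClusterCatalog reports a true
// "Serving" condition, so post-upgrade assertions do not race the cache
// rebuild. The condition type and the catalogd import path are assumptions.
func waitForCatalogServing(ctx context.Context, c client.Client, name string) {
	Eventually(func(g Gomega) {
		var catalog catalogd.ClusterCatalog
		g.Expect(c.Get(ctx, types.NamespacedName{Name: name}, &catalog)).To(Succeed())
		g.Expect(apimeta.IsStatusConditionTrue(catalog.Status.Conditions, "Serving")).To(BeTrue())
	}).WithTimeout(time.Minute).WithPolling(time.Second).Should(Succeed())
}
```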
