
OCPBUGS-37982: Bug fix: Reduce Frequency of Update Requests for Copied CSVs #3497

Open · wants to merge 9 commits into master
Conversation

@bentito (Contributor) commented Jan 22, 2025

Description of the change:

Please check out this doc on scoping out this change: https://docs.google.com/document/d/1P4cSYEP05vDyuhBfilyuWgL5d5OOD5z7JlAyOxpPqps

In this PR we are resurrecting #3411 with the intent to fix what that PR was originally going to fix. Follow-on work will address the then-revealed problem with metadata.generation|resourceVersion, as per the doc comment by @tmshort.

Motivation for the change:
[from the linked doc, "How Did We Get Here"]

  • Original Change for Memory Optimization: Sixteen months ago, we merged a PR in OLMv0 that converted the cache of copied ClusterServiceVersions (CSVs) to use PartialObjectMetadata types instead of caching the full CSV objects. This change was crucial for memory utilization performance gains, enabling OLM to run efficiently on MicroShift, a lightweight Kubernetes distribution designed for edge computing.
  • Limited Access to Spec/Status: By using PartialObjectMetadata, we only have access to the metadata of copied CSVs, not their spec or status fields. This means the operator lacks the information needed to compare the full content of the copied CSVs with the originals.
  • Removal of “Hash and Compare” Logic: The change inadvertently removed a core piece of the “hash and compare” logic. Previously, the operator used annotations containing hashes of the non-status and status fields of the original CSV to determine if a copied CSV needed updating. These annotations were not set on the copied CSVs after the change.
  • Resulting in Excessive Updates: Without the ability to compare hashes, the operator began issuing updates for copied CSVs 100% of the time, regardless of whether they were in sync with the originals. This behavior introduced a high load on the Kubernetes API server, especially in environments with many namespaces and CSVs installed in AllNamespace mode. The increased load also led to higher audit log volumes, impacting users with increased logging costs.
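The "hash and compare" idea described above can be sketched roughly as follows. This is an illustrative stand-in, not OLM's actual implementation: hashOf and needsUpdate are hypothetical helpers, and a real implementation would serialize the CSV's spec/status deterministically before hashing.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashOf returns a hex digest of an arbitrary serialized field. Here we hash
// a plain string for illustration; OLM hashes the CSV's non-status and
// status content.
func hashOf(serialized string) string {
	sum := sha256.Sum256([]byte(serialized))
	return hex.EncodeToString(sum[:])
}

// needsUpdate compares the hash annotations stored on a copied CSV with
// freshly computed hashes of the original CSV's fields. If either differs,
// the copy is out of sync and an update request is warranted; if both match,
// no API call is needed at all.
func needsUpdate(annotations map[string]string, nonStatusHash, statusHash string) bool {
	return annotations["olm.operatorframework.io/nonStatusCopyHash"] != nonStatusHash ||
		annotations["olm.operatorframework.io/statusCopyHash"] != statusHash
}

func main() {
	copied := map[string]string{
		"olm.operatorframework.io/nonStatusCopyHash": hashOf("spec-v1"),
		"olm.operatorframework.io/statusCopyHash":    hashOf("status-v1"),
	}
	// Copy in sync with the original: no update issued.
	fmt.Println(needsUpdate(copied, hashOf("spec-v1"), hashOf("status-v1"))) // false
	// Original's status changed: update needed.
	fmt.Println(needsUpdate(copied, hashOf("spec-v1"), hashOf("status-v2"))) // true
}
```

Without the annotations on the copied CSVs, this comparison always failed, which is why every sync issued an update.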

Architectural changes:

  • Reintroducing Annotations: A proposed fix PR adds back the annotations olm.operatorframework.io/nonStatusCopyHash and olm.operatorframework.io/statusCopyHash to the copied CSVs. These annotations store hashes of the non-status and status fields of the original CSV, respectively.
  • Reducing Unnecessary Updates: By comparing these hashes, the operator can determine if the copied CSVs are out of sync with the originals and only issue updates when necessary. This reduces the frequency of update requests to the API server and lowers audit log volumes.
  • Uncovering a New Bug: However, reintroducing the hash comparison logic will uncover a bug due to the use of PartialObjectMetadata for caching copied CSVs. Since we only have access to metadata, if a user manually modifies the spec or status of a copied CSV without changing the hash annotations, the operator cannot detect the change. The operator would incorrectly assume the copied CSV is in sync with the original, leading to potential inconsistencies.

Testing remarks:

Aside from the expected changes around the inability to track copied CSV changes made by a user, we should be careful to test the following:

  • Cannot Revert Memory Optimization: Reverting the changes that introduced PartialObjectMetadata caching is not feasible. The memory optimization is critical for running OLM on MicroShift and supporting edge computing scenarios where resources are constrained.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

everettraven and others added 7 commits February 11, 2025 09:24
by adding annotations to copied CSVs that are populated with
hashes of the non-status fields and the status fields.

This seems to be how this was intended to work, but was not actually
working this way because the annotations never actually existed on the
copied CSV. This resulted in a hot loop of update requests being made
on all copied CSVs.

Signed-off-by: everettraven <[email protected]>
Code Changes:

  • Annotation Consistency: We unconditionally set the non-status-hash annotation on the in-memory CSV object, ensuring that prototype.Annotations["olm.operatorframework.io/nonStatusCopyHash"] always matches the final state, even if the existing CSV already matched.
  • Multi-step Updates: We now issue a separate “normal” Update call after an UpdateStatus call if the CSV’s status hash differs. This keeps the statusCopyHashAnnotation in sync with the actual .status and avoids stale annotation data.
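The multi-step ordering matters because a status-subresource update does not persist metadata changes. A minimal sketch with a hypothetical client (the csv and client types here are illustrative stand-ins, not OLM's actual Kubernetes types):

```go
package main

import "fmt"

// csv is a pared-down stand-in for a ClusterServiceVersion.
type csv struct {
	Annotations map[string]string
	Status      string
}

// client is a stand-in for the typed Kubernetes client used by the operator.
type client struct{ stored csv }

// UpdateStatus writes only the status subresource; annotation changes made
// on the same in-memory object are NOT persisted by this call.
func (c *client) UpdateStatus(in csv) { c.stored.Status = in.Status }

// Update writes the main resource (metadata/spec), leaving status alone.
func (c *client) Update(in csv) { c.stored.Annotations = in.Annotations }

func main() {
	c := &client{stored: csv{Annotations: map[string]string{}}}
	desired := csv{
		Annotations: map[string]string{"olm.operatorframework.io/statusCopyHash": "h2"},
		Status:      "updated",
	}
	// Step 1: sync the status subresource.
	c.UpdateStatus(desired)
	// Step 2: a separate "normal" Update persists the refreshed status-hash
	// annotation, so the annotation and the actual .status stay in agreement.
	c.Update(desired)
	fmt.Println(c.stored.Status, c.stored.Annotations["olm.operatorframework.io/statusCopyHash"])
}
```

Skipping step 2 would leave a stale statusCopyHash annotation on the copy, defeating the comparison on the next sync.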

Test Changes:
  • Expected Actions: Each test case now expects the exact create/update/updateStatus calls (and any subsequent update) that the refactored copyToNamespace emits. This includes:
    1. Creating a CSV if none exists,
    2. Updating the non-status annotation if it changed,
    3. Updating the .status subresource if the status hash changed, and
    4. Issuing a follow-up metadata update for the new status-hash annotation.
  • Fake Lister:
      • Even though the code already called copiedCSVLister.Namespace(ns), our old tests didn’t exercise or strictly verify that part of the interface. As we expanded and refined the tests, particularly around existing vs. non-existing CSV scenarios, we triggered code paths that call .Namespace(ns).Get(...), exposing the incomplete fake.
      • Better Coverage: By adding a fully implemented fake lister (including List, Get, and Namespace(...)), the new tests accurately reflect the real OLM flow and properly simulate how the operator queries for existing CSVs in a specific namespace.
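A fully implemented fake lister can be as small as the following sketch; the obj, fakeLister, and namespacedLister types are illustrative approximations of the lister interface shape, not OLM's actual lister types:

```go
package main

import (
	"errors"
	"fmt"
)

// obj stands in for a copied CSV's metadata.
type obj struct{ Namespace, Name string }

// fakeLister indexes objects by namespace and name, implementing the
// List / Namespace(...).Get(...) flow the tests exercise.
type fakeLister struct{ objs []obj }

func (f *fakeLister) List() []obj { return f.objs }

func (f *fakeLister) Namespace(ns string) *namespacedLister {
	return &namespacedLister{parent: f, ns: ns}
}

type namespacedLister struct {
	parent *fakeLister
	ns     string
}

// Get mirrors the lister call copyToNamespace makes to find an existing copy.
func (n *namespacedLister) Get(name string) (obj, error) {
	for _, o := range n.parent.objs {
		if o.Namespace == n.ns && o.Name == name {
			return o, nil
		}
	}
	return obj{}, errors.New("not found")
}

func main() {
	l := &fakeLister{objs: []obj{{Namespace: "ns-a", Name: "my-csv"}}}
	if o, err := l.Namespace("ns-a").Get("my-csv"); err == nil {
		fmt.Println("found", o.Namespace, o.Name)
	}
	if _, err := l.Namespace("ns-b").Get("my-csv"); err != nil {
		fmt.Println("missing in ns-b")
	}
}
```

The important property is that Get returns a not-found error for namespaces without a copy, which drives the create path in copyToNamespace.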

Signed-off-by: Brett Tofel <[email protected]>
} else {
// Even if they're the same, ensure the returned prototype is annotated.
prototype.Annotations[statusCopyHashAnnotation] = status
updated = prototype
}
@camilamacedo86 (Contributor) commented Feb 12, 2025:
From the code implemented in this PR to the current state, the main addition seems to be this else block (beyond tests).

I’m not entirely sure I fully understand—are we also looking to implement what’s outlined in the Proposed Fixes section of this document? How/where are we addressing the concerns raised in the: Why don’t we just merge the [fix PR](https://github.com/operator-framework/operator-lifecycle-manager/pull/3411) as-is? section?

@bentito (Contributor, Author) replied:
This is a first pass: basically, just merge the old PR. With this PR we're taking path #4 in the scoping doc: merge the PR with some possible problems; they should be a minor use case (users changing the copied CSVs).

But the else block is not the only thing done here; the main thing added is the tracking hashes, so we can tell what's in need of update.

@tmshort (Contributor) left a comment:

I think there might be some simplification that can be done with the setting of the status/nonstatus annotations.

Reviewer comment (Contributor):

I'm assuming all the changes here are due to lint?

@bentito (Contributor, Author) replied:

Yeah, and I just ran make lint locally to make sure nothing changed; nothing changed.

@@ -803,6 +808,7 @@ func (a *Operator) copyToNamespace(prototype *v1alpha1.ClusterServiceVersion, ns

existing, err := a.copiedCSVLister.Namespace(nsTo).Get(prototype.GetName())
if apierrors.IsNotFound(err) {
prototype.Annotations[nonStatusCopyHashAnnotation] = nonstatus
Reviewer comment (Contributor):

Because copyToNamespace is called in a loop, prototype, being a pointer, is reused multiple times, which means that these annotations may already be set. Is there any reason why these annotations simply aren't set in ensureCSVsInNamespaces(), where the hashes are calculated?

@bentito (Contributor, Author) replied:

Good point, possibly. Checking...

@bentito (Contributor, Author) followed up:

So looking at it closer, it seems like we shouldn't change it. Here's my reasoning:

Keeping the annotation logic here, in copyToNamespace(), encapsulates the update semantics so each call handles its own CSV's state reliably.

We're reusing prototype and accounting for possibly-set annotations. If we move the logic to ensureCSVsInNamespaces(), we'll have to duplicate the annotation-checking logic, because the logic for handling those annotations is tightly coupled with the CSV’s create/update lifecycle.

In copyToNamespace() we need to:
• Distinguish between a new creation (where the annotations don’t exist yet) and an update (where the annotations might already be set but could be outdated).
• Apply the updates in a specific order (first updating the non-status hash, then the status hash, including a status update to avoid mismatches).
• Ensure that each target CSV reflects the current state as expected for that specific namespace.

Aside from the hash handling, we'd still need to do the above work in copyToNamespace().
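The per-namespace decision flow described above can be sketched as follows. This is a simplified illustration with hypothetical helpers; copyToNamespaceSketch only reports which API calls would be made, standing in for the real client and hashing code:

```go
package main

import "fmt"

const (
	nonStatusCopyHashAnnotation = "olm.operatorframework.io/nonStatusCopyHash"
	statusCopyHashAnnotation    = "olm.operatorframework.io/statusCopyHash"
)

// copyToNamespaceSketch decides which API calls a single namespace needs.
// existing is nil when the lister returned NotFound for that namespace.
func copyToNamespaceSketch(existing map[string]string, nonstatus, status string) []string {
	if existing == nil {
		// New copy: create it, then sync its status subresource.
		return []string{"create", "updateStatus"}
	}
	var calls []string
	if existing[nonStatusCopyHashAnnotation] != nonstatus {
		// Non-status fields drifted: update the main resource first.
		calls = append(calls, "update")
	}
	if existing[statusCopyHashAnnotation] != status {
		// Status drifted: update the subresource, then refresh the
		// annotation with a follow-up metadata update.
		calls = append(calls, "updateStatus", "update")
	}
	return calls
}

func main() {
	fmt.Println(copyToNamespaceSketch(nil, "h1", "h2"))
	inSync := map[string]string{nonStatusCopyHashAnnotation: "h1", statusCopyHashAnnotation: "h2"}
	fmt.Println(copyToNamespaceSketch(inSync, "h1", "h2")) // no calls: copy is in sync
}
```

The in-sync case emitting no calls at all is the whole point of the fix: previously every namespace hit the update path on every sync.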
