feat: Update controller logic to handle stale SriovNetworkNodeState CRs with delay #798

ykulazhenkov · 2024-10-28T17:36:00Z

Update controller logic to handle stale SriovNetworkNodeState CRs with delay

Changed the logic in the sriov-network-operator controller to handle stale SriovNetworkNodeState CRs (those with no matching Nodes with daemon).
Introduced a delay (30 minutes by default) before removing stale state CRs to manage scenarios where the user temporarily removes the daemon from the node but does not want to lose the state stored in the SriovNetworkNodeState.
Added the STALE_NODE_STATE_CLEANUP_DELAY_MINUTES environment variable to configure the required delay in minutes (default is 30 minutes).

This functionality is especially useful when the OFED container is in use. As the OFED driver loads on the host, the sriov-config-daemon is removed from this node (achieved using the configDaemon nodeselector). Since loading the driver can take a considerable amount of time, we want to ensure that the SriovNetworkNodeState is not lost during this process.
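
A minimal sketch of the delayed-cleanup decision described above, not the PR's actual code. It assumes the removal deadline is stored as an RFC 3339 timestamp in an annotation on the stale object. The annotation key, the `staleAction` type, and the `decideStaleAction` and `mustParse` helpers are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// keepUntilAnnotation is a hypothetical annotation key used only in this sketch;
// the key used by the operator may differ.
const keepUntilAnnotation = "example.com/keep-until-time"

// staleAction describes what the controller should do with a stale state object.
type staleAction int

const (
	actionSetKeepUntil staleAction = iota // first time seen as stale: record a deadline
	actionWait                            // deadline recorded but not reached yet
	actionDelete                          // deadline reached, or the delay is disabled
)

// decideStaleAction returns the action to take and the keep-until value that
// applies. A delay of 0 (or less) means "delete immediately".
func decideStaleAction(annotations map[string]string, now time.Time, delayMinutes int) (staleAction, string) {
	if delayMinutes <= 0 {
		return actionDelete, ""
	}
	if raw, ok := annotations[keepUntilAnnotation]; ok {
		if keepUntil, err := time.Parse(time.RFC3339, raw); err == nil {
			if now.Before(keepUntil) {
				return actionWait, raw
			}
			return actionDelete, raw
		}
		// Assumption: an unparseable value is treated as missing and reset below.
	}
	deadline := now.Add(time.Duration(delayMinutes) * time.Minute)
	return actionSetKeepUntil, deadline.UTC().Format(time.RFC3339)
}

// mustParse is a small helper for the demo; it panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2024-10-28T17:36:00Z")
	action, keepUntil := decideStaleAction(map[string]string{}, now, 30)
	fmt.Println(action == actionSetKeepUntil, keepUntil)
}
```

Keeping the decision separate from the Kubernetes client calls makes it easy to unit test; a controller could then update the annotation or delete the object based on the returned action.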

github-actions · 2024-10-28T17:36:11Z

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

/test-all: To run all tests for all vendors.
/test-e2e-all: To run all E2E tests for all vendors.
/test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

/skip-all: To skip all tests for all vendors.
/skip-e2e-all: To skip all E2E tests for all vendors.
/skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
Best regards.

coveralls · 2024-10-28T17:42:41Z

Pull Request Test Coverage Report for Build 12122609914

Details

57 of 70 (81.43%) changed or added relevant lines in 2 files are covered.
6 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.1%) to 47.315%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
api/v1/helper.go	20	23	86.96%
controllers/sriovnetworknodepolicy_controller.go	37	47	78.72%

Files with Coverage Reduction	New Missed Lines	%
controllers/sriovnetworknodepolicy_controller.go	1	59.27%
controllers/generic_network_controller.go	5	74.38%

Totals
Change from base Build 12046950769:	0.1%
Covered Lines:	7242
Relevant Lines:	15306

💛 - Coveralls

almaslennikov

LGTM

e0ne · 2024-10-29T08:31:31Z

controllers/sriovnetworknodepolicy_controller.go

+		// keep until time annotation is not set, set a new value with default or configured offset and update the object
+		delayMinutes, err := strconv.Atoi(os.Getenv("STALE_NODE_STATE_CLEANUP_DELAY"))
+		if err != nil || delayMinutes <= 0 {
+			delayMinutes = 30 // keep objects for 30 minutes by default


Let's add some logging here to indicate that the env variable is incorrect.

nit: we've got an unused CleanupTimeout [1] variable. Maybe we need to change its value and use it here.

[1]

sriov-network-operator/test/util/util.go

Line 34 in 68b6c02

CleanupTimeout = time.Second * 5

I think we should not use this one because it is in the "test" package. But I agree that we can create a constant with the default value.

I added a constant to hold the default value.

adrianchiris · 2024-10-29T11:25:47Z

deployment/sriov-network-operator-chart/values.yaml

@@ -27,6 +27,9 @@ operator:
  resourcePrefix: "openshift.io"
  cniBinPath: "/opt/cni/bin"
  clusterType: "kubernetes"
+  # minimal amount of time (in minutes) the operator will wait before removing
+  # stale SriovNetworkNodeState objects (objects that doesn't match node with the daemon)
+  staleNodeStateCleanupDelay: "30"


nit: staleNodeStateCleanupDelayMinutes? Or is that too long in your opinion?

adrianchiris

Overall LGTM from my side; I added a small comment.

Once Ivan's comment is addressed (we should use a constant, IMO), it's LGTM from me.

adrianchiris · 2024-10-29T11:34:25Z

controllers/sriovnetworknodepolicy_controller.go

-				err := r.Delete(ctx, &ns, &client.DeleteOptions{})
-				if err != nil {
-					logger.Error(err, "Fail to Delete", "SriovNetworkNodeState CR:", ns.GetName())
+				if err := r.handleStaleNodeState(ctx, &ns); err != nil {


Note to self: in the worst case this is called every ~5 min (the resync period); it is also called when a policy is updated or when a policy/node changes.

ykulazhenkov · 2024-10-29T11:49:51Z

CI failure is not related to the change. The same failure occurs on the PR with dummy changes #800

ykulazhenkov · 2024-10-29T12:53:31Z

@e0ne @adrianchiris I addressed your comments. I also changed the behavior a bit to completely avoid any delay if the STALE_NODE_STATE_CLEANUP_DELAY_MINUTES env variable is explicitly set to 0.
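
A rough sketch of how the environment variable could be interpreted after these changes: a constant holds the default, an invalid value is logged and replaced by the default, and an explicit 0 disables the delay. The constant names and the `parseCleanupDelay` helper are hypothetical, and treating negative values as invalid is an assumption rather than something confirmed in this thread.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"time"
)

const (
	// cleanupDelayEnv is the variable named in the PR description.
	cleanupDelayEnv = "STALE_NODE_STATE_CLEANUP_DELAY_MINUTES"
	// defaultCleanupDelayMinutes is a hypothetical name for the 30-minute default.
	defaultCleanupDelayMinutes = 30
)

// parseCleanupDelay converts the raw env value into a duration.
// An empty value yields the default; "0" yields no delay at all.
func parseCleanupDelay(raw string) time.Duration {
	fallback := defaultCleanupDelayMinutes * time.Minute
	if raw == "" {
		return fallback
	}
	minutes, err := strconv.Atoi(raw)
	if err != nil || minutes < 0 {
		// Log the bad value so a misconfiguration is visible to the operator admin.
		log.Printf("invalid %s value %q, using default of %d minutes",
			cleanupDelayEnv, raw, defaultCleanupDelayMinutes)
		return fallback
	}
	return time.Duration(minutes) * time.Minute
}

func main() {
	delay := parseCleanupDelay(os.Getenv(cleanupDelayEnv))
	fmt.Println("stale node state cleanup delay:", delay)
}
```

Because `os.Getenv` returns an empty string for both unset and empty variables, this sketch cannot distinguish the two and falls back to the default in either case.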

e0ne

LGTM. Thanks for addressing my comments

adrianchiris

LGTM


ykulazhenkov requested review from SchSeba and adrianchiris October 28, 2024 17:36

almaslennikov approved these changes Oct 29, 2024

e0ne requested changes Oct 29, 2024

adrianchiris reviewed Oct 29, 2024

ykulazhenkov force-pushed the pr-keep-stale-node-state branch from 4245aa3 to 13dc502 Compare October 29, 2024 12:50

ykulazhenkov requested review from adrianchiris and e0ne October 29, 2024 12:53

ykulazhenkov force-pushed the pr-keep-stale-node-state branch from 13dc502 to 994ddbf Compare October 29, 2024 14:36

e0ne approved these changes Oct 30, 2024

adrianchiris approved these changes Oct 30, 2024

ykulazhenkov force-pushed the pr-keep-stale-node-state branch 2 times, most recently from 1898df7 to 4ea6ce0 Compare December 2, 2024 15:08

ykulazhenkov force-pushed the pr-keep-stale-node-state branch from 4ea6ce0 to 5ad4ae9 Compare December 2, 2024 15:31
