Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Update controller logic to handle stale SriovNetworkNodeState CRs with delay #798

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ykulazhenkov
Copy link
Collaborator

@ykulazhenkov ykulazhenkov commented Oct 28, 2024

Update controller logic to handle stale SriovNetworkNodeState CRs with delay

  • Changed the logic in the sriov-network-operator controller to handle stale SriovNetworkNodeState CRs (those with no matching Nodes with daemon).
  • Introduced a delay (30 minutes by default) before removing stale state CRs to manage scenarios where the user temporarily removes the daemon from the node but does not want to lose the state stored in the SriovNetworkNodeState.
  • Added the STALE_NODE_STATE_CLEANUP_DELAY_MINUTES environment variable to configure the required delay in minutes (default is 30 minutes).

This functionality especially useful when the OFED container is in use. As the OFED driver loads on the host, the sriov-config-daemon is removed from this node (achieved using configDaemon nodeselector). Since loading the driver can take a considerable amount of time, we want to ensure that the SriovNetworkNodeState is not lost during this process.

Copy link

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@coveralls
Copy link

coveralls commented Oct 28, 2024

Pull Request Test Coverage Report for Build 12122609914

Details

  • 57 of 70 (81.43%) changed or added relevant lines in 2 files are covered.
  • 6 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.1%) to 47.315%

Changes Missing Coverage Covered Lines Changed/Added Lines %
api/v1/helper.go 20 23 86.96%
controllers/sriovnetworknodepolicy_controller.go 37 47 78.72%
Files with Coverage Reduction New Missed Lines %
controllers/sriovnetworknodepolicy_controller.go 1 59.27%
controllers/generic_network_controller.go 5 74.38%
Totals Coverage Status
Change from base Build 12046950769: 0.1%
Covered Lines: 7242
Relevant Lines: 15306

💛 - Coveralls

Copy link
Contributor

@almaslennikov almaslennikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// keep until time annotation is not set, set a new value with default or configured offset and update the object
delayMinutes, err := strconv.Atoi(os.Getenv("STALE_NODE_STATE_CLEANUP_DELAY"))
if err != nil || delayMinutes <= 0 {
delayMinutes = 30 // keep objects for 30 minutes by default
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some logging here to indicate env variable is incorrect

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we've got unused CleanupTimeout [1] variable. maybe we need to change it's value and use it here

[1]

CleanupTimeout = time.Second * 5

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not use this one because it is in "test" package. But I agree that we can create constant with default value.

Copy link
Collaborator Author

@ykulazhenkov ykulazhenkov Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added constant to hold the default value

@@ -27,6 +27,9 @@ operator:
resourcePrefix: "openshift.io"
cniBinPath: "/opt/cni/bin"
clusterType: "kubernetes"
# minimal amount of time (in minutes) the operator will wait before removing
# stale SriovNetworkNodeState objects (objects that doesn't match node with the daemon)
staleNodeStateCleanupDelay: "30"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: staleNodeStateCleanupDelayMinutes ? or is that too long in your opinion ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@adrianchiris adrianchiris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm from my side, added small comment.

once Ivan's comment addressed (we should use a constant IMO) im LGTM.

err := r.Delete(ctx, &ns, &client.DeleteOptions{})
if err != nil {
logger.Error(err, "Fail to Delete", "SriovNetworkNodeState CR:", ns.GetName())
if err := r.handleStaleNodeState(ctx, &ns); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: this is being called every ~5 min (resync period) worst case or when policy is updated or when policy/node changed

@ykulazhenkov
Copy link
Collaborator Author

CI failure is not related to the change. The same failure occurs on the PR with dummy changes #800

@ykulazhenkov
Copy link
Collaborator Author

@e0ne @adrianchiris I addressed your comments. I also changed behavior a bit to completely avoid any delay in case if STALE_NODE_STATE_CLEANUP_DELAY_MINUTES env is explicitly set to 0

Copy link
Collaborator

@e0ne e0ne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for addressing my comments

Copy link
Collaborator

@adrianchiris adrianchiris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ykulazhenkov ykulazhenkov force-pushed the pr-keep-stale-node-state branch 2 times, most recently from 1898df7 to 4ea6ce0 Compare December 2, 2024 15:08
…Rs with delay

- Changed the logic in the sriov-network-operator controller to handle stale SriovNetworkNodeState CRs (those with no matching Nodes with daemon).
- Introduced a delay (30 minutes by default) before removing stale state CRs to manage scenarios where the user temporarily removes the daemon from the node but does not want to lose the state stored in the SriovNetworkNodeState.
- Added the `STALE_NODE_STATE_CLEANUP_DELAY_MINUTES` environment variable to configure the required delay in minutes (default is 30 minutes).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants