
Optimize refreshing SriovNetworkNodeState #635

Open

alan-kut wants to merge 1 commit into master from pr-optimize-SriovNetworkNodeStates
Conversation

alan-kut
Contributor

There is a SriovNetworkNodeState watcher running.
It should be used to keep the state of the object up to date. Requesting the latest state of the object very frequently
overloads the kube-apiserver in big clusters.

Signed-off-by: Alan Kutniewski [email protected]


Thanks for your PR,
To run the vendors' CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors' CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@coveralls

coveralls commented Feb 13, 2024

Pull Request Test Coverage Report for Build 7913628099

Details

  • 9 of 16 (56.25%) changed or added relevant lines in 2 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 30.287%

Changes Missing Coverage:
  pkg/daemon/writer.go: 0 of 7 changed/added lines covered (0.0%)

Files with Coverage Reduction:
  api/v1/helper.go: 3 new missed lines (43.73%)

Totals Coverage Status:
  Change from base Build 7875814354: 0.1%
  Covered Lines: 3562
  Relevant Lines: 11761

💛 - Coveralls

@alan-kut force-pushed the pr-optimize-SriovNetworkNodeStates branch from f919331 to 1de0993 on February 15, 2024 09:09
@alan-kut force-pushed the pr-optimize-SriovNetworkNodeStates branch from 1de0993 to f9cf915 on February 15, 2024 09:18

@alan-kut
Contributor Author

/test-all

@@ -511,7 +522,7 @@ func (dn *Daemon) nodeStateSyncHandler() error {
 // we need to load the latest status to our object
 // if we don't do it we can have a race here where the user remove the virtual functions but the operator didn't
 // trigger the refresh
-updatedState, err := dn.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{})
+updatedState, err := dn.getSriovNetworkNodeState()
Collaborator

We should be careful: there are cases where we do want the latest version from the API and not the locally cached object.
One way to maintain consistency w.r.t. the node state status is to use a mutation cache [1] (and update it from the writer).

@SchSeba can perhaps provide more info as the author of the comment above this change.

We had discussions about getting rid of the writer routine altogether: have the daemon compute the current state from what is currently on the node, then reconcile and update the status, which should reduce API requests AFAIU.

[1] https://github.com/kubernetes/client-go/blob/306b201a2d292fd78483fdf6147131926ed25a78/tools/cache/mutation_cache.go#L39
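For illustration, roughly the kind of wiring this would mean (a sketch only: the constructor signature matches the client-go revision linked in [1] and may differ in other releases, and the informer/writer plumbing here is hypothetical, not code from this PR):

import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// Overlay the informer's store with a mutation cache so an object we just wrote
// (e.g. a status update from the writer) is visible to readers before the
// corresponding watch event comes back.
func newNodeStateMutationCache(informer cache.SharedIndexInformer) cache.MutationCache {
	return cache.NewIntegerResourceVersionMutationCache(
		informer.GetStore(),   // backing cache fed by the watch
		informer.GetIndexer(), // indexer backing ByIndex lookups
		time.Minute,           // how long locally mutated entries stay overlaid
		false,                 // includeAdds: only overlay updates, not brand-new objects
	)
}

// After the writer pushes an update, record the returned object so subsequent
// cache reads see it immediately:
//
//	updated, err := client.SriovnetworkV1().SriovNetworkNodeStates(ns).Update(ctx, newState, metav1.UpdateOptions{})
//	if err == nil {
//		mutationCache.Mutation(updated)
//	}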

@aojea

Controllers must have only one source of truth, and this is obtained from the informers' lists, so they are eventually consistent; this is how this is designed ... you should follow the best practices and avoid touching the internals of client-go.

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md
https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/sample-controller
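A minimal sketch of that canonical flow (illustrative only, using generic client-go informer/workqueue types rather than this repo's generated listers):

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Events from the watch enqueue object keys; the reconciler then reads the object
// back from the informer's local store (the single source of truth) instead of
// issuing a quorum GET against the apiserver.
func runController(informer cache.SharedIndexInformer, stopCh <-chan struct{}) {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})

	go informer.Run(stopCh)
	if !cache.WaitForCacheSync(stopCh, informer.HasSynced) {
		return
	}

	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		if obj, exists, err := informer.GetStore().GetByKey(key.(string)); err == nil && exists {
			// reconcile using the cached copy; any later change re-enqueues the key
			fmt.Printf("reconciling %s (%T)\n", key, obj)
		}
		queue.Done(key)
	}
}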

@aojea

🤔 So you enqueue the object in the workqueue but you don't plumb the enqueued object, and instead you do a quorum read? This is racy. As explained above, the controller must operate with the object from the watch and reconcile; if there are changes, the next reconciliation will move the state to the desired state.

Collaborator

Hi @aojea,
MutationCache is a public API. It's also used in [1]. Until we go ahead and rework this part of sriov-network-config-daemon, I don't have a better idea :).

Please elaborate on what you are proposing.

[1] https://github.com/kubernetes/dynamic-resource-allocation/blob/b9a145322b8a89fe625c17bcd507f79cc16215a3/controller/controller.go#L218

@aojea

That was added for a very specific case; actually, AFAIK only Patrick is using it in that controller. We run multiple controllers in Kubernetes core without having to use that, and they are tested daily in clusters with 15k nodes ... Without knowing your exact problem I cannot suggest anything concrete, just saying that in my experience the canonical way of using controllers works for 99% of the cases.

Contributor Author

Getting back to:

we should be careful: there are cases where we do want the latest version from the API and not the locally cached object.

What is the reason? The delay is never 0: there is kube-apiserver latency, and there is latency between getting the object and acting on it, so in the best-case scenario the code works on, let's say, a 50ms-old object. There is no guarantee that the object hasn't been changed since.
On the other hand, in a healthy cluster the watcher usually keeps the object almost as fresh, with a much smaller load on the control plane (no need to send an additional request and no need to fetch the latest version of the object from etcd).

Contributor Author

When we need to update the object we don't need the latest version: the kube-apiserver can decline the update due to an outdated object and then the controller can retry; at some point it will get the updated object via the watcher and the update request will succeed.
When we need to act on an update we can wait a little bit longer, get the object via the watcher, and act on it.
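Sketch of that flow with client-go's conflict-retry helper (illustrative only; the clientset import path and the mutate callback are assumptions, not this PR's code):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
	snclientset "github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/client/clientset/versioned" // assumed path
)

// Work with whatever copy we have; if the apiserver rejects the write with a
// Conflict, re-read and retry instead of demanding the freshest object up front.
func updateNodeState(ctx context.Context, c snclientset.Interface, ns, nodeName string,
	mutate func(*sriovnetworkv1.SriovNetworkNodeState)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		state, err := c.SriovnetworkV1().SriovNetworkNodeStates(ns).Get(
			ctx, nodeName, metav1.GetOptions{ResourceVersion: "0"}) // cached read is fine here
		if err != nil {
			return err
		}
		mutate(state) // apply the desired change to our (possibly stale) copy
		_, err = c.SriovnetworkV1().SriovNetworkNodeStates(ns).Update(ctx, state, metav1.UpdateOptions{})
		return err // a Conflict error makes RetryOnConflict loop with a fresh read
	})
}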

Collaborator

What is the reason?

That is why I asked @SchSeba for input :)

Collaborator

Hi folks,

First, thanks for all the great input!
I am working on rewriting the daemon part of the code, but in general the reason we are doing this is that before we start the work we run the writer function, which does a discovery of all the SR-IOV devices in the system; that is why we have the wait at line https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/635/files#diff-a53b7b593d3d778e62eaeeafa40088656f9212bfa2c2b7991df15fa78e60b0f0R520

I agree that this is not the right way, and I am working on a PR to remove the goroutine running a writer.


@SchSeba I understand why you need the wait here, but could you also elaborate on why you need to fetch the most recent state? Your comment says "we can have a race here where the user remove the virtual functions but the operator didn't trigger the refresh". I would appreciate it if you could describe an example to help me understand.

@@ -207,7 +213,7 @@ func (w *NodeStateStatusWriter) getNodeState() (*sriovnetworkv1.SriovNetworkNode
 var lastErr error
 var n *sriovnetworkv1.SriovNetworkNodeState
 err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
-n, lastErr = w.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{})
+n, lastErr = w.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{ResourceVersion: "0"})
Collaborator

I always keep forgetting: can you please remind me why we need the ResourceVersion to be zero here?

Collaborator

From the k8s docs [1]:

Return data at any resource version. The newest available resource version is preferred, but strong consistency is not required; data at any resource version may be served. It is possible for the request to return data at a much older resource version that the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this should not use this semantic.

[1] https://kubernetes.io/docs/reference/using-api/api-concepts/#resourceversion-in-metadata

Contributor Author

If you set ResourceVersion="0" you get an object from the kube-apiserver cache.
If you request the latest version (without setting ResourceVersion), the request always needs to reach etcd and retrieve the latest version.

I was running this code at 6k-node scale. It refreshes the state every 30s. This means there was a constant 200 QPS that had to fetch the object from etcd and hardly ever (if ever) changed anything.

In general the preferable way to keep track of the object is to use a watcher. ListAndWatch handles the whole flow, but in general:

  • List objects with RV=0 to get the latest state from the kube-apiserver cache.
  • Watch starting from the returned resourceVersion so you get all the updates since that state.

I'm not aware of any use case in which you really need to have the latest state from etcd.
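For illustration, a bare-bones version of that list-then-watch flow with the generated clientset (a sketch; the clientset import path is assumed, and in practice the informer's ListAndWatch does this for you):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
	snclientset "github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/client/clientset/versioned" // assumed path
)

// List from the apiserver watch cache (RV=0), then watch from the returned
// resourceVersion so no update between the list and the watch is missed.
func watchNodeStates(ctx context.Context, c snclientset.Interface, ns string) error {
	list, err := c.SriovnetworkV1().SriovNetworkNodeStates(ns).List(ctx,
		metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		return err
	}
	w, err := c.SriovnetworkV1().SriovNetworkNodeStates(ns).Watch(ctx,
		metav1.ListOptions{ResourceVersion: list.ResourceVersion})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if state, ok := ev.Object.(*sriovnetworkv1.SriovNetworkNodeState); ok {
			_ = state // every change arrives here without extra quorum reads against etcd
		}
	}
	return nil
}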
