Optimize refreshing SriovNetworkNodeState #635
base: master
Conversation
Coveralls: Pull Request Test Coverage Report for Build 7913628099
Force-pushed from f919331 to 1de0993
There is a SriovNetworkNodeState watcher running. It should be used to keep the state of the object up to date. Requesting the latest state of the object very often overloads the kube-apiserver in big clusters. Signed-off-by: Alan Kutniewski <[email protected]>
Force-pushed from 1de0993 to f9cf915
/test-all
@@ -511,7 +522,7 @@ func (dn *Daemon) nodeStateSyncHandler() error {
 	// we need to load the latest status to our object
 	// if we don't do it we can have a race here where the user remove the virtual functions but the operator didn't
 	// trigger the refresh
-	updatedState, err := dn.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{})
+	updatedState, err := dn.getSriovNetworkNodeState()
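This hunk does not show the helper's body. Given the commit message's point that a SriovNetworkNodeState watcher is already running, a plausible shape for it (not necessarily the PR's actual implementation; the informer field and error message are illustrative) is a lookup against the watcher's local store instead of a direct GET:

```go
// Illustrative only: serve the node state from the informer cache that the
// daemon's watcher keeps up to date, instead of asking the kube-apiserver.
func (dn *Daemon) getSriovNetworkNodeState() (*sriovnetworkv1.SriovNetworkNodeState, error) {
	// client-go stores key namespaced objects as "namespace/name".
	obj, exists, err := dn.nodeStateInformer.GetStore().GetByKey(vars.Namespace + "/" + vars.NodeName)
	if err != nil {
		return nil, err
	}
	if !exists {
		return nil, fmt.Errorf("SriovNetworkNodeState %s/%s not found in local cache", vars.Namespace, vars.NodeName)
	}
	// DeepCopy so callers can mutate the result without corrupting the cache.
	return obj.(*sriovnetworkv1.SriovNetworkNodeState).DeepCopy(), nil
}
```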
We should be careful: there are cases where we do want the latest version from the API and not the locally cached object.
One way to maintain consistency WRT node state status is to use a mutating cache [1] (and update it from the writer).
@SchSeba can perhaps provide more info as the author of the comment above this change.
We had discussions about getting rid of the writer routine altogether, having the daemon compute the current state from what is currently on the node, then reconcile and update the status, which should reduce API requests AFAIU.
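For reference, a minimal sketch of the mutating-cache idea mentioned above, assuming client-go's cache.MutationCache API (the constructor signature should be checked against the vendored client-go; the informer wiring here is illustrative, not the daemon's actual code):

```go
import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// newNodeStateMutationCache layers a MutationCache over the informer's store:
// reads fall through to the watch-fed store, while Mutation() overlays an
// object we just wrote until the watch delivers it back (or the TTL expires).
func newNodeStateMutationCache(informer cache.SharedIndexInformer) cache.MutationCache {
	return cache.NewIntegerResourceVersionMutationCache(
		informer.GetStore(),   // backing store kept fresh by the watcher
		informer.GetIndexer(), // indexer for ByIndex lookups (may be nil)
		time.Minute,           // how long a local write shadows the store
		false,                 // don't surface items missing from the store
	)
}
```

After a successful status write, the writer would call Mutation() with the object returned by the apiserver, so its own update is visible to readers even before the watch catches up.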
Controllers must have only one source of truth, and that is obtained from the informers' lists, so they are eventually consistent; this is how this is designed... You must follow the best practices and avoid touching the internals of client-go:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md
https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/sample-controller
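A sketch of the canonical pattern referenced here: the sync handler resolves the enqueued key against a lister fed by the shared informer rather than issuing a direct GET. The lister field and reconcile helper below are hypothetical:

```go
import (
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/tools/cache"
)

// Hypothetical handler fragment following the sample-controller pattern.
func (dn *Daemon) handleNodeStateKey(key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	// dn.nodeStateLister would be a generated lister backed by the shared
	// informer; it returns the eventually consistent local copy.
	state, err := dn.nodeStateLister.SriovNetworkNodeStates(namespace).Get(name)
	if errors.IsNotFound(err) {
		return nil // object was deleted; nothing left to reconcile
	}
	if err != nil {
		return err
	}
	// Reconcile from the cached copy; a conflicting write simply triggers
	// another round once the watch delivers a fresher object.
	return dn.reconcileNodeState(state.DeepCopy())
}
```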
🤔 So you enqueue the object in the workqueue, but you don't plumb the enqueued object through and instead do a quorum read? This is racy: as explained above, the controller must operate on the object from the watch and reconcile; if there are changes, the next reconciliation will move the state to the desired state.
Hi @aojea,
The mutating cache is public API; it's also used in [1]. Until we go ahead and rework this part of the sriov-network-config-daemon I don't have a better idea :).
Please elaborate on what you are proposing.
That was added for a very specific case; actually, AFAIK only Patrick is using it in that controller. We run multiple controllers in Kubernetes core without having to use that, and they are tested daily in clusters with 15k nodes... Without knowing your exact problem I cannot suggest anything concrete, just saying that in my experience the canonical way of using controllers works for 99% of the cases.
Getting back to
We should be careful: there are cases where we do want the latest version from the API and not the locally cached object.
What is the reason? The delay is never 0: there is kube-apiserver latency, and there is latency between getting the object and acting on it, so in the best case the code works on, let's say, a 50ms-old object. There is no guarantee that the object hasn't been changed since.
On the other side, in a healthy cluster the watcher should usually keep the object almost as fresh, with a much smaller load on the control plane (no need to send an additional request and no need to fetch the latest version of the object from etcd).
When we need to update the object we don't need the latest version: the kube-apiserver can decline the update due to an outdated object and then the controller may retry; at some point it will get the updated object via the watcher and the update request will succeed.
When we need to act on an update we can wait a little bit longer, get the object via the watcher, and act on it.
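A minimal sketch of that retry pattern, using client-go's retry helper; getSriovNetworkNodeState stands for the PR's cache-backed getter, while the UpdateStatus call and the mutate callback are assumptions for illustration:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"
)

// updateNodeStateStatus applies mutate() to the cached copy and writes it
// back, retrying only when the apiserver rejects the write as a conflict.
func (dn *Daemon) updateNodeStateStatus(mutate func(*sriovnetworkv1.SriovNetworkNodeState)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Start from the (possibly slightly stale) cached object.
		state, err := dn.getSriovNetworkNodeState()
		if err != nil {
			return err
		}
		state = state.DeepCopy()
		mutate(state)
		// If the cached ResourceVersion is outdated this fails with a
		// Conflict, and we retry once the watcher has caught up.
		_, err = dn.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).
			UpdateStatus(context.Background(), state, metav1.UpdateOptions{})
		return err
	})
}
```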
What is the reason?
That is why I asked @SchSeba for input :)
Hi folks,
First, thanks for all the great input!
I am working on rewriting the daemon part of the code, but in general the reason we are doing this is that before we start the work we run the writer function, which does a discovery of all the SR-IOV devices in the system; that is why we have the wait in line https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/635/files#diff-a53b7b593d3d778e62eaeeafa40088656f9212bfa2c2b7991df15fa78e60b0f0R520
I agree that this is not the right way, and I am working on a PR to remove the goroutine running the writer.
@SchSeba I understand why you need the wait here, but could you also elaborate on why you need to fetch the most recent state? Your comment says "we can have a race here where the user remove the virtual functions but the operator didn't trigger the refresh". I would appreciate it if you could describe an example to help me understand.
@@ -207,7 +213,7 @@ func (w *NodeStateStatusWriter) getNodeState() (*sriovnetworkv1.SriovNetworkNode
 	var lastErr error
 	var n *sriovnetworkv1.SriovNetworkNodeState
 	err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
-		n, lastErr = w.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{})
+		n, lastErr = w.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{ResourceVersion: "0"})
I always keep forgetting: can you please remind me why we need the ResourceVersion to be zero here?
From the k8s docs [1]:
Return data at any resource version. The newest available resource version is preferred, but strong consistency is not required; data at any resource version may be served. It is possible for the request to return data at a much older resource version that the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this should not use this semantic.
[1] https://kubernetes.io/docs/reference/using-api/api-concepts/#resourceversion-in-metadata
If you set ResourceVersion="0" you get an object from the kube-apiserver cache.
If you request the latest version (without setting ResourceVersion), the request always needs to reach etcd and retrieve the latest version.
I was running this code at 6k-node scale. It refreshes the state every 30s, which means a constant 200 QPS that has to fetch the object from etcd and hardly ever (if ever) changes anything.
In general the preferable way to keep track of the object is to use a watcher. ListAndWatch handles the whole flow, but in general:
- List objects with RV=0 to get the latest state from the kube-apiserver cache.
- Watch starting from the returned version so you get all updates since that state.
I'm not aware of any use case in which you really need to have the latest state from etcd.
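To make the difference concrete, a small sketch contrasting the two read paths (the clientset import path is an assumption based on the repo's generated client):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	snclientset "github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/client/clientset/versioned"
)

// readNodeState contrasts a quorum read with a cache read.
func readNodeState(ctx context.Context, c snclientset.Interface, ns, name string) {
	// Quorum read: no ResourceVersion, so the apiserver must fetch the
	// latest object from etcd on every call.
	_, _ = c.SriovnetworkV1().SriovNetworkNodeStates(ns).
		Get(ctx, name, metav1.GetOptions{})

	// Cache read: ResourceVersion "0" lets the apiserver serve the object
	// from its watch cache; it may be slightly stale, but etcd is untouched.
	_, _ = c.SriovnetworkV1().SriovNetworkNodeStates(ns).
		Get(ctx, name, metav1.GetOptions{ResourceVersion: "0"})
}
```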