
Optimize refreshing SriovNetworkNodeState #635

Open

alan-kut wants to merge 1 commit into master from pr-optimize-SriovNetworkNodeStates
Conversation

alan-kut
Contributor

There is a SriovNetworkNodeState watcher running.
It should be used to keep the state of the object up to date. Requesting the latest state of the object very frequently
overloads the kube-apiserver in big clusters.

Signed-off-by: Alan Kutniewski [email protected]


Thanks for your PR,
To run the vendors' CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors' CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@coveralls

coveralls commented Feb 13, 2024

Pull Request Test Coverage Report for Build 7913628099

Details

  • 9 of 16 (56.25%) changed or added relevant lines in 2 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 30.287%

Changes Missing Coverage:
  pkg/daemon/writer.go: 0 of 7 changed/added lines covered (0.0%)

Files with Coverage Reduction:
  api/v1/helper.go: 3 new missed lines (43.73%)

Totals Coverage Status:
  Change from base Build 7875814354: 0.1%
  Covered Lines: 3562
  Relevant Lines: 11761

💛 - Coveralls

@alan-kut force-pushed the pr-optimize-SriovNetworkNodeStates branch from f919331 to 1de0993 on February 15, 2024 09:09
@alan-kut force-pushed the pr-optimize-SriovNetworkNodeStates branch from 1de0993 to f9cf915 on February 15, 2024 09:18

@alan-kut
Contributor Author

/test-all

@@ -511,7 +522,7 @@ func (dn *Daemon) nodeStateSyncHandler() error {
 // we need to load the latest status to our object
 // if we don't do it we can have a race here where the user remove the virtual functions but the operator didn't
 // trigger the refresh
-updatedState, err := dn.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{})
+updatedState, err := dn.getSriovNetworkNodeState()
Collaborator

We should be careful: there are cases where we do want the latest version from the API and not the locally cached object.
One way to maintain consistency w.r.t. the node state status is to use a mutation cache [1] (and update it from the writer).

@SchSeba can perhaps provide more info as the author of the comment above this change.

We had discussions about getting rid of the writer routine altogether: have the daemon compute the current state from what is currently on the node, then reconcile and update the status, which should reduce API requests AFAIU.

[1] https://github.com/kubernetes/client-go/blob/306b201a2d292fd78483fdf6147131926ed25a78/tools/cache/mutation_cache.go#L39
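For illustration, roughly the kind of wiring this would mean (a sketch only: the constructor signature matches the client-go revision linked in [1] and may differ in other releases, and the informer/writer plumbing here is hypothetical, not code from this PR):

import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// Overlay the informer's store with a mutation cache so an object we just wrote
// (e.g. a status update from the writer) is visible to readers before the
// corresponding watch event comes back.
func newNodeStateMutationCache(informer cache.SharedIndexInformer) cache.MutationCache {
	return cache.NewIntegerResourceVersionMutationCache(
		informer.GetStore(),   // backing cache fed by the watch
		informer.GetIndexer(), // indexer backing ByIndex lookups
		time.Minute,           // how long locally mutated entries stay overlaid
		false,                 // includeAdds: only overlay updates, not brand-new objects
	)
}

// After the writer pushes an update, record the returned object so subsequent
// cache reads see it immediately:
//
//	updated, err := client.SriovnetworkV1().SriovNetworkNodeStates(ns).Update(ctx, newState, metav1.UpdateOptions{})
//	if err == nil {
//		mutationCache.Mutation(updated)
//	}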

@aojea

Controllers must have only one source of truth, and this is obtained from the informers' lists, so they are eventually consistent; this is how this is designed ... you should follow the best practices and avoid touching the internals of client-go.

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md
https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/sample-controller
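A minimal sketch of that canonical flow (illustrative only, using generic client-go informer/workqueue types rather than this repo's generated listers):

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Events from the watch enqueue object keys; the reconciler then reads the object
// back from the informer's local store (the single source of truth) instead of
// issuing a quorum GET against the apiserver.
func runController(informer cache.SharedIndexInformer, stopCh <-chan struct{}) {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})

	go informer.Run(stopCh)
	if !cache.WaitForCacheSync(stopCh, informer.HasSynced) {
		return
	}

	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		if obj, exists, err := informer.GetStore().GetByKey(key.(string)); err == nil && exists {
			// reconcile using the cached copy; any later change re-enqueues the key
			fmt.Printf("reconciling %s (%T)\n", key, obj)
		}
		queue.Done(key)
	}
}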

@aojea

🤔 So you enqueue the object in the workqueue but you don't plumb the enqueued object, and instead you do a quorum read? This is racy. As explained above, the controller must operate with the object from the watch and reconcile; if there are changes, the next reconciliation will move the state to the desired state.

Collaborator

Hi @aojea,
MutationCache is a public API. It's also used in [1]. Until we go ahead and rework this part of sriov-network-config-daemon, I don't have a better idea :).

Please elaborate on what you are proposing.

[1] https://github.com/kubernetes/dynamic-resource-allocation/blob/b9a145322b8a89fe625c17bcd507f79cc16215a3/controller/controller.go#L218

@aojea

That was added for a very specific case; actually, AFAIK only Patrick is using it in that controller. We run multiple controllers in Kubernetes core without having to use that, and they are tested daily in clusters with 15k nodes ... Without knowing your exact problem I cannot suggest anything concrete, just saying that in my experience the canonical way of using controllers works for 99% of the cases.

Contributor Author

Getting back to:

we should be careful: there are cases where we do want the latest version from the API and not the locally cached object.

What is the reason? The delay is never 0: there is kube-apiserver latency, and there is latency between getting the object and acting on it, so in the best-case scenario the code works on, let's say, a 50ms-old object. There is no guarantee that the object hasn't been changed since.
On the other hand, in a healthy cluster the watcher usually keeps the object almost as fresh, with a much smaller load on the control plane (no need to send an additional request and no need to fetch the latest version of the object from etcd).

Contributor Author

When we need to update the object we don't need the latest version: the kube-apiserver can decline the update due to an outdated object and then the controller can retry; at some point it will get the updated object via the watcher and the update request will succeed.
When we need to act on an update we can wait a little bit longer, get the object via the watcher, and act on it.
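Sketch of that flow with client-go's conflict-retry helper (illustrative only; the clientset import path and the mutate callback are assumptions, not this PR's code):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
	snclientset "github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/client/clientset/versioned" // assumed path
)

// Work with whatever copy we have; if the apiserver rejects the write with a
// Conflict, re-read and retry instead of demanding the freshest object up front.
func updateNodeState(ctx context.Context, c snclientset.Interface, ns, nodeName string,
	mutate func(*sriovnetworkv1.SriovNetworkNodeState)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		state, err := c.SriovnetworkV1().SriovNetworkNodeStates(ns).Get(
			ctx, nodeName, metav1.GetOptions{ResourceVersion: "0"}) // cached read is fine here
		if err != nil {
			return err
		}
		mutate(state) // apply the desired change to our (possibly stale) copy
		_, err = c.SriovnetworkV1().SriovNetworkNodeStates(ns).Update(ctx, state, metav1.UpdateOptions{})
		return err // a Conflict error makes RetryOnConflict loop with a fresh read
	})
}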

Collaborator

What is the reason?

That is why I asked @SchSeba for input :)

Collaborator

Hi folks,

First, thanks for all the great input!
I am working on rewriting the daemon part of the code, but in general the reason we are doing this is that before we start the work we run the writer function, which does a discovery of all the SR-IOV devices in the system; that is why we have the wait at line https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/635/files#diff-a53b7b593d3d778e62eaeeafa40088656f9212bfa2c2b7991df15fa78e60b0f0R520

I agree that this is not the right way, and I am working on a PR to remove the goroutine running a writer.


@SchSeba I understand why you need the wait here, but could you also elaborate on why you need to fetch the most recent state? Your comment says "we can have a race here where the user remove the virtual functions but the operator didn't trigger the refresh". I would appreciate it if you could describe an example to help me understand.

@@ -207,7 +213,7 @@ func (w *NodeStateStatusWriter) getNodeState() (*sriovnetworkv1.SriovNetworkNode
 var lastErr error
 var n *sriovnetworkv1.SriovNetworkNodeState
 err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
-n, lastErr = w.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{})
+n, lastErr = w.client.SriovnetworkV1().SriovNetworkNodeStates(vars.Namespace).Get(context.Background(), vars.NodeName, metav1.GetOptions{ResourceVersion: "0"})
Collaborator

I always keep forgetting: can you please remind me why we need the ResourceVersion to be zero here?

Collaborator

From the k8s docs [1]:

Return data at any resource version. The newest available resource version is preferred, but strong consistency is not required; data at any resource version may be served. It is possible for the request to return data at a much older resource version that the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this should not use this semantic.

[1] https://kubernetes.io/docs/reference/using-api/api-concepts/#resourceversion-in-metadata

Contributor Author

If you set ResourceVersion="0" you get an object from the kube-apiserver cache.
If you request the latest version (without setting ResourceVersion), the request always needs to reach etcd and retrieve the latest version.

I was running this code at 6k-node scale. It refreshes the state every 30s. This means there was a constant 200 QPS that had to fetch the object from etcd and hardly ever (if ever) changed anything.

In general the preferable way to keep track of the object is to use a watcher. ListAndWatch handles the whole flow, but in general:

  • List objects with RV=0 to get the latest state from the kube-apiserver cache.
  • Watch starting from the returned resourceVersion so you get all the updates since that state.

I'm not aware of any use case in which you really need to have the latest state from etcd.
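For illustration, a bare-bones version of that list-then-watch flow with the generated clientset (a sketch; the clientset import path is assumed, and in practice the informer's ListAndWatch does this for you):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
	snclientset "github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/client/clientset/versioned" // assumed path
)

// List from the apiserver watch cache (RV=0), then watch from the returned
// resourceVersion so no update between the list and the watch is missed.
func watchNodeStates(ctx context.Context, c snclientset.Interface, ns string) error {
	list, err := c.SriovnetworkV1().SriovNetworkNodeStates(ns).List(ctx,
		metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		return err
	}
	w, err := c.SriovnetworkV1().SriovNetworkNodeStates(ns).Watch(ctx,
		metav1.ListOptions{ResourceVersion: list.ResourceVersion})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if state, ok := ev.Object.(*sriovnetworkv1.SriovNetworkNodeState); ok {
			_ = state // every change arrives here without extra quorum reads against etcd
		}
	}
	return nil
}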
