
Cerberus scalability issues #67

Closed
chaitanyaenr opened this issue May 27, 2020 · 10 comments
Labels
high priority This issue needs to be fixed ASAP

Comments

@chaitanyaenr
Collaborator

chaitanyaenr commented May 27, 2020

A simple run of Cerberus on a 10-node cluster vs. a 220-node cluster showed that we need to improve the way we run the checks for Cerberus to scale well on a cluster with hundreds/thousands of nodes. The time taken to run the checks has also been increasing as we add more checks, as @mffiedler mentioned in #53. Here are the observed timings:

10 nodes:
2020-05-22 23:00:17,391 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_nodes in iteration 1: 0.1057283878326416 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_cluster_operators in iteration 1: 0.3759939670562744 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_namespaces in iteration 1: 1.1533699035644531 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run entire_iteration in iteration 1: 4.368650436401367 seconds
2020-05-22 23:00:17,392 [INFO] --------------------------------------------------------------------------

220 nodes:
2020-05-22 23:13:00,130 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:13:00,130 [INFO] Time taken to run watch_nodes in iteration 2: 19.62144660949707 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_cluster_operators in iteration 2: 1.0622196197509766 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_namespaces in iteration 2: 68.95069146156311 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run entire_iteration in iteration 2: 161.81616592407227 seconds
2020-05-22 23:13:00,131 [INFO] --------------------------------------------------------------------------

@portante suggested areas for improvement. Multiprocessing will reduce the timing by using the available cores to run checks in parallel (#60), but in addition to running the checks in parallel as @portante suggested, we also need to look at optimizing the code to reduce the number of API calls and the loops that iterate over objects to get their status wherever possible. This issue is to track the observations and discuss ways to make Cerberus scale well on a large and dense cluster.
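
For illustration, a minimal sketch of the parallel-checks idea from #60, assuming each check is a standalone function that returns a pass/fail flag (the function bodies here are placeholders, not Cerberus's actual checks):

```python
import time
from multiprocessing import Pool

# Placeholder checks; in Cerberus these would hit the Kubernetes API.
def watch_nodes():
    time.sleep(1)
    return ("watch_nodes", True)

def watch_cluster_operators():
    time.sleep(1)
    return ("watch_cluster_operators", True)

def watch_namespaces():
    time.sleep(2)
    return ("watch_namespaces", True)

if __name__ == "__main__":
    checks = [watch_nodes, watch_cluster_operators, watch_namespaces]
    with Pool(processes=len(checks)) as pool:
        # Run every check in its own process; the iteration time becomes
        # roughly max(check durations) instead of their sum.
        async_results = [pool.apply_async(check) for check in checks]
        statuses = dict(result.get() for result in async_results)
    print(statuses, all(statuses.values()))
```

This only hides the latency, though - the number of API calls stays the same, which is why reducing the calls themselves still matters.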

NOTE: The 10-node and 220-node runs were on different clusters.

Thoughts?

@chaitanyaenr chaitanyaenr added the high priority This issue needs to be fixed ASAP label May 27, 2020

@portante

Consider separating updates pulled from master and infra nodes vs compute nodes. You might want to always watch master and infra nodes, but group compute nodes into smaller sets whose updates you gather in different iterations.

One might also want to consider randomly sampling compute nodes for updates, as in the sketch below.
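
A rough sketch of the grouping/sampling idea, assuming the kubernetes Python client is configured and that the standard node-role labels identify master/infra nodes (the 10% sample size is arbitrary):

```python
import random
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

nodes = v1.list_node().items
# Always check master and infra nodes; sample only the (large) worker pool.
critical = [n for n in nodes
            if "node-role.kubernetes.io/master" in (n.metadata.labels or {})
            or "node-role.kubernetes.io/infra" in (n.metadata.labels or {})]
critical_names = {n.metadata.name for n in critical}
workers = [n for n in nodes if n.metadata.name not in critical_names]

sample_size = min(len(workers), max(1, len(workers) // 10))
to_check = critical + random.sample(workers, sample_size)
print([n.metadata.name for n in to_check])
```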

Placing an "oc watch" on the various objects instead of polling their status might be a better approach: place a watch first, gather the initial data from each object, and then wait for updates as they change.
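
And a minimal sketch of the watch-based approach with the kubernetes Python client: one LIST for the initial state, then only deltas arrive over the watch stream (the timeout is arbitrary):

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

def is_ready(node):
    return any(c.type == "Ready" and c.status == "True"
               for c in (node.status.conditions or []))

# Initial state from a single LIST call.
node_status = {n.metadata.name: is_ready(n) for n in v1.list_node().items}

# After that, react only to changes instead of re-polling every node.
w = watch.Watch()
for event in w.stream(v1.list_node, timeout_seconds=300):
    node = event["object"]
    node_status[node.metadata.name] = is_ready(node)
    if not is_ready(node):
        print(f"Node {node.metadata.name} is NotReady")
```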

@aakarshg
Contributor

From what I read in the code, today we rely on the client making a request to kube-apiserver, which in turn makes a request to etcd, to get the status of the cluster - the number of nodes, the number of namespaces, and so on. If this is run in tandem with a control-plane-heavy test like mastervert, which already puts a lot of stress on the apiserver, it could be catastrophic. Multi-threading is good, but then again we're still putting a lot of additional load on the apiserver.

My suggestion would be to move towards using PromQL queries - @amitsagtani97 is adding support to touchstone for doing aggregations and run-to-run comparisons. touchstone can also be used as a Python library, so we can look at querying Prometheus through touchstone and see if anything changes. This will help with scaling, as we avoid going through the apiserver (and then etcd) and we don't put additional load on components that are already being stressed by the workloads. More importantly, we'd be using a database that's built to scale and is very good with short-term queries like these.
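
To illustrate the direction (this goes straight at the Prometheus HTTP API with a kube-state-metrics metric rather than through touchstone, whose API I haven't checked; the route and token are placeholders):

```python
import requests

PROM_URL = "https://prometheus-k8s.example.com"  # placeholder route
TOKEN = "<bearer token>"                          # placeholder credentials

def prom_query(query):
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # illustration only; use the cluster CA bundle in practice
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Nodes whose Ready condition is false, without touching the apiserver.
not_ready = prom_query(
    'kube_node_status_condition{condition="Ready",status="true"} == 0'
)
print(f"{len(not_ready)} node(s) NotReady")
```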

@aakarshg
Contributor


Just to keep in mind: we can't replace all the calls to the apiserver with a Prometheus query, but most of the big ones should be easy to replace, and for the ones that aren't supported we should see if making two Prometheus queries would help us.

@chaitanyaenr
Collaborator Author

chaitanyaenr commented May 27, 2020

@aakarshg It's a good idea - integrating Prometheus will also help with #65. But depending on Prometheus would mean that Cerberus can't get information about what's going on with the other components when Prometheus is down even though the cluster is still functional. The same thing can happen to the API server, but that is a much bigger problem if it happens. I agree that using multiprocessing is going to increase the number of API requests; we might want to look at doing the requests in batches to reduce the load, especially the ones related to nodes, since that is a heavy request on a cluster with a large number of nodes, like @portante mentioned.

Also, OpenShift ships with Prometheus as the default monitoring solution while Kubernetes doesn't, so Prometheus would need to be installed to use Cerberus on Kubernetes (just something to keep in mind, even though it's only us using Cerberus at the moment).

Thoughts?

@mffiedler
Collaborator

Some thoughts (no solutions!) on recent comments:

  1. Repeated oc/apiserver calls for monitoring purposes add stress to a control plane that is already under stress from the test in progress. PromQL is a trailing source for status/metrics, as indicated in previous comments, so it is probably not reliable for go/no-go - although if the scrape interval equals the Cerberus sleep interval, we'd have to test the differences. It would also still rely on a healthy ingress (see 2, below). The only other "real-time" source for some of the data outside of apiserver calls might be etcdctl.

  2. Another side-channel indicator of health (beyond etcdctl) is ingress route health. A good discussion is going on in "Monitor application/ingress route availability (OpenShift specific monitor)" (#48) that should help. It does not help with monitors that need good information from a healthy ingress (Prometheus), though.

  3. Using "oc watch" in Cerberus: in my experience watches are useful but fragile - they can be broken for a number of reasons, so we need to code accordingly (see the sketch below). Repeatedly broken watches are a possible sign of a cluster in distress, and hence a "meta" indicator of a no-go signal, but that's for later. A watch does save repeated new apiserver calls, but setting up a complex Watcher/Observer pattern for the Cerberus monitors would be work.
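
On point 3, a rough sketch of what coding for broken watches could look like with the kubernetes Python client - re-establish the watch when it drops and count the breakages as their own distress signal (the threshold is illustrative):

```python
import urllib3
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

broken_watches = 0
MAX_BREAKS = 5  # repeated breakage is itself a possible no-go indicator

while broken_watches < MAX_BREAKS:
    w = watch.Watch()
    try:
        for event in w.stream(v1.list_node, timeout_seconds=60):
            node = event["object"]
            # evaluate node conditions here (omitted)
    except (ApiException, urllib3.exceptions.ProtocolError):
        broken_watches += 1  # watch broke: note it and re-establish
        continue
    break  # stream ended cleanly after the timeout
```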

@chaitanyaenr
Collaborator Author

Here are a few enhancements which can help improve the scalability of Cerberus:

The current implementation of Cerberus monitors things like nodes and objects in a namespace as follows:

  • It queries the API to get the list of nodes, cluster operators, or objects in a namespace.
  • It then loops through that list to get the health status (an API call for each object).

We can cache the result of the first step in an in-memory database like Redis, since the list of nodes and pods in the system namespaces is not going to change very often, and re-use it until it expires (we can set the TTL to something like 15 minutes depending on the need). So, instead of making an API call to Kubernetes to get the list each time, we will query Redis until the entry expires, after which we will hit the Kube API to get the refreshed/latest list. This will reduce the number of API calls per iteration and the timing by a significant amount.
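
A minimal sketch of the caching idea, assuming a local Redis and the redis-py client (the key name and 15-minute TTL are illustrative):

```python
import json

import redis
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
cache = redis.Redis(host="localhost", port=6379, db=0)

CACHE_KEY = "cerberus:node_names"  # illustrative key
CACHE_TTL = 15 * 60                # refresh the cached list every 15 minutes

def get_node_names():
    cached = cache.get(CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    # Cache miss or expired entry: hit the Kube API once and repopulate.
    names = [n.metadata.name for n in v1.list_node().items]
    cache.setex(CACHE_KEY, CACHE_TTL, json.dumps(names))
    return names
```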

Also, instead of looping through the list and making an API call to get the status of each object, we can make a single request to dump the YAML or JSON of all the objects in a namespace, for example, and then scrape it to determine the status of the respective objects. This will also reduce the number of API calls per iteration. We need to make sure to set an appropriate chunk size so that large lists are returned in chunks rather than all at once.
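
And a sketch of the single bulk request with chunking, using the limit/continue support in the Kubernetes list API (the chunk size and namespace are just examples):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def namespace_pod_statuses(namespace, chunk_size=500):
    """One paginated LIST per namespace instead of one GET per pod."""
    statuses = {}
    _continue = None
    while True:
        resp = v1.list_namespaced_pod(namespace, limit=chunk_size,
                                      _continue=_continue)
        for pod in resp.items:
            statuses[pod.metadata.name] = pod.status.phase
        _continue = resp.metadata._continue
        if not _continue:
            break
    return statuses

print(namespace_pod_statuses("openshift-etcd"))
```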

@gnunn1

gnunn1 commented Jan 29, 2021

I have a customer that is looking to provide a single go/no-go signal to UptimeRobot for a production cluster. They were looking at various ways to aggregate cluster health, but it looks like Cerberus does this OOTB. I installed it against my personal cluster and it works great with UptimeRobot.

However, I'm wondering if this issue potentially makes it unsuitable for that use case - any thoughts? The customer's cluster is 40-50 bare metal servers.

@mffiedler
Collaborator

There have been several performance improvements since this issue was created. @chaitanyaenr @mohit-sheth can you comment on recent performance on medium/large clusters?

@chaitanyaenr
Collaborator Author

chaitanyaenr commented Jan 29, 2021

We added a bunch of enhancements to help with scalability, like @mffiedler mentioned, including multiprocessing, reducing the number of calls to the API, and reducing the load on the API server for the bulk requests, and we have used Cerberus without any issues on a 500-node cluster. Closing the issue since the scalability issues should be resolved now. @gnunn1 please feel free to let us know in case of any issues and we will be more than happy to resolve them. Thanks.
