
Cerberus scalability issues #67

Closed
chaitanyaenr opened this issue May 27, 2020 · 10 comments
Labels
high priority This issue needs to be fixed ASAP

Comments

@chaitanyaenr
Collaborator

chaitanyaenr commented May 27, 2020

A simple run of Cerberus on a 10-node cluster vs. a 220-node cluster showed that we need to improve the way we run the checks for Cerberus to scale well on a cluster with hundreds/thousands of nodes. The time taken to run the checks has also been increasing as we add more checks, as @mffiedler mentioned in #53. Here are the observed timings:

10 nodes:
2020-05-22 23:00:17,391 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_nodes in iteration 1: 0.1057283878326416 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_cluster_operators in iteration 1: 0.3759939670562744 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_namespaces in iteration 1: 1.1533699035644531 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run entire_iteration in iteration 1: 4.368650436401367 seconds
2020-05-22 23:00:17,392 [INFO] --------------------------------------------------------------------------

220 nodes:
2020-05-22 23:13:00,130 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:13:00,130 [INFO] Time taken to run watch_nodes in iteration 2: 19.62144660949707 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_cluster_operators in iteration 2: 1.0622196197509766 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_namespaces in iteration 2: 68.95069146156311 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run entire_iteration in iteration 2: 161.81616592407227 seconds
2020-05-22 23:13:00,131 [INFO] --------------------------------------------------------------------------

@portante suggested areas for improvement. Multiprocessing will reduce the timing by using the available cores to run checks in parallel (#60), but in addition to running the checks in parallel as @portante suggested, we also need to look at optimizing the code to reduce the number of API calls and the loops that iterate over objects to get their status wherever possible. This issue is to track the observations and discuss ways to make Cerberus scale well on a large and dense cluster.
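
For illustration, a minimal sketch of the parallel-checks idea from #60, assuming each check is a standalone function that returns a pass/fail flag (the function bodies here are placeholders, not Cerberus's actual checks):

```python
import time
from multiprocessing import Pool

# Placeholder checks; in Cerberus these would hit the Kubernetes API.
def watch_nodes():
    time.sleep(1)
    return ("watch_nodes", True)

def watch_cluster_operators():
    time.sleep(1)
    return ("watch_cluster_operators", True)

def watch_namespaces():
    time.sleep(2)
    return ("watch_namespaces", True)

if __name__ == "__main__":
    checks = [watch_nodes, watch_cluster_operators, watch_namespaces]
    with Pool(processes=len(checks)) as pool:
        # Run every check in its own process; the iteration time becomes
        # roughly max(check durations) instead of their sum.
        async_results = [pool.apply_async(check) for check in checks]
        statuses = dict(result.get() for result in async_results)
    print(statuses, all(statuses.values()))
```

This only hides the latency, though - the number of API calls stays the same, which is why reducing the calls themselves still matters.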

NOTE: The 10-node and 220-node runs were on different clusters.

Thoughts?

@chaitanyaenr chaitanyaenr added the high priority This issue needs to be fixed ASAP label May 27, 2020

@portante

Consider separating updates pulled from master and infra nodes vs compute nodes. You might want to always watch master and infra nodes, but group compute nodes into smaller sets whose updates you gather in different iterations.

One might also want to consider randomly sampling compute nodes for updates, as in the sketch below.
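
A rough sketch of the grouping/sampling idea, assuming the kubernetes Python client is configured and that the standard node-role labels identify master/infra nodes (the 10% sample size is arbitrary):

```python
import random
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

nodes = v1.list_node().items
# Always check master and infra nodes; sample only the (large) worker pool.
critical = [n for n in nodes
            if "node-role.kubernetes.io/master" in (n.metadata.labels or {})
            or "node-role.kubernetes.io/infra" in (n.metadata.labels or {})]
critical_names = {n.metadata.name for n in critical}
workers = [n for n in nodes if n.metadata.name not in critical_names]

sample_size = min(len(workers), max(1, len(workers) // 10))
to_check = critical + random.sample(workers, sample_size)
print([n.metadata.name for n in to_check])
```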

Placing an "oc watch" on the various objects instead of polling their status might be a better approach: place a watch first, gather the initial data from each object, and then wait for updates as they change.
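
And a minimal sketch of the watch-based approach with the kubernetes Python client: one LIST for the initial state, then only deltas arrive over the watch stream (the timeout is arbitrary):

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

def is_ready(node):
    return any(c.type == "Ready" and c.status == "True"
               for c in (node.status.conditions or []))

# Initial state from a single LIST call.
node_status = {n.metadata.name: is_ready(n) for n in v1.list_node().items}

# After that, react only to changes instead of re-polling every node.
w = watch.Watch()
for event in w.stream(v1.list_node, timeout_seconds=300):
    node = event["object"]
    node_status[node.metadata.name] = is_ready(node)
    if not is_ready(node):
        print(f"Node {node.metadata.name} is NotReady")
```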

@aakarshg
Contributor

From what I read in the code, today we rely on the client making a request to kube-apiserver, which in turn makes a request to etcd, to get the status of the cluster - the number of nodes, the number of namespaces, and so on. If this is run in tandem with a control-plane-heavy test like mastervert, which already puts a lot of stress on the apiserver, it could be catastrophic. Multi-threading is good, but then again we're still putting a lot of additional load on the apiserver.

My suggestion would be to move towards using PromQL queries - @amitsagtani97 is adding support to touchstone for doing aggregations and run-to-run comparisons. touchstone can also be used as a Python library, so we can look at querying Prometheus through touchstone and see if anything changes. This will help with scaling, as we avoid going through the apiserver (and then etcd) and we don't put additional load on components that are already being stressed by the workloads. More importantly, we'd be using a database that's built to scale and is very good with short-term queries like these.
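
To illustrate the direction (this goes straight at the Prometheus HTTP API with a kube-state-metrics metric rather than through touchstone, whose API I haven't checked; the route and token are placeholders):

```python
import requests

PROM_URL = "https://prometheus-k8s.example.com"  # placeholder route
TOKEN = "<bearer token>"                          # placeholder credentials

def prom_query(query):
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # illustration only; use the cluster CA bundle in practice
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Nodes whose Ready condition is false, without touching the apiserver.
not_ready = prom_query(
    'kube_node_status_condition{condition="Ready",status="true"} == 0'
)
print(f"{len(not_ready)} node(s) NotReady")
```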

@aakarshg
Contributor


Just to keep in mind: we can't replace all the calls to the apiserver with a Prometheus query, but most of the big ones should be easy to replace, and for the ones that aren't supported we should see if making two Prometheus queries would help us.

@chaitanyaenr
Collaborator Author

chaitanyaenr commented May 27, 2020

@aakarshg It's a good idea - integrating Prometheus will also help with #65. But depending on Prometheus would mean that Cerberus can't get information about what's going on with the other components when Prometheus is down even though the cluster is still functional. The same thing can happen to the API server, but that is a much bigger problem if it happens. I agree that using multiprocessing is going to increase the number of API requests; we might want to look at doing the requests in batches to reduce the load, especially the ones related to nodes, since that is a heavy request on a cluster with a large number of nodes, like @portante mentioned.

Also, OpenShift ships with Prometheus as the default monitoring solution while Kubernetes doesn't, so Prometheus would need to be installed to use Cerberus on Kubernetes (just something to keep in mind, even though it's only us using Cerberus at the moment).

Thoughts?

@mffiedler
Collaborator

Some thoughts (no solutions!) on recent comments:

  1. Repeated oc/apiserver calls for monitoring purposes add stress to a control plane that is already under stress from the test in progress. PromQL is a trailing source for status/metrics, as indicated in previous comments, so it is probably not reliable for go/no-go - although if the scrape interval equals the Cerberus sleep interval, we'd have to test the differences. It would also still rely on a healthy ingress (see 2, below). The only other "real-time" source for some of the data outside of apiserver calls might be etcdctl.

  2. Another side-channel indicator of health (beyond etcdctl) is ingress route health. A good discussion is going on in "Monitor application/ingress route availability (OpenShift specific monitor)" (#48) that should help. It does not help with monitors that need good information from a healthy ingress (Prometheus), though.

  3. Using "oc watch" in Cerberus: in my experience watches are useful but fragile - they can be broken for a number of reasons, so we need to code accordingly (see the sketch below). Repeatedly broken watches are a possible sign of a cluster in distress, and hence a "meta" indicator of a no-go signal, but that's for later. A watch does save repeated new apiserver calls, but setting up a complex Watcher/Observer pattern for the Cerberus monitors would be work.
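
On point 3, a rough sketch of what coding for broken watches could look like with the kubernetes Python client - re-establish the watch when it drops and count the breakages as their own distress signal (the threshold is illustrative):

```python
import urllib3
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

broken_watches = 0
MAX_BREAKS = 5  # repeated breakage is itself a possible no-go indicator

while broken_watches < MAX_BREAKS:
    w = watch.Watch()
    try:
        for event in w.stream(v1.list_node, timeout_seconds=60):
            node = event["object"]
            # evaluate node conditions here (omitted)
    except (ApiException, urllib3.exceptions.ProtocolError):
        broken_watches += 1  # watch broke: note it and re-establish
        continue
    break  # stream ended cleanly after the timeout
```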

@chaitanyaenr
Collaborator Author

Here are a few enhancements which can help improve the scalability of Cerberus:

The current implementation of Cerberus monitors things like nodes and objects in a namespace as follows:

  • It queries the API to get the list of nodes, cluster operators, or objects in a namespace.
  • It then loops through that list to get the health status (an API call for each object).

We can cache the result of the first step in an in-memory database like Redis, since the list of nodes and pods in the system namespaces is not going to change very often, and re-use it until it expires (we can set the TTL to something like 15 minutes depending on the need). So, instead of making an API call to Kubernetes to get the list each time, we will query Redis until the entry expires, after which we will hit the Kube API to get the refreshed/latest list. This will reduce the number of API calls per iteration and the timing by a significant amount.
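
A minimal sketch of the caching idea, assuming a local Redis and the redis-py client (the key name and 15-minute TTL are illustrative):

```python
import json

import redis
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
cache = redis.Redis(host="localhost", port=6379, db=0)

CACHE_KEY = "cerberus:node_names"  # illustrative key
CACHE_TTL = 15 * 60                # refresh the cached list every 15 minutes

def get_node_names():
    cached = cache.get(CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    # Cache miss or expired entry: hit the Kube API once and repopulate.
    names = [n.metadata.name for n in v1.list_node().items]
    cache.setex(CACHE_KEY, CACHE_TTL, json.dumps(names))
    return names
```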

Also, instead of looping through the list and making an API call to get the status of each object, we can make a single request to dump the YAML or JSON of all the objects in a namespace, for example, and then scrape it to determine the status of the respective objects. This will also reduce the number of API calls per iteration. We need to make sure to set an appropriate chunk size so that large lists are returned in chunks rather than all at once.
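
And a sketch of the single bulk request with chunking, using the limit/continue support in the Kubernetes list API (the chunk size and namespace are just examples):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def namespace_pod_statuses(namespace, chunk_size=500):
    """One paginated LIST per namespace instead of one GET per pod."""
    statuses = {}
    _continue = None
    while True:
        resp = v1.list_namespaced_pod(namespace, limit=chunk_size,
                                      _continue=_continue)
        for pod in resp.items:
            statuses[pod.metadata.name] = pod.status.phase
        _continue = resp.metadata._continue
        if not _continue:
            break
    return statuses

print(namespace_pod_statuses("openshift-etcd"))
```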

@gnunn1

gnunn1 commented Jan 29, 2021

I have a customer that is looking to provide a single go/no-go signal to UptimeRobot for a production cluster. They were looking at various ways to aggregate cluster health, but it looks like Cerberus does this OOTB. I installed it against my personal cluster and it works great with UptimeRobot.

However, I'm wondering if this issue potentially makes it unsuitable for that use case - any thoughts? The customer's cluster is 40-50 bare metal servers.

@mffiedler
Collaborator

There have been several performance improvements since this issue was created. @chaitanyaenr @mohit-sheth can you comment on recent performance on medium/large clusters?

@chaitanyaenr
Collaborator Author

chaitanyaenr commented Jan 29, 2021

We added a bunch of enhancements to help with scalability, like @mffiedler mentioned, including multiprocessing, reducing the number of calls to the API, and reducing the load on the API server for the bulk requests, and we have used Cerberus without any issues on a 500-node cluster. Closing the issue since the scalability issues should be resolved now. @gnunn1 please feel free to let us know in case of any issues and we will be more than happy to resolve them. Thanks.
