
Druid components autoscaling best practices #40

Open · itamar-marom opened this issue Apr 18, 2023 · 5 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

@itamar-marom (Collaborator) commented Apr 18, 2023

Started in this Slack thread.

We need an answer for how to scale each component of Druid.
Middle Managers are on the way to becoming dynamically provisioned, which will solve this for them.
The biggest problem is autoscaling Historicals, where storage must also be taken into account.
Should the operator handle that? Should we have a smart third-party autoscaler (like KEDA)?

@sbashar commented Apr 19, 2023

Apache Druid Historical nodes are more intricate to configure than Middle Manager nodes. While it may seem straightforward to establish a rule such as scaling up when disk usage exceeds 80%, scaling Historical nodes also requires careful consideration of CPU and memory usage. As a result, scaling Historicals involves a significant amount of complex logic. This could be an interesting project to work on.
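To make that concrete, here is a minimal sketch of the kind of multi-signal check described above: disk usage alone is not enough, so the decision combines disk, CPU, and memory utilization. The metric names and thresholds are illustrative assumptions, not part of druid-operator.

```python
from dataclasses import dataclass

@dataclass
class HistoricalMetrics:
    disk_used_fraction: float   # segment-cache volume usage, 0.0 - 1.0
    cpu_used_fraction: float    # container CPU usage vs. limit, 0.0 - 1.0
    mem_used_fraction: float    # container memory usage vs. limit, 0.0 - 1.0

def needs_scale_up(m: HistoricalMetrics,
                   disk_threshold: float = 0.80,
                   cpu_threshold: float = 0.85,
                   mem_threshold: float = 0.85) -> bool:
    """Scale up when any one resource is close to saturation."""
    return (m.disk_used_fraction > disk_threshold
            or m.cpu_used_fraction > cpu_threshold
            or m.mem_used_fraction > mem_threshold)

# Example: disk is fine but CPU is near its limit, so we still scale up.
print(needs_scale_up(HistoricalMetrics(0.55, 0.92, 0.60)))  # True
```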

@saithal-confluent (Contributor)

Scaling down components like MiddleManagers/Indexers and Historicals would need both custom logic in the operator and longer wait times, depending on how they are taken out of service without data loss or service impact.
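As a rough illustration of the "longer wait times" part for MiddleManagers: disable the worker so it stops receiving new tasks, wait for its running tasks to drain, and only then remove the pod. The endpoint paths follow the Druid MiddleManager worker API (`/druid/worker/v1/...`) but should be verified against the Druid version in use; the URL, port, and timeouts below are placeholder assumptions.

```python
import time
import requests

def drain_middle_manager(worker_url: str,
                         poll_seconds: int = 30,
                         timeout_seconds: int = 3600) -> bool:
    """Disable a MiddleManager and wait until it has no running tasks."""
    # Stop new task assignments to this worker.
    requests.post(f"{worker_url}/druid/worker/v1/disable", timeout=10).raise_for_status()

    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        tasks = requests.get(f"{worker_url}/druid/worker/v1/tasks", timeout=10).json()
        if not tasks:
            return True           # no running tasks; safe to remove the pod
        time.sleep(poll_seconds)  # tasks still running; keep waiting
    return False                  # drain did not finish in time; do not delete the pod

if __name__ == "__main__":
    drained = drain_middle_manager("http://druid-middlemanager-2.druid.svc:8091")
    print("safe to scale in" if drained else "drain timed out")
```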

If someone has started work on autoscaling of Druid components, or is about to start, could they loop me in as well, please?

@AdheipSingh (Contributor)

@saithal-confluent you can join the #druid-operator channel in the Kubernetes Slack.

Also, we are planning to start a working group on druid-operator.

@itamar-marom (Collaborator, Author)

Anyone who wants to help: I would love to hear your thoughts on autoscaling Druid Historical nodes. To start the conversation, I'll add some initial considerations.

Rules:

  • Scaling will be done 1 by 1

Storage Concerns

When you have a large volume of data to accommodate in your Apache Druid cluster, you have two main options: adding more Historical nodes or resizing the volumes of the existing Historical nodes.

Option 1: size up the disks of the existing Historicals.

The downside is that if another Historical is later needed, it will add a lot of storage to the cluster, and we cannot size the disks back down. Also, very large volumes may hurt query performance.

Option 2: scale out with the same storage size per node.

The downside is that it can lead to many Historical nodes, which can affect the performance of other components.
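For option 2, the target replica count can be estimated from the total segment volume the Historicals must serve and how much each node may cache. `druid.server.maxSize` is the Druid property bounding segment bytes per Historical; the fill target, replication factor, and sizes below are illustrative assumptions.

```python
import math

def required_historicals(total_segment_bytes: int,
                         per_node_max_size_bytes: int,   # druid.server.maxSize
                         target_fill: float = 0.80,      # headroom, mirrors the 80% rule
                         replication_factor: int = 2) -> int:
    """How many Historicals keep the segment cache below the target fill level."""
    usable_per_node = per_node_max_size_bytes * target_fill
    return math.ceil(total_segment_bytes * replication_factor / usable_per_node)

# Example: 8 TiB of unique segments, 2x replication, 1.5 TiB cache per node.
tib = 1024 ** 4
print(required_historicals(8 * tib, int(1.5 * tib)))  # 14 nodes at 80% target fill
```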

When to scale out? - requires one condition

  • Performance degradation:
    • CPU throttling
    • RAM
    • Disk IO
    • Latency
  • No other component needs to scale out (?)

When to scale in? - requires all conditions

  • Data can be stored in fewer nodes (?)
  • Utilization is low
  • Other components won’t be affected
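A rough sketch tying these conditions together: scale out when any performance signal degrades, scale in only when every condition holds, and always move one replica at a time (the "1 by 1" rule above). The signal names are assumptions about what the operator or an external autoscaler (e.g. KEDA) could collect; nothing here is an existing druid-operator API.

```python
from dataclasses import dataclass

@dataclass
class HistoricalSignals:
    cpu_throttled: bool           # CPU throttling observed
    ram_pressure: bool            # memory close to limits
    disk_io_saturated: bool       # disk IO wait is high
    query_latency_degraded: bool  # latency SLO breached
    data_fits_on_fewer_nodes: bool
    utilization_low: bool
    other_components_unaffected: bool

def desired_replicas(current: int, s: HistoricalSignals,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Return the next replica count, changing by at most one node."""
    scale_out = (s.cpu_throttled or s.ram_pressure
                 or s.disk_io_saturated or s.query_latency_degraded)
    scale_in = (s.data_fits_on_fewer_nodes
                and s.utilization_low
                and s.other_components_unaffected)
    if scale_out:
        return min(current + 1, max_replicas)   # scale out one node at a time
    if scale_in:
        return max(current - 1, min_replicas)   # scale in one node at a time
    return current

# Example: latency degraded -> grow from 5 to 6 replicas.
print(desired_replicas(5, HistoricalSignals(False, False, False, True,
                                            False, False, True)))  # 6
```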

@itamar-marom (Collaborator, Author)

@cyril-corbon WDYT about this?

itamar-marom added the documentation and question labels on Aug 20, 2023
4 participants