Rollover current chunk during shutdown #923
That bug report is resolved. So during startup we can call |
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.
In the past this change was not possible, since the k8s timeout couldn't be longer than 30 secs after a shutdown was issued. Are longer timeouts on k8s no longer an issue? Alternatively, would a better solution be to incrementally upload Lucene segments to S3? That would solve 2 problems for us:
Summary
Every time we do an indexer deploy, data goes missing for ~10 mins until the recovery task completes. Our users notice this, and we want to address it.
Our k8s timeout for a pod's graceful shutdown can be set to at most 180s internally, after which the pod is forcibly killed.
Astra now uses the S3 native library (CRT client), which uploads and downloads chunks to and from S3 at close to the theoretical maximum throughput of the underlying host.
Quick math:
We configure a chunk rollover size of 15GB. Let's assume the worst case: we are almost at the limit when the shutdown hook is called.
We run our indexers on r5d.24xlarge nodes, which have 25 Gbps of network bandwidth. Let's assume we have 25 indexer pods on this host that need to upload their chunks before shutdown, meaning each pod gets roughly 1Gbps on average. 15GB is 120Gb, so at 1Gbps the upload takes about 2 mins (120s).
In the worst case of a 15GB chunk, we may or may not make it within the 180s limit. But with an average chunk size of 10GB (about 80s at 1Gbps) when the shutdown hook is called, we have a very good chance of succeeding.
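The quick math above can be reproduced with a small sketch. This is only a back-of-the-envelope model, not code from this PR: the function name `upload_seconds` and the constant `GRACE_PERIOD_S` are made up for illustration, GB is taken as 10^9 bytes, the host bandwidth is assumed to be shared equally across pods, and S3 request overhead is ignored.

```python
GRACE_PERIOD_S = 180  # max k8s graceful shutdown timeout quoted above


def upload_seconds(chunk_gb: float, host_gbps: float, pods_per_host: int) -> float:
    """Estimate the seconds needed to upload one chunk from one pod.

    Assumes every pod on the host uploads at the same time and gets an
    equal share of the host's network bandwidth (an idealization).
    """
    per_pod_gbps = host_gbps / pods_per_host  # 25 Gbps / 25 pods = 1 Gbps
    chunk_gigabits = chunk_gb * 8             # gigabytes -> gigabits
    return chunk_gigabits / per_pod_gbps


if __name__ == "__main__":
    # Worst case (15GB rollover size) and the assumed average (10GB).
    for chunk_gb in (15, 10):
        secs = upload_seconds(chunk_gb, host_gbps=25, pods_per_host=25)
        headroom = GRACE_PERIOD_S - secs
        print(f"{chunk_gb}GB chunk: {secs:.0f}s upload, {headroom:.0f}s headroom")
```

Under these assumptions the 15GB case leaves about a minute of headroom on paper, which real-world overhead (closing the chunk, snapshot metadata, uneven bandwidth sharing) could easily consume; that is why the worst case is described as uncertain.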