
Rollover current chunk during shutdown #923

Closed
wants to merge 4 commits

Conversation

vthacker
Contributor

@vthacker vthacker commented May 9, 2024

Summary

Every time we do an indexer deploy, data goes missing for ~10 minutes until the recovery task completes. Our users notice this, and we want to address it.

Our Kubernetes graceful-shutdown timeout for a pod can be set to at most 180s internally, after which the pod will be killed forcibly.

Astra now uses the AWS S3 native library (the CRT client), which uploads/downloads chunks to/from S3 at close to the theoretical maximum of the underlying host.

Quick math:

We configure the chunk rollover size to 15 GB. Let's assume the worst case: we are almost at that limit when the shutdown hook is called.

We run our indexers on r5d.24xlarge nodes, which have 25 Gbps of network bandwidth. Let's assume 25 indexer pods on this host need to upload their chunks before shutdown,

meaning each pod gets roughly 1 Gbps on average. 15 GB is 120 Gb, and 120 Gb at 1 Gbps takes 120 s, i.e. exactly 2 minutes.

In the worst case of a full 15 GB chunk we may or may not make it within the 180 s budget. But with an average chunk size of 10 GB when the shutdown hook is called, we have a very good chance of succeeding.
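To make the intent concrete, here is a minimal sketch of the shutdown path under the 180 s grace budget above. The `ShutdownRollover` class and its `rolloverAndUploadCurrentChunk()` method are hypothetical names for illustration, not this PR's actual code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ShutdownRollover {
  // Hypothetical stand-in for the indexer's chunk rollover + S3 upload.
  static void rolloverAndUploadCurrentChunk() { /* upload chunk to S3 */ }

  static void registerHook() {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      ExecutorService executor = Executors.newSingleThreadExecutor();
      Future<?> upload = executor.submit(ShutdownRollover::rolloverAndUploadCurrentChunk);
      try {
        // Stay well inside the 180 s pod grace period so the JVM exits cleanly.
        upload.get(150, TimeUnit.SECONDS);
      } catch (TimeoutException e) {
        upload.cancel(true); // out of budget: abandon the upload, recovery re-indexes
      } catch (Exception e) {
        // Upload failed; the recovery task will rebuild this chunk as before.
      } finally {
        executor.shutdownNow();
      }
    }));
  }
}
```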

@vthacker vthacker force-pushed the attempt_to_upload_existing_chunk branch 2 times, most recently from 3432c20 to 365d7b4 on May 9, 2024 23:17
@vthacker vthacker force-pushed the attempt_to_upload_existing_chunk branch 12 times, most recently from c66f2f3 to 5a35537 on May 17, 2024 21:47
@vthacker
Contributor Author

vthacker commented May 22, 2024

Currently the effort is blocked by aws/aws-sdk-java-v2#3963

That bug report is now resolved. So during startup we can call CRT.acquireShutdownRef(), and then in the shutdown hook, after we've uploaded data to S3, call CRT.releaseShutdownRef().
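A minimal sketch of that sequencing, assuming a hypothetical uploadChunksToS3() method; only the two CRT calls come from the comment above:

```java
import software.amazon.awssdk.crt.CRT;

class IndexerLifecycle {
  static void start() {
    // Hold a CRT shutdown reference so the native event loops are not
    // torn down before our shutdown hook finishes its S3 uploads.
    CRT.acquireShutdownRef();
    Runtime.getRuntime().addShutdownHook(new Thread(IndexerLifecycle::onShutdown));
  }

  static void onShutdown() {
    try {
      uploadChunksToS3(); // hypothetical: roll over and upload the current chunk
    } finally {
      // Release the reference so the CRT can now shut down cleanly.
      CRT.releaseShutdownRef();
    }
  }

  static void uploadChunksToS3() { /* S3 CRT transfer happens here */ }
}
```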

@vthacker vthacker marked this pull request as ready for review May 22, 2024 02:33
@vthacker vthacker changed the title from "WIP: rollover current chunk during shutdown" to "Rollover current chunk during shutdown" May 22, 2024
@vthacker vthacker force-pushed the attempt_to_upload_existing_chunk branch from 5a35537 to 17915ac on May 22, 2024 02:45

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Jun 22, 2024
@bryanlb bryanlb removed the Stale label Jun 24, 2024
@mansu
Contributor

mansu commented Jun 28, 2024

In the past this change was not possible, since the k8s timeout couldn't be longer than 30s after a shutdown was issued. Are longer timeouts no longer an issue on k8s?

Alternatively, a better solution may be to incrementally upload Lucene segments to S3. That would solve two problems for us (see the sketch after this list):

  • On a shutdown we would only need to upload newly created chunks, not all the chunks. This makes shutdowns and deployments fast.
  • This incremental chunk upload functionality could be used for future features like making the indexer highly available.
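As a rough illustration of the incremental idea (not part of this PR), here is a sketch assuming a hypothetical uploadToS3 helper. Because Lucene segment files are write-once, tracking already-shipped file names is enough to upload only what's new:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.store.Directory;

class IncrementalSegmentUploader {
  private final Set<String> uploaded = new HashSet<>();

  // Upload only the Lucene files we have not shipped yet.
  void uploadNewFiles(Directory dir) throws IOException {
    for (String file : dir.listAll()) {
      if ("write.lock".equals(file)) {
        continue; // skip the Lucene lock file
      }
      if (uploaded.add(file)) {
        uploadToS3(dir, file); // hypothetical S3 transfer helper
      }
    }
  }

  void uploadToS3(Directory dir, String file) { /* CRT S3 upload */ }
}
```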


This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Jul 29, 2024
@github-actions github-actions bot closed this Aug 1, 2024
@bryanlb bryanlb deleted the attempt_to_upload_existing_chunk branch August 26, 2024 17:35