
Rollover current chunk during shutdown #923

Closed
wants to merge 4 commits

Conversation

vthacker
Contributor

@vthacker vthacker commented May 9, 2024

Summary

Every time we do an indexer deploy, data goes missing for ~10 minutes until the recovery task completes. Our users notice this, and we want to address it.

Our Kubernetes graceful-shutdown timeout for a pod can be set to at most 180s internally, after which the pod will be killed forcibly.

Astra now uses the AWS S3 native library (the CRT client), which uploads/downloads chunks to/from S3 at close to the theoretical maximum of the underlying host.

Quick math:

We configure the chunk rollover size to 15 GB. Let's assume the worst case: we are almost at that limit when the shutdown hook is called.

We run our indexers on r5d.24xlarge nodes, which have 25 Gbps of network bandwidth. Let's assume 25 indexer pods on this host need to upload their chunks before shutdown,

meaning each pod gets roughly 1 Gbps on average. 15 GB is 120 Gb, and 120 Gb at 1 Gbps takes 120 s, i.e. exactly 2 minutes.

In the worst case of a full 15 GB chunk we may or may not make it within the 180 s budget. But with an average chunk size of 10 GB when the shutdown hook is called, we have a very good chance of succeeding.
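To make the intent concrete, here is a minimal sketch of the shutdown path under the 180 s grace budget above. The `ShutdownRollover` class and its `rolloverAndUploadCurrentChunk()` method are hypothetical names for illustration, not this PR's actual code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ShutdownRollover {
  // Hypothetical stand-in for the indexer's chunk rollover + S3 upload.
  static void rolloverAndUploadCurrentChunk() { /* upload chunk to S3 */ }

  static void registerHook() {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      ExecutorService executor = Executors.newSingleThreadExecutor();
      Future<?> upload = executor.submit(ShutdownRollover::rolloverAndUploadCurrentChunk);
      try {
        // Stay well inside the 180 s pod grace period so the JVM exits cleanly.
        upload.get(150, TimeUnit.SECONDS);
      } catch (TimeoutException e) {
        upload.cancel(true); // out of budget: abandon the upload, recovery re-indexes
      } catch (Exception e) {
        // Upload failed; the recovery task will rebuild this chunk as before.
      } finally {
        executor.shutdownNow();
      }
    }));
  }
}
```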

@vthacker vthacker force-pushed the attempt_to_upload_existing_chunk branch 2 times, most recently from 3432c20 to 365d7b4 on May 9, 2024 23:17
@vthacker vthacker force-pushed the attempt_to_upload_existing_chunk branch 12 times, most recently from c66f2f3 to 5a35537 on May 17, 2024 21:47
@vthacker
Contributor Author

vthacker commented May 22, 2024

Currently the effort is blocked by aws/aws-sdk-java-v2#3963

That bug report is now resolved. So during startup we can call CRT.acquireShutdownRef(), and then in the shutdown hook, after we've uploaded data to S3, call CRT.releaseShutdownRef().
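A minimal sketch of that sequencing, assuming a hypothetical uploadChunksToS3() method; only the two CRT calls come from the comment above:

```java
import software.amazon.awssdk.crt.CRT;

class IndexerLifecycle {
  static void start() {
    // Hold a CRT shutdown reference so the native event loops are not
    // torn down before our shutdown hook finishes its S3 uploads.
    CRT.acquireShutdownRef();
    Runtime.getRuntime().addShutdownHook(new Thread(IndexerLifecycle::onShutdown));
  }

  static void onShutdown() {
    try {
      uploadChunksToS3(); // hypothetical: roll over and upload the current chunk
    } finally {
      // Release the reference so the CRT can now shut down cleanly.
      CRT.releaseShutdownRef();
    }
  }

  static void uploadChunksToS3() { /* S3 CRT transfer happens here */ }
}
```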

@vthacker vthacker marked this pull request as ready for review May 22, 2024 02:33
@vthacker vthacker changed the title from "WIP: rollover current chunk during shutdown" to "Rollover current chunk during shutdown" May 22, 2024
@vthacker vthacker force-pushed the attempt_to_upload_existing_chunk branch from 5a35537 to 17915ac on May 22, 2024 02:45

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Jun 22, 2024
@bryanlb bryanlb removed the Stale label Jun 24, 2024
@mansu
Contributor

mansu commented Jun 28, 2024

In the past this change was not possible, since the k8s timeout couldn't be longer than 30s after a shutdown was issued. Are longer timeouts no longer an issue on k8s?

Alternatively, a better solution may be to incrementally upload Lucene segments to S3. That would solve two problems for us (see the sketch after this list):

  • On a shutdown we would only need to upload newly created chunks, not all the chunks. This makes shutdowns and deployments fast.
  • This incremental chunk upload functionality could be used for future features like making the indexer highly available.
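As a rough illustration of the incremental idea (not part of this PR), here is a sketch assuming a hypothetical uploadToS3 helper. Because Lucene segment files are write-once, tracking already-shipped file names is enough to upload only what's new:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.store.Directory;

class IncrementalSegmentUploader {
  private final Set<String> uploaded = new HashSet<>();

  // Upload only the Lucene files we have not shipped yet.
  void uploadNewFiles(Directory dir) throws IOException {
    for (String file : dir.listAll()) {
      if ("write.lock".equals(file)) {
        continue; // skip the Lucene lock file
      }
      if (uploaded.add(file)) {
        uploadToS3(dir, file); // hypothetical S3 transfer helper
      }
    }
  }

  void uploadToS3(Directory dir, String file) { /* CRT S3 upload */ }
}
```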


This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Jul 29, 2024
@github-actions github-actions bot closed this Aug 1, 2024
@bryanlb bryanlb deleted the attempt_to_upload_existing_chunk branch August 26, 2024 17:35