Replies: 2 comments 1 reply
-
Some availability drop is inevitable during a history node deployment, because each node owns a set of history shards, and those shards are on the critical path for the workflows they contain. But it shouldn't be 100% downtime if you do a rolling deployment. Cadence uses a consistent hashing ring for shard assignment, so shard movement during deployment is minimal. Technically, with 4 history pods, only about 25% of the shards should be affected at any given time if the rolling deployment is performed correctly. One thing that may help is to give the ring some time to stabilize after each node comes up. Rolling the deployment too fast can cause the issue you describe, because the ring doesn't have enough time to become stable.
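In Kubernetes terms, one way to slow the rollout so the ring can stabilize between pod replacements is the Deployment's rolling update strategy together with `minReadySeconds`. A minimal sketch, assuming a plain Deployment for the history service; the field names are standard Kubernetes, but the resource name, image, and the specific values (and how the banzai cloud chart exposes them) are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cadence-history          # hypothetical name; use whatever your chart generates
spec:
  replicas: 4
  # Wait this long after a new pod reports Ready before replacing the next one,
  # giving the membership ring time to stabilize between steps.
  minReadySeconds: 60
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # replace only one history pod at a time
      maxSurge: 0                # keep the total pod count fixed during the rollout
  selector:
    matchLabels:
      app: cadence-history
  template:
    metadata:
      labels:
        app: cadence-history
    spec:
      containers:
        - name: history
          image: ubercadence/server:latest   # placeholder image tag
```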
-
If anyone else has this issue, one possible cause of downtime is that the readinessProbe settings in the banzai cloud k8s helm chart are not set to appropriate values:
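A minimal sketch of such a probe on the history containers; the probe fields are standard Kubernetes, but the port and timing values below are illustrative assumptions, and the exact helm values key for passing them through depends on the chart version:

```yaml
# Illustrative readinessProbe for the history containers (values are assumptions;
# adjust the port to whatever your chart configures for the history service).
readinessProbe:
  tcpSocket:
    port: 7934            # assumed history RPC port; verify against your values.yaml
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
```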
-
Is it possible to perform rolling updates without downtime?
We are using the banzai cloud helm chart with some small changes on an EKS cluster.
For some reason, we do notice some downtime when performing a rolling update of the history deployment, or when a history pod is evicted (even though we have over-provisioned with a total of 4 history pods). No activities are started during that period, even though the eviction affects only 1 history pod. Not sure if we have something set up non-optimally.
Would adjusting the `terminationGracePeriodSeconds` value in the k8s deployment help?