Replies: 2 comments 1 reply
-
Some availability drop is inevitable during a history node deployment, because each node owns a set of history shards, and those shards are on the critical path for the workflows they contain. But it shouldn't be 100% downtime if you do a rolling deployment. Cadence uses a consistent hashing ring for shard assignment, so shard movement during deployment is minimal. Technically, with 4 history pods, only about 25% of the shards should be affected at any given time if the rolling deployment is performed correctly. One thing that may help is to give the ring some time to stabilize after each node comes up. Rolling the deployment too fast can cause the issue you describe, because the ring doesn't have enough time to become stable.
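In Kubernetes terms, one way to slow the rollout so the ring can stabilize between pod replacements is the Deployment's rolling update strategy together with `minReadySeconds`. A minimal sketch, assuming a plain Deployment for the history service; the field names are standard Kubernetes, but the resource name, image, and the specific values (and how the banzai cloud chart exposes them) are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cadence-history          # hypothetical name; use whatever your chart generates
spec:
  replicas: 4
  # Wait this long after a new pod reports Ready before replacing the next one,
  # giving the membership ring time to stabilize between steps.
  minReadySeconds: 60
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # replace only one history pod at a time
      maxSurge: 0                # keep the total pod count fixed during the rollout
  selector:
    matchLabels:
      app: cadence-history
  template:
    metadata:
      labels:
        app: cadence-history
    spec:
      containers:
        - name: history
          image: ubercadence/server:latest   # placeholder image tag
```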
-
If anyone else has this issue, one possible cause of downtime is that the readinessProbe settings in the banzai cloud k8s helm chart are not set to appropriate values:
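A minimal sketch of such a probe on the history containers; the probe fields are standard Kubernetes, but the port and timing values below are illustrative assumptions, and the exact helm values key for passing them through depends on the chart version:

```yaml
# Illustrative readinessProbe for the history containers (values are assumptions;
# adjust the port to whatever your chart configures for the history service).
readinessProbe:
  tcpSocket:
    port: 7934            # assumed history RPC port; verify against your values.yaml
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
```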
-
Is it possible to perform rolling updates without downtime?
We are using the banzai cloud helm chart with some small changes on an EKS cluster.
For some reason, we do notice some downtime when performing a rolling update of the history deployment, or when a history pod is evicted (even though we have over-provisioned with a total of 4 history pods). No activities are started during that period, even though the eviction affects only 1 history pod. Not sure if we have something set up non-optimally.
Would adjusting the `terminationGracePeriodSeconds` value in the k8s deployment help?