Partitions in Zeebe are stuck at 100% backpressure forever #4482

Closed
ashprojects opened this issue Jul 6, 2024 · 3 comments

Labels
type:bug Issues that describe a user-facing bug in the project.

Comments

ashprojects commented Jul 6, 2024

Environment (Required on creation)

Zeebe: 8.5.2
Total Partitions: 16
Nodes: 8
Each Zeebe node is a 16 GB, 4-core pod

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket)

We have noticed that some partitions get permanently stuck at 100% backpressure, even though the load is limited.
(screenshot)

All partitions are reported as healthy, but backpressure is at 100% for some of them.
(screenshot)

In some of the metrics I see the following:
(screenshot)

Jobs activated per second is also 0.
(screenshot)

PVC / CPU / memory usage is normal.
(screenshot)
(screenshot)
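
For reference, this is roughly how we check partition roles and health from the client side to confirm what the dashboards show. A minimal sketch, assuming the official Zeebe Java client (zeebe-client-java 8.5.x); the gateway address and plaintext setting are placeholders for our setup:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.BrokerInfo;
import io.camunda.zeebe.client.api.response.PartitionInfo;
import io.camunda.zeebe.client.api.response.Topology;

public class PartitionHealthCheck {

  public static void main(String[] args) {
    // Gateway address and plaintext are assumptions for a local/port-forwarded setup.
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {

      Topology topology = client.newTopologyRequest().send().join();

      // Print the role and reported health of every partition on every broker.
      for (BrokerInfo broker : topology.getBrokers()) {
        for (PartitionInfo partition : broker.getPartitions()) {
          System.out.printf(
              "broker=%d partition=%d role=%s health=%s%n",
              broker.getNodeId(),
              partition.getPartitionId(),
              partition.getRole(),
              partition.getHealth());
        }
      }
    }
  }
}
```

This matches the screenshots above: every partition reports a leader and HEALTHY, yet the throttled ones still reject new work.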

Some log excerpts from one of the brokers:

2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1987854ms
2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:02.847 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1997855ms
2024-07-06 12:41:02.848 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:41:05.348 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:41:05.349 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:06.344 [Broker-0] [zb-actors-2] [InterPartitionCommandReceiverActor-8] WARN

Steps to reproduce (Required on creation)

Not really sure

Observed Behavior (Required on creation)

Partitions are stuck at 100% backpressure and the system is not responding.

Expected behavior (Required on creation)

Backpressure should be released automatically and the partitions should start accepting requests again.

Root Cause (Required on prioritization)

Solution Ideas

Hints

Links

Logs from when this happened are attached. Note that the log timestamps are in UTC, while the dashboard screenshots are in IST (UTC+5:30). Sorry for the mismatch.

logs-insights-results (4).csv


Dev2QA handover

  • Does this ticket need a QA test and the testing goals are not clear from the description? Add a Dev2QA handover comment
ashprojects added the type:bug label on Jul 6, 2024

ashprojects commented Jul 6, 2024

For more info:

  1. We have Elasticsearch exporters enabled.
  2. We have deployed this on AWS EKS, and all Zeebe pods run on different nodes.
  3. This happened a few days ago; a restart brought it back up, but now it has happened again.
  4. We have around 10M process instances, each of them waiting for a message correlation. According to the business use case, those messages are published later to let the instances continue (see the sketch at the end of this comment).
  5. Some exceptions we see when this happens:
2024-07-06 12:54:44.573 [Broker-4] [zb-actors-0] [] WARN
      io.camunda.zeebe.topology.gossip.ClusterTopologyGossiper - Failed to sync with 6
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$25(NettyMessagingService.java:626) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-33.1.0-jre.jar:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$26(NettyMessagingService.java:624) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:48) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:29) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.NettyMessagingService$MessageDispatcher.channelRead0(NettyMessagingService.java:1109) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502 is not a known cluster member
    ... 22 more
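
Regarding item 4 above: the waiting instances are continued by publishing a correlated message. A minimal sketch of that call, assuming the official Zeebe Java client; the message name, correlation key, and gateway address below are placeholders, not our real values:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class ReleaseMessage {

  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {

      // Hypothetical message name and correlation key; the real values come
      // from the business process the 10M instances are waiting in.
      client.newPublishMessageCommand()
          .messageName("payment-confirmed")
          .correlationKey("order-4711")
          .timeToLive(Duration.ofHours(1))
          .send()
          .join();
    }
  }
}
```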


ashprojects commented Jul 6, 2024

[Update] I had to manually analyse the topology and restart the brokers leading the throttled partitions, one by one.
This is happening twice a day, and in production. Something is definitely not right here.

I noticed a pattern: we end up in this state whenever we run a backup.
100% backpressure is observed when a backup is scheduled, and the backup itself eventually fails.
Could it be that taking a backup while the system is under load causes the election timeout of 2500 ms to be reached, leaving us in this state?
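
To check the backup/backpressure correlation, we now poll the backup status around the scheduled backup window. A rough sketch, assuming the backup management API is exposed on the management port (9600 by default) under actuator/backups as documented for Camunda 8, and with 42 standing in for a real backup ID:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackupStatusCheck {

  public static void main(String[] args) throws Exception {
    // Management port 9600 and the backup ID are assumptions/placeholders for our setup.
    long backupId = 42L;
    HttpClient http = HttpClient.newHttpClient();

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9600/actuator/backups/" + backupId))
        .GET()
        .build();

    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());

    // The body reports the per-partition backup state (e.g. completed or failed),
    // which we compare against the time window in which backpressure jumps to 100%.
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```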

ashprojects closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 7, 2024
yanavasileva (Member) commented

The ticket was incorrectly opened for Camunda 7. The user already reported the ticket for Camunda 8: camunda/camunda#20126
