Partitions in Zeebe are stuck at 100% backpressure forever #4482

Closed
ashprojects opened this issue Jul 6, 2024 · 3 comments

Labels
type:bug Issues that describe a user-facing bug in the project.

Comments

ashprojects commented Jul 6, 2024

Environment (Required on creation)

Zeebe: 8.5.2
Total Partitions: 16
Nodes: 8
Each Zeebe node is a 16 GB, 4-core pod

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket)

We have noticed that some partitions get permanently stuck at 100% backpressure, even though the load is limited.
(screenshot)

All partitions are reported as healthy, but backpressure is at 100% for some of them.
(screenshot)

In some of the metrics I see the following:
(screenshot)

Jobs activated per second is also 0.
(screenshot)

PVC / CPU / memory usage is normal.
(screenshot)
(screenshot)
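
For reference, this is roughly how we check partition roles and health from the client side to confirm what the dashboards show. A minimal sketch, assuming the official Zeebe Java client (zeebe-client-java 8.5.x); the gateway address and plaintext setting are placeholders for our setup:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.BrokerInfo;
import io.camunda.zeebe.client.api.response.PartitionInfo;
import io.camunda.zeebe.client.api.response.Topology;

public class PartitionHealthCheck {

  public static void main(String[] args) {
    // Gateway address and plaintext are assumptions for a local/port-forwarded setup.
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {

      Topology topology = client.newTopologyRequest().send().join();

      // Print the role and reported health of every partition on every broker.
      for (BrokerInfo broker : topology.getBrokers()) {
        for (PartitionInfo partition : broker.getPartitions()) {
          System.out.printf(
              "broker=%d partition=%d role=%s health=%s%n",
              broker.getNodeId(),
              partition.getPartitionId(),
              partition.getRole(),
              partition.getHealth());
        }
      }
    }
  }
}
```

This matches the screenshots above: every partition reports a leader and HEALTHY, yet the throttled ones still reject new work.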

Some log excerpts from one of the brokers:

2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1987854ms
2024-07-06 12:40:52.846 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:40:55.347 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:02.847 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - No heartbeat from a known leader since 1997855ms
2024-07-06 12:41:02.848 [Broker-0] [raft-server-0-7] [raft-server-7] INFO
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Sending poll requests to all active members: [DefaultRaftMember{id=6, type=ACTIVE, updated=2024-07-05T07:26:31.484Z}, DefaultRaftMember{id=7
2024-07-06 12:41:05.348 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 6 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-6.camunda-zeebe.ca
2024-07-06 12:41:05.349 [Broker-0] [raft-server-0-7] [raft-server-7] WARN
      io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 7 failed: java.util.concurrent.TimeoutException: Request raft-partition-partition-7-poll to camunda-zeebe-7.camunda-zeebe.ca
2024-07-06 12:41:06.344 [Broker-0] [zb-actors-2] [InterPartitionCommandReceiverActor-8] WARN

Steps to reproduce (Required on creation)

Not really sure

Observed Behavior (Required on creation)

Partitions are stuck at 100% backpressure and the system is not responding.

Expected behavior (Required on creation)

Backpressure should be released automatically and the partitions should start accepting requests again.

Root Cause (Required on prioritization)

Solution Ideas

Hints

Links

Logs from when this happened are attached. Note that the log timestamps are in UTC, while the dashboard screenshots are in IST (UTC+5:30). Sorry for the mismatch.

logs-insights-results (4).csv


Dev2QA handover

  • Does this ticket need a QA test and the testing goals are not clear from the description? Add a Dev2QA handover comment
ashprojects added the type:bug label on Jul 6, 2024

ashprojects commented Jul 6, 2024

For more info:

  1. We have Elasticsearch exporters enabled.
  2. We have deployed this on AWS EKS, and all Zeebe pods run on different nodes.
  3. This happened a few days ago; a restart brought it back up, but now it has happened again.
  4. We have around 10M process instances, each of them waiting for a message correlation. According to the business use case, those messages are published later to let the instances continue (see the sketch at the end of this comment).
  5. Some exceptions we see when this happens:
2024-07-06 12:54:44.573 [Broker-4] [zb-actors-0] [] WARN
      io.camunda.zeebe.topology.gossip.ClusterTopologyGossiper - Failed to sync with 6
java.util.concurrent.CompletionException: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$25(NettyMessagingService.java:626) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-33.1.0-jre.jar:?]
    at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$executeOnPooledConnection$26(NettyMessagingService.java:624) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:48) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.AbstractClientConnection.dispatch(AbstractClientConnection.java:29) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.atomix.cluster.messaging.impl.NettyMessagingService$MessageDispatcher.channelRead0(NettyMessagingService.java:1109) ~[zeebe-atomix-cluster-8.5.2.jar:8.5.2]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918) ~[netty-transport-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) ~[netty-transport-classes-epoll-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.110.Final.jar:4.1.110.Final]
    at java.base/java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: io.atomix.cluster.messaging.MessagingException$RemoteHandlerFailure: Remote handler failed to handle message, cause: Failed to handle message, host camunda-zeebe-4.camunda-zeebe.camunda.svc:26502 is not a known cluster member
    ... 22 more
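
Regarding item 4 above: the waiting instances are continued by publishing a correlated message. A minimal sketch of that call, assuming the official Zeebe Java client; the message name, correlation key, and gateway address below are placeholders, not our real values:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class ReleaseMessage {

  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {

      // Hypothetical message name and correlation key; the real values come
      // from the business process the 10M instances are waiting in.
      client.newPublishMessageCommand()
          .messageName("payment-confirmed")
          .correlationKey("order-4711")
          .timeToLive(Duration.ofHours(1))
          .send()
          .join();
    }
  }
}
```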


ashprojects commented Jul 6, 2024

[Update] I had to manually analyse the topology and restart the brokers leading the throttled partitions, one by one.
This is happening twice a day, and in production. Something is definitely not right here.

I noticed a pattern: we end up in this state whenever we run a backup.
100% backpressure is observed when a backup is scheduled, and the backup itself eventually fails.
Could it be that taking a backup while the system is under load causes the election timeout of 2500 ms to be reached, leaving us in this state?
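
To check the backup/backpressure correlation, we now poll the backup status around the scheduled backup window. A rough sketch, assuming the backup management API is exposed on the management port (9600 by default) under actuator/backups as documented for Camunda 8, and with 42 standing in for a real backup ID:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackupStatusCheck {

  public static void main(String[] args) throws Exception {
    // Management port 9600 and the backup ID are assumptions/placeholders for our setup.
    long backupId = 42L;
    HttpClient http = HttpClient.newHttpClient();

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9600/actuator/backups/" + backupId))
        .GET()
        .build();

    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());

    // The body reports the per-partition backup state (e.g. completed or failed),
    // which we compare against the time window in which backpressure jumps to 100%.
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```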

ashprojects closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 7, 2024
yanavasileva (Member) commented

The ticket was incorrectly opened for Camunda 7. The user already reported the ticket for Camunda 8: camunda/camunda#20126
