Skip to content

App remains connected to wrong Redis node after failover which leads to writing against read-only replica #2668

Closed
@mariusstaicu

Description

@mariusstaicu

We're using spring boot app with redis support via lettuce to connect to a redis sentinel deployment in kubernetes. The spring-boot version is 2.7.12.

There are 3 sentinel nodes with 3 redis nodes in different pods.

There is a very specific scenario with a 30-40% reproduction rate in which the spring app remains connected to the former master node after a failover which leads to being unable to write to readonly replica.

The scenario is as follows:

  1. with the app up and running and connected to the sentinels
  2. we need to restart master redis node (not re-create it) -> to do this, ssh into master node and kill redis process; same result is achieved when kubernetes is OOM killing the pod
  3. sentinels detect the master restart and wait for it to respond
  4. master restarts immediately, but not in master mode
  5. the app reconnects to the node that just restarted
  6. sentinels trigger a failover since the master is not responding anymore
  7. another node becomes master
  8. in 30-40% of the cases the app does not switch the new redis master and cannot write anymore to the node it is connected, because it is connected to a replica

The implementation of the spring app is standard and it connects to a k8s service pointing to the sentinel nodes.

spring:
  redis:
    sentinel:
      nodes: rfr-redisfailover.infrastructure.svc:26379
      master: mymaster

There is a little customization for spring redis that overrides default serializer used by RedisTemplate and replaces it with Jackson and another one that sets up a message listener for some redis keys expiration events, however I don't think this is relevant here.

Logs:

1. sentinel logs

08 Aug 2023 15:02:19.068 * +reboot master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:20.743 * +reboot master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:21.250 * +reboot master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:44.089 # +sdown master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:45.795 # +sdown master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:45.872 # +odown master mymaster 10.42.0.15 6379 #quorum 2/2
08 Aug 2023 15:02:45.872 # +new-epoch 26
08 Aug 2023 15:02:45.872 # +try-failover master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:45.881 # +vote-for-leader 999b68d9cd15e085dbff865fc31dcaf6382874ff 26
08 Aug 2023 15:02:45.888 # +new-epoch 26
08 Aug 2023 15:02:45.888 # +new-epoch 26
08 Aug 2023 15:02:45.895 # +vote-for-leader 999b68d9cd15e085dbff865fc31dcaf6382874ff 26
08 Aug 2023 15:02:45.895 # 9b6940dffbaaaadf8ba30068f854d46974ee1d41 voted for 999b68d9cd15e085dbff865fc31dcaf6382874ff 26
08 Aug 2023 15:02:45.895 # +vote-for-leader 999b68d9cd15e085dbff865fc31dcaf6382874ff 26
08 Aug 2023 15:02:45.895 # a91e3195f1db1d05b3abf7b7f209e3d80ef04ce3 voted for 999b68d9cd15e085dbff865fc31dcaf6382874ff 26
08 Aug 2023 15:02:45.936 # +elected-leader master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:45.936 # +failover-state-select-slave master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:45.992 # +selected-slave slave 10.42.0.16:6379 10.42.0.16 6379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:45.992 * +failover-state-send-slaveof-noone slave 10.42.0.16:6379 10.42.0.16 6379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.050 * +failover-state-wait-promotion slave 10.42.0.16:6379 10.42.0.16 6379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.165 # +odown master mymaster 10.42.0.15 6379 #quorum 2/2
08 Aug 2023 15:02:46.165 # Next failover delay: I will not start a failover before Tue Aug  8 15:03:06 2023
08 Aug 2023 15:02:46.265 # +sdown master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.320 # +odown master mymaster 10.42.0.15 6379 #quorum 3/2
08 Aug 2023 15:02:46.320 # Next failover delay: I will not start a failover before Tue Aug  8 15:03:06 2023
08 Aug 2023 15:02:46.599 # +promoted-slave slave 10.42.0.16:6379 10.42.0.16 6379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.600 # +failover-state-reconf-slaves master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.643 * +slave-reconf-sent slave 10.42.0.13:6379 10.42.0.13 6379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.643 # +config-update-from sentinel 999b68d9cd15e085dbff865fc31dcaf6382874ff 10.42.0.253 26379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.643 # +switch-master mymaster 10.42.0.15 6379 10.42.0.16 6379
08 Aug 2023 15:02:46.644 # +config-update-from sentinel 999b68d9cd15e085dbff865fc31dcaf6382874ff 10.42.0.253 26379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:46.644 # +switch-master mymaster 10.42.0.15 6379 10.42.0.16 6379
08 Aug 2023 15:02:46.644 * +slave slave 10.42.0.13:6379 10.42.0.13 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:46.644 * +slave slave 10.42.0.11:6379 10.42.0.11 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:46.644 * +slave slave 10.42.0.13:6379 10.42.0.13 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:46.644 * +slave slave 10.42.0.15:6379 10.42.0.15 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:46.644 * +slave slave 10.42.0.15:6379 10.42.0.15 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:46.968 # -odown master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:47.593 * +slave-reconf-inprog slave 10.42.0.13:6379 10.42.0.13 6379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:47.593 * +slave-reconf-done slave 10.42.0.13:6379 10.42.0.13 6379 @ mymaster 10.42.0.15 6379
08 Aug 2023 15:02:47.648 # +failover-end master mymaster 10.42.0.15 6379
08 Aug 2023 15:02:47.648 # +switch-master mymaster 10.42.0.15 6379 10.42.0.16 6379
08 Aug 2023 15:02:47.648 * +slave slave 10.42.0.13:6379 10.42.0.13 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:47.648 * +slave slave 10.42.0.15:6379 10.42.0.15 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:50.570 # +reset-master master mymaster 10.42.0.16 6379
08 Aug 2023 15:02:50.655 # +set master mymaster 10.42.0.16 6379 down-after-milliseconds 5000
08 Aug 2023 15:02:50.664 # +set master mymaster 10.42.0.16 6379 failover-timeout 10000
08 Aug 2023 15:02:50.671 # +set master mymaster 10.42.0.16 6379 down-after-milliseconds 5000
08 Aug 2023 15:02:50.677 # +set master mymaster 10.42.0.16 6379 failover-timeout 10000
08 Aug 2023 15:02:50.684 # +set master mymaster 10.42.0.16 6379 down-after-milliseconds 5000
08 Aug 2023 15:02:50.691 # +set master mymaster 10.42.0.16 6379 failover-timeout 10000
08 Aug 2023 15:02:50.774 * +sentinel sentinel 999b68d9cd15e085dbff865fc31dcaf6382874ff 10.42.0.253 26379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:51.329 * +sentinel sentinel 9b6940dffbaaaadf8ba30068f854d46974ee1d41 10.42.0.254 26379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:51.359 * +slave slave 10.42.0.13:6379 10.42.0.13 6379 @ mymaster 10.42.0.16 6379
08 Aug 2023 15:02:51.367 * +slave slave 10.42.0.15:6379 10.42.0.15 6379 @ mymaster 10.42.0.16 6379

2. spring app logs

2023-08-08 15:02:14.659  INFO 1 --- [xecutorLoop-1-5] i.l.core.protocol.ConnectionWatchdog     : Reconnecting, last destination was 10.42.0.15/10.42.0.15:6379
2023-08-08 15:02:14.659  INFO 1 --- [xecutorLoop-1-4] i.l.core.protocol.ConnectionWatchdog     : Reconnecting, last destination was 10.42.0.15/10.42.0.15:6379
2023-08-08 15:02:14.670  WARN 1 --- [llEventLoop-4-1] i.l.core.protocol.ConnectionWatchdog     : Cannot reconnect to [10.42.0.15/<unresolved>:6379]: finishConnect(..) failed: Connection refused: /10.42.0.15:6379
2023-08-08 15:02:14.671  WARN 1 --- [llEventLoop-4-2] i.l.core.protocol.ConnectionWatchdog     : Cannot reconnect to [10.42.0.15/<unresolved>:6379]: finishConnect(..) failed: Connection refused: /10.42.0.15:6379
2023-08-08 15:02:17.463  INFO 1 --- [llEventLoop-4-6] i.l.core.protocol.ReconnectionHandler    : Reconnected to 10.42.0.15/<unresolved>:6379
2023-08-08 15:02:17.463  INFO 1 --- [llEventLoop-4-5] i.l.core.protocol.ReconnectionHandler    : Reconnected to 10.42.0.15/<unresolved>:6379
2023-08-08 15:03:47.031  WARN 1 --- [container-0-C-1] r.e.r.t.device.DeviceEventHandler   : failed to save batch: [] with cause: Error in execution; nested exception is io.lettuce.core.RedisReadOnlyException: READONLY You can't write against a read only replica.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions