Hello! We use memberlist in a distributed application within our stack. Our use is pretty simple: we just need it to maintain a list of members and pass a port number as metadata between nodes. The memberlist configuration is essentially the default LAN config, and we are currently running memberlist v0.3.1.
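For context, our setup is close to the minimal sketch below (the delegate type, port value, and join address are illustrative placeholders, not our actual code):

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/hashicorp/memberlist"
)

// portDelegate is a hypothetical delegate whose only job is to publish
// a port number as node metadata.
type portDelegate struct {
	port int
}

func (d *portDelegate) NodeMeta(limit int) []byte               { return []byte(strconv.Itoa(d.port)) }
func (d *portDelegate) NotifyMsg([]byte)                        {}
func (d *portDelegate) GetBroadcasts(overhead, limit int) [][]byte { return nil }
func (d *portDelegate) LocalState(join bool) []byte             { return nil }
func (d *portDelegate) MergeRemoteState(buf []byte, join bool)  {}

func main() {
	// Essentially the default LAN config, plus the metadata delegate.
	cfg := memberlist.DefaultLANConfig()
	cfg.Delegate = &portDelegate{port: 8080}

	list, err := memberlist.Create(cfg)
	if err != nil {
		panic(err)
	}
	defer list.Shutdown()

	// Join via any known peer; the address here is a placeholder.
	if _, err := list.Join([]string{"nirn-proxy-headless:7946"}); err != nil {
		fmt.Println("join failed:", err)
	}

	// Each member's Meta carries the port it advertised.
	for _, node := range list.Members() {
		fmt.Printf("%s -> proxy port %s\n", node.Name, node.Meta)
	}
}
```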
This workload is meshed using linkerd and hosted in a Kubernetes cluster. We misconfigured the linkerd resource limits, which caused one of the pods to lose network connectivity (the linkerd container was constantly being terminated). This should have caused memberlist to mark the node as dead, but that's not what we saw. The node remained part of the cluster, and the only indication in the memberlist logs that something was wrong was a push/pull failure to the problematic pod.
The incident started at roughly 17:55 UTC. Here are the logs for the problematic pod (pod name is nirn-proxy-8dfbb9b5c-tjc2w):
There are logs after this, but it's all the same: push/pull failures to all pods in the cluster (which makes sense, since this pod didn't have network).
Here is the view from all the other nodes in the cluster (aggregated logs from all the other pods; due to verbosity, I only included the errors the cluster saw):
[nirn-proxy-8dfbb9b5c-rt4n6]: 2022/07/16 17:57:08 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-2kvsg]: 2022/07/16 17:59:45 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 17:59:47 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-74zh9]: 2022/07/16 18:00:05 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-rq6kz]: 2022/07/16 18:00:06 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:00:17 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-44sgj]: 2022/07/16 18:00:33 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-r94kb]: 2022/07/16 18:00:44 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-rq6kz]: 2022/07/16 18:01:06 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:01:47 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-rq6kz]: 2022/07/16 18:02:07 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-2kvsg]: 2022/07/16 18:02:16 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:02:17 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-44sgj]: 2022/07/16 18:05:05 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:05:47 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
What I expected was a split, but instead the problematic pod remained part of the cluster and requests were still being routed to it, causing increased error rates across the cluster. The incident lasted some 30 minutes, and throughout it there was no indication of the pod ever being declared dead by the cluster.
From my understanding of memberlist and linkerd, UDP traffic is not proxied by the mesh, so what essentially happened is that push/pull was failing because it runs over TCP, while the probes (terminology might be wrong here, sorry) that run over UDP kept succeeding. Is my understanding correct? Is there any way to detect this condition and preemptively remove the node from the cluster? I'm open to suggestions or advice on how to handle this situation.
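One idea I've been toying with, just a sketch and not something memberlist provides out of the box: a small watchdog that periodically dials a peer's memberlist TCP port (the same path push/pull uses) and, if no peer is reachable over TCP while gossip still looks healthy, makes the local node leave the cluster. The function name, interval, timeout, and leave/shutdown behaviour below are all my own assumptions:

```go
package main

import (
	"log"
	"net"
	"strconv"
	"time"

	"github.com/hashicorp/memberlist"
)

// tcpSelfCheck is a hypothetical watchdog, not part of memberlist: it
// periodically tries to open a TCP connection to each peer's memberlist
// port. If every dial fails while the member list still shows peers,
// the local node leaves the cluster so others stop routing to it.
func tcpSelfCheck(list *memberlist.Memberlist, interval time.Duration) {
	for range time.Tick(interval) {
		peers := list.Members()
		if len(peers) <= 1 {
			continue // alone in the cluster, nothing to probe
		}

		reachable := false
		for _, node := range peers {
			if node.Name == list.LocalNode().Name {
				continue // skip ourselves
			}
			addr := net.JoinHostPort(node.Addr.String(), strconv.Itoa(int(node.Port)))
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err == nil {
				conn.Close()
				reachable = true
				break
			}
		}

		if !reachable {
			// TCP to every peer failed even though gossip still sees them:
			// assume the local TCP path (e.g. the mesh sidecar) is broken.
			log.Println("no peer reachable over TCP, leaving cluster")
			list.Leave(10 * time.Second)
			list.Shutdown()
			return
		}
	}
}

func main() {
	list, err := memberlist.Create(memberlist.DefaultLANConfig())
	if err != nil {
		log.Fatal(err)
	}
	go tcpSelfCheck(list, 30*time.Second)
	select {} // block forever; real code would do actual work here
}
```

The point of dialing the memberlist port directly is that it exercises the same TCP path push/pull uses, so a broken-but-meshed sidecar would be caught even though the UDP probes keep succeeding.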
The memberlist implementation can be found here