Hello! We use memberlist in a distributed application within our stack. Our use is pretty simple: we just need it to maintain a list of members and pass a port number as metadata between nodes. The memberlist configuration is essentially the default LAN config, and we are currently running memberlist v0.3.1.
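For context, our setup is close to the minimal sketch below (the delegate type, port value, and join address are illustrative placeholders, not our actual code):

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/hashicorp/memberlist"
)

// portDelegate is a hypothetical delegate whose only job is to publish
// a port number as node metadata.
type portDelegate struct {
	port int
}

func (d *portDelegate) NodeMeta(limit int) []byte               { return []byte(strconv.Itoa(d.port)) }
func (d *portDelegate) NotifyMsg([]byte)                        {}
func (d *portDelegate) GetBroadcasts(overhead, limit int) [][]byte { return nil }
func (d *portDelegate) LocalState(join bool) []byte             { return nil }
func (d *portDelegate) MergeRemoteState(buf []byte, join bool)  {}

func main() {
	// Essentially the default LAN config, plus the metadata delegate.
	cfg := memberlist.DefaultLANConfig()
	cfg.Delegate = &portDelegate{port: 8080}

	list, err := memberlist.Create(cfg)
	if err != nil {
		panic(err)
	}
	defer list.Shutdown()

	// Join via any known peer; the address here is a placeholder.
	if _, err := list.Join([]string{"nirn-proxy-headless:7946"}); err != nil {
		fmt.Println("join failed:", err)
	}

	// Each member's Meta carries the port it advertised.
	for _, node := range list.Members() {
		fmt.Printf("%s -> proxy port %s\n", node.Name, node.Meta)
	}
}
```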
This workload is meshed using linkerd and hosted in a Kubernetes cluster. We misconfigured the linkerd resource limits, which caused one of the pods to lose network connectivity (the linkerd container was constantly being terminated). This should have caused memberlist to mark the node as dead, but that's not what we saw. The node remained part of the cluster, and the only indication in the memberlist logs that something was wrong was a push/pull failure to the problematic pod.
The incident started at roughly 17:55 UTC. Here are the logs for the problematic pod (pod name is nirn-proxy-8dfbb9b5c-tjc2w):
There are logs after this, but it's all the same: push/pull failures to all pods in the cluster (which makes sense, since this pod didn't have network).
Here is the view from all the other nodes in the cluster (aggregated logs from all the other pods; due to verbosity, I only included the errors the cluster saw):
[nirn-proxy-8dfbb9b5c-rt4n6]: 2022/07/16 17:57:08 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-2kvsg]: 2022/07/16 17:59:45 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 17:59:47 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-74zh9]: 2022/07/16 18:00:05 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-rq6kz]: 2022/07/16 18:00:06 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:00:17 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-44sgj]: 2022/07/16 18:00:33 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-r94kb]: 2022/07/16 18:00:44 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-rq6kz]: 2022/07/16 18:01:06 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:01:47 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-rq6kz]: 2022/07/16 18:02:07 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-2kvsg]: 2022/07/16 18:02:16 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:02:17 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-44sgj]: 2022/07/16 18:05:05 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
[nirn-proxy-8dfbb9b5c-czpgp]: 2022/07/16 18:05:47 [ERR] memberlist: Push/Pull with nirn-proxy-8dfbb9b5c-tjc2w failed: EOF
What I expected was a split, but instead the problematic pod remained part of the cluster and requests were still being routed to it, causing increased error rates across the cluster. The incident lasted some 30 minutes, and throughout it there was no indication of the pod ever being declared dead by the cluster.
From my understanding of memberlist and linkerd, UDP traffic is not proxied by the mesh, so what essentially happened is that push/pull was failing because it runs over TCP, while the probes (terminology might be wrong here, sorry) that run over UDP kept succeeding. Is my understanding correct? Is there any way to detect this condition and preemptively remove the node from the cluster? I'm open to suggestions or advice on how to handle this situation.
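One idea I've been toying with, just a sketch and not something memberlist provides out of the box: a small watchdog that periodically dials a peer's memberlist TCP port (the same path push/pull uses) and, if no peer is reachable over TCP while gossip still looks healthy, makes the local node leave the cluster. The function name, interval, timeout, and leave/shutdown behaviour below are all my own assumptions:

```go
package main

import (
	"log"
	"net"
	"strconv"
	"time"

	"github.com/hashicorp/memberlist"
)

// tcpSelfCheck is a hypothetical watchdog, not part of memberlist: it
// periodically tries to open a TCP connection to each peer's memberlist
// port. If every dial fails while the member list still shows peers,
// the local node leaves the cluster so others stop routing to it.
func tcpSelfCheck(list *memberlist.Memberlist, interval time.Duration) {
	for range time.Tick(interval) {
		peers := list.Members()
		if len(peers) <= 1 {
			continue // alone in the cluster, nothing to probe
		}

		reachable := false
		for _, node := range peers {
			if node.Name == list.LocalNode().Name {
				continue // skip ourselves
			}
			addr := net.JoinHostPort(node.Addr.String(), strconv.Itoa(int(node.Port)))
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err == nil {
				conn.Close()
				reachable = true
				break
			}
		}

		if !reachable {
			// TCP to every peer failed even though gossip still sees them:
			// assume the local TCP path (e.g. the mesh sidecar) is broken.
			log.Println("no peer reachable over TCP, leaving cluster")
			list.Leave(10 * time.Second)
			list.Shutdown()
			return
		}
	}
}

func main() {
	list, err := memberlist.Create(memberlist.DefaultLANConfig())
	if err != nil {
		log.Fatal(err)
	}
	go tcpSelfCheck(list, 30*time.Second)
	select {} // block forever; real code would do actual work here
}
```

The point of dialing the memberlist port directly is that it exercises the same TCP path push/pull uses, so a broken-but-meshed sidecar would be caught even though the UDP probes keep succeeding.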
The memberlist implementation can be found here