Pod Networking Issues #136
-
Hey @a-voitov-mitgo, we will try Cilium with the configs you mentioned and see if and where the problem occurs. Meanwhile, do you have any nodes stuck in recycling? We can take a look immediately.
-
@a-voitov-mitgo It seems there was a problem recycling the nodes. While we get to the root of it, we have resolved the stuck nodes for you. We're also looking into the Cilium networking problem and will get back to you ASAP.
-
@a-voitov-mitgo The ability to preserve the client IP in Gen1 load balancers should be available in an upcoming release; see #109.
-
@a-voitov-mitgo I tried to replicate this by installing Cilium with the config you shared (thanks again for that).
-
@sahil-lakhwani I realized that with this setting, Cilium is supposed to replace kube-proxy. But in my case, it seems Cilium is conflicting with kube-proxy. That is (supposedly) why I sometimes hit the bug where the network disappears: Cilium starts handling the traffic. I then tried disabling kube-proxy, but in that case ClusterIP routing doesn't work. In the Cilium documentation, in the section "Kubernetes Without kube-proxy: Troubleshooting", I found that there can be issues with BPF cgroup program attachment. To check this, you need to run a couple of commands, and in my case the output doesn't match what it should be. I don't yet know why; maybe it's a containerd configuration issue, or something else.
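In case it helps others following along, the checks I mean look roughly like this (paths are Cilium's defaults and may differ depending on how the agent is deployed):

```bash
# Check that a cgroup v2 filesystem is mounted where Cilium expects it
mount | grep cgroup2

# On the node, list the BPF programs attached to Cilium's cgroup root.
# With socket-LB working you should see the connect4/6, sendmsg4/6, etc.
# programs attached; in my case the output doesn't look like that.
bpftool cgroup tree /run/cilium/cgroupv2/
```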
-
Hi everyone!
First of all, we want to say that we really enjoy using Rackspace Spot — the pricing is great and the support team has been wonderful. Below is a description of our current setup and a few issues we’re running into. We’d really appreciate any advice or shared experience!
🚀 Our Architecture
🔄 Custom NAT via Cilium + IPSec
We use Cilium's egressGateway feature so we can route outbound traffic through dedicated OnDemand nodes via CiliumEgressGatewayPolicy.
The hostFirewall feature should help here, but in our case it conflicted with IPSec, so this remains unresolved.
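For context, our egress policy looks roughly like the sketch below (labels, CIDRs, and the egress IP are placeholders rather than our exact values; egress gateway also assumes BPF masquerading and kube-proxy replacement are enabled):

```yaml
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: egress-via-ondemand
spec:
  # Pods whose outbound traffic should be routed through the gateway
  selectors:
    - podSelector:
        matchLabels:
          app: example-workload          # placeholder label
  # Traffic to these destinations is SNATed on the gateway node
  destinationCIDRs:
    - "0.0.0.0/0"
  egressGateway:
    # Placeholder label marking our dedicated OnDemand egress nodes
    nodeSelector:
      matchLabels:
        node-role.example.com/egress: "true"
    # Optionally pin the source IP used for SNAT
    # egressIP: 203.0.113.10
```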
🌐 Preserving Client IP for Ingress
To keep the original client IP, we run our ingress in hostPort mode on OnDemand nodes.
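Roughly, the ingress side looks like the sketch below (image, labels, and namespace are placeholders, not our exact manifest):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-controller
  namespace: ingress
spec:
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      # Placeholder label pinning the controller to OnDemand nodes
      nodeSelector:
        node-role.example.com/ingress: "true"
      containers:
        - name: controller
          image: nginx:1.27              # stand-in for the real ingress controller image
          ports:
            # hostPort binds directly on the node, so the client source IP
            # reaches the controller without an extra SNAT hop
            - containerPort: 80
              hostPort: 80
            - containerPort: 443
              hostPort: 443
```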
🔍 Problems We're Facing
🚫 Intermittent Network Loss in Pods
Symptom:
Occasionally, newly created pods lose all in-cluster networking: DNS fails and they can't reach other cluster services. Internet access still works, but only if we replace the DNS address in resolv.conf with an external one.
The node itself and other pods on that node work normally. The problem only occurs in new pods on this node.
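To illustrate the symptom, a quick check from inside an affected pod looks roughly like this (pod name and external resolver are placeholders, and the image needs nslookup available):

```bash
# In-cluster DNS fails from the affected pod...
kubectl exec -it <affected-pod> -- nslookup kubernetes.default.svc.cluster.local

# ...while an external resolver still answers, which is why swapping the
# nameserver in resolv.conf restores internet access
kubectl exec -it <affected-pod> -- nslookup example.com 1.1.1.1
```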
Workaround:
Recycling the node (cordon → drain → recycle) usually fixes the issue, but the root cause remains unclear.
It might be related to our Cilium configuration, but we also can't rule out a problem on the Rackspace side.
A regular reboot of the node doesn't help, and other nodes in the cluster continue to work just fine.
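For completeness, the workaround amounts to roughly this (node name is a placeholder; the recycle itself is triggered from the Rackspace Spot side rather than kubectl):

```bash
# Stop new pods from being scheduled onto the affected node
kubectl cordon <node-name>

# Evict the existing workloads
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Then recycle the node via the Spot console/API and wait for the replacement
```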
🌀 Nodes Stuck in “Recycling”
Sometimes nodes get stuck in the Recycling state and remain there for days, with no further actions possible.
🤔 Our Questions
What can we do when a node gets stuck in Recycling?
Thank you so much for your time and help!
We really appreciate this community and Rackspace’s ongoing support 💙