This document provides steps to troubleshoot pod outbound connectivity issues when kube-egress-gateway is in use. Troubleshooting consists of two steps:
- Validate that the static egress gateway is successfully provisioned and that egress traffic has the right source IP.
- Validate pod provisioning and pod-gateway connectivity.
The first step is to check whether the kube-egress-gateway controller has successfully provisioned the Azure resources for your egress gateway. The controller reports status in the `StaticGatewayConfiguration` CR object. Run `kubectl get staticgatewayconfigurations -n <your namespace> <your sgw name> -o yaml` to check:
```yaml
apiVersion: egressgateway.kubernetes.azure.com/v1alpha1
kind: StaticGatewayConfiguration
metadata:
  ...
spec:
  ...
status:
  gatewayServerProfile:
    privateKeySecretRef:
      apiVersion: v1
      kind: Secret
      name: <your sgw name>
    publicKey: ***
    ip: 10.243.0.6 # ilb private IP in your vnet
    port: 6000
  egressIpPrefix: 1.2.3.4/31 # egress public IP prefix
```
The controller creates a secret storing the gateway-side wireguard private key, with the same namespace and name as your `StaticGatewayConfiguration`. This secret is referenced in the `.status.gatewayServerProfile.privateKeySecretRef` field. `publicKey` is the base64-encoded wireguard public key used by the gateway. `ip` is the gateway ILB frontend IP; it comes from the subnet provided in the Azure cloud config. `port` is the frontend and backend port of the LoadBalancing rule. All `StaticGatewayConfiguration`s deployed to the same gateway VMSS share the same ILB frontend and backend but have separate LoadBalancing rules with different ports. Most importantly, `egressIpPrefix` is the egress source IP prefix of the pods using this gateway. If any of these fields is missing from the status, describe the CR object and look for error events:
```
$ kubectl describe staticgatewayconfiguration -n <your namespace> <your sgw name>
```
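You can also confirm that the private key secret was created; as noted above, it lives in the same namespace and has the same name as the `StaticGatewayConfiguration`:

```
$ kubectl get secret -n <your namespace> <your sgw name>
```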
Furthermore, you can check the `kube-egress-gateway-controller-manager` log to see if there is any error:
```
$ kubectl logs -f -n kube-egress-gateway-system kube-egress-gateway-controller-manager-**********-*****
```
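To narrow the log down, you can filter for your gateway's name, for example:

```
$ kubectl logs -n kube-egress-gateway-system kube-egress-gateway-controller-manager-**********-***** | grep <your sgw name>
```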
The gateway DaemonSet controller manages another CR, `GatewayStatus`, to record the configurations on each node. This CR exists purely for debugging purposes. Run `kubectl get gatewaystatus -A` to list the existing `GatewayStatus` resources in the cluster:
```
$ kubectl get gatewaystatus -A
NAMESPACE                    NAME                    AGE
kube-egress-gateway-system   <gateway node 1 name>   107s
kube-egress-gateway-system   <gateway node 2 name>   106s
```
You should see one `GatewayStatus` object for each gateway node, with the same name as the node. If you don't see any object created, the gateway daemon is either still working on it or has encountered some error.
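In that case, first verify that the daemon pods themselves are running on the gateway nodes (a simple grep-based check):

```
$ kubectl get pods -n kube-egress-gateway-system -o wide | grep daemon-manager
```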
Look into a specific `GatewayStatus` object:
```yaml
apiVersion: egressgateway.kubernetes.azure.com/v1alpha1
kind: GatewayStatus
metadata:
  namespace: ...
  name: ...
  ownerReferences: <node object>
spec:
  readyGatewayNamespaces:
  - interfaceName: wg-6000
    staticGatewayConfiguration: <sgw namespace>/<sgw name>
  - ...
```
Check the `.spec.readyGatewayNamespaces` list to see whether your `StaticGatewayConfiguration` is included. If it is, the network namespace corresponding to the gateway has been successfully provisioned. Otherwise, check the `kube-egress-gateway-daemon-manager` log on the same node:
```
$ kubectl logs -f -n kube-egress-gateway-system kube-egress-gateway-daemon-manager-*****
```
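Since the daemon manager runs as a DaemonSet, you can locate the pod on a particular gateway node with a field selector:

```
$ kubectl get pods -n kube-egress-gateway-system -o wide --field-selector spec.nodeName=<gateway node name>
```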
After checking the CR objects, you can log in to the gateway node and check the network settings directly.
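If you don't have SSH access to the node, one way to get a node shell is `kubectl debug` (a minimal sketch; the busybox debug image below is an assumption, any small image with a shell works):

```
$ kubectl debug node/<gateway node name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
# inside the debug container, the node's root filesystem is mounted under /host
$ chroot /host
```

Once on the node: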
- Check the network namespace named `ns-static-egress-gateway`:

  ```
  $ ip netns
  ns-static-egress-gateway (id: 0)
  ```
- Check the network interfaces, routes, and iptables rules within the network namespace:
  - The network namespace has one `lo` interface and one `host0` interface to communicate with the host network namespace. There is one `wg-*` interface for each `StaticGatewayConfiguration`; the number after the `wg-` prefix corresponds to the `status.gatewayServerProfile.port` of the CR object.

    ```
    $ ip netns exec ns-static-egress-gateway ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    4: wg-6000: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000 # expect to see this wg-* interface for your StaticGatewayConfiguration
        link/none
        inet6 fe80::1/64 scope link
           valid_lft forever preferred_lft forever
    5: host0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether 7e:22:0f:8b:4c:da brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 10.243.0.7/32 scope global host0
           valid_lft forever preferred_lft forever
    ```
  - For routes, you should see that the IP of each pod using the gateway has been added to the routing table.

    ```
    $ ip netns exec ns-static-egress-gateway ip route
    default via 10.243.0.5 dev host0
    10.243.0.5 dev host0 scope link
    10.244.0.14 dev wg-6000 scope link # expect to see the pod's IP routed via the wg-* interface
    ```
  - For iptables rules, there are several rules added to masquerade packets; the SNAT target is the private IP dedicated to this `StaticGatewayConfiguration` (the `host0` address shown above). The rule names carry the `status.gatewayServerProfile.port` as a suffix, same as the wireguard interface.

    ```
    $ ip netns exec ns-static-egress-gateway iptables-save # traffic is masqueraded
    *nat
    :PREROUTING ACCEPT [0:0]
    :INPUT ACCEPT [0:0]
    :OUTPUT ACCEPT [0:0]
    :POSTROUTING ACCEPT [0:0]
    :EGRESS-GATEWAY-MARK-6000 - [0:0]
    :EGRESS-GATEWAY-SNAT-6000 - [0:0]
    -A PREROUTING -m comment --comment "kube-egress-gateway mark packets from gateway link wg-6000" -j EGRESS-GATEWAY-MARK-6000
    -A POSTROUTING -m comment --comment "kube-egress-gateway sNAT packets from gateway link wg-6000" -j EGRESS-GATEWAY-SNAT-6000
    -A EGRESS-GATEWAY-MARK-6000 -i wg-6000 -j CONNMARK --set-xmark 0x1770/0xffffffff
    -A EGRESS-GATEWAY-SNAT-6000 -o host0 -m connmark --mark 0x1770 -j SNAT --to-source 10.243.0.7
    COMMIT
    ```
- Check the wireguard setup. The public key and listening port should match the SGW's `.status.gatewayServerProfile.publicKey` and `.status.gatewayServerProfile.port` respectively. Note: if the `wg` command does not exist, run `apt install wireguard-tools` to install it.

  ```
  $ ip netns exec ns-static-egress-gateway wg
  interface: wg-6000
    public key: ******
    private key: (hidden)
    listening port: 6000

  peer: ***** # peer public key
    endpoint: 10.243.4.4:35678
    allowed ips: 10.244.0.14/32
    latest handshake: 10 minutes, 38 seconds ago
    transfer: 11.43 KiB received, 11.57 KiB sent
  ```
The second step is to validate pod provisioning and pod-gateway connectivity. First, check whether the pod is in "Running" state. If the pod gets stuck in "ContainerCreating" state, the kube-egress-gateway CNI plugin might be having trouble configuring the pod's network namespace. Check the kubelet log for any errors. The CNI plugin configuration is stored in `/etc/cni/net.d/` and the default conflist file is named `01-egressgateway.conflist`. Check that the file exists and that it is the first file in alphabetical order under `/etc/cni/net.d/` so that it takes effect. You may also check the cni manager's log and look for any errors:
```
$ kubectl logs -f -n kube-egress-gateway-system kube-egress-gateway-cni-manager-*****
```
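You can also verify the conflist ordering on the node; the main plugin's conflist name varies by cluster setup (`10-azure.conflist` below is just an assumption):

```
$ ls /etc/cni/net.d/
01-egressgateway.conflist  10-azure.conflist
```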
For each pod using an egress gateway, the kube-egress-gateway CNI plugin creates a corresponding `PodEndpoint` CR in the pod's namespace, with the same name as the pod. This CR contains the pod-side wireguard configuration so that the gateway side can add the pod as a peer. You can view the details of this object by running `kubectl get podendpoint -n <pod namespace> <pod name> -o yaml`:
```yaml
apiVersion: egressgateway.kubernetes.azure.com/v1alpha1
kind: PodEndpoint
metadata:
  ...
spec:
  podIpAddress: XXX.XXX.XXX.XXX/32
  podPublicKey: **********
  staticGatewayConfiguration: <SGC name>
```
The pod IPNet (provisioned by the main CNI plugin in the cluster), the pod-side wireguard public key, and the `StaticGatewayConfiguration` name are all provided. Make sure this object exists; otherwise, look for CNI plugin errors in the kubelet log.
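For example, on a node with a systemd-managed kubelet, you can grep its journal (a sketch; the filter terms are assumptions):

```
$ journalctl -u kubelet | grep -iE 'egress|cni'
```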
You can run crictl to get the pod's network namespace and then check the network and wireguard setup inside it.
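One way to locate the namespace (a sketch assuming containerd; the `jq` path follows containerd's `crictl inspectp` output and may differ for other runtimes):

```
$ POD_ID=$(crictl pods --name <pod name> -q)
$ crictl inspectp $POD_ID | jq -r '.info.runtimeSpec.linux.namespaces[] | select(.type=="network") | .path'
/var/run/netns/cni-*****
```

Inside the pod's network namespace: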
```
$ ip netns exec cni-***** ip addr # cni-***** is the pod's netns
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default # provisioned by the main CNI plugin
    link/ether b6:ee:c6:23:00:99 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.0.20/24 brd 10.244.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::b4ee:c6ff:fe23:99/64 scope link
       valid_lft forever preferred_lft forever
25: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000 # provisioned by the kube-egress-gateway CNI plugin
    link/none
    inet6 fe80::b4ee:c6ff:fe23:99/64 scope link # same as eth0 inet6 ip
       valid_lft forever preferred_lft forever
```
```
$ ip netns exec cni-***** ip route
default via inet6 fe80::1 dev wg0
10.244.0.1 dev eth0 scope link
```
```
$ ip netns exec cni-***** ip rule # make sure response packets from ingress are NOT routed to the gateway
0:      from all lookup local
32765:  from all fwmark 0x2222 lookup 8738
32766:  from all lookup main
32767:  from all lookup default
```
```
$ ip netns exec cni-***** iptables-save
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A PREROUTING -i eth0 -j MARK --set-xmark 0x2222/0xffffffff
-A PREROUTING -j CONNMARK --save-mark --nfmask 0xffffffff --ctmask 0xffffffff
-A OUTPUT -m connmark --mark 0x2222 -j CONNMARK --restore-mark --nfmask 0xffffffff --ctmask 0xffffffff
COMMIT
```
```
$ ip netns exec cni-***** wg
interface: wg0
  public key: ********** # should be same as the one in PodEndpoint CR
  private key: (hidden)
  listening port: 56959 # random

peer: ********** # gateway side wireguard public key
  endpoint: 10.243.0.6:6000 # <ilb frontend IP>:<LB rule port>
  allowed ips: 0.0.0.0/0, ::/0
  latest handshake: 1 hour, 1 minute, 8 seconds ago # wireguard connectivity is GOOD!
  transfer: 4.37 KiB received, 4.99 KiB sent
```
In particular, the handshake and transfer statistics from the `wg` command verify connectivity through the wireguard tunnel. You can run the `wg` command on the gateway node to confirm the peer has been added:
```
$ ip netns exec ns-static-egress-gateway wg
interface: wg-6000
  public key: **********
  private key: (hidden)
  listening port: 6000

peer: ********** # pod's wireguard public key
  endpoint: 10.243.4.4:56959
  allowed ips: 10.244.0.20/32 # pod's IP
  latest handshake: 9 minutes, 2 seconds ago # wireguard connectivity is GOOD!
  transfer: 3.80 KiB received, 4.49 KiB sent
```
Also, a ready peer configuration should be added in the `GatewayStatus` CR (`kubectl get gatewaystatus -n kube-egress-gateway-system <gateway node name> -o yaml`):
```yaml
apiVersion: egressgateway.kubernetes.azure.com/v1alpha1
kind: GatewayStatus
metadata:
  ...
spec:
  readyGatewayNamespaces:
  ...
  readyPeerConfigurations:
  - interfaceName: wg-6000
    podEndpoint: <pod namespace>/<pod name>
    publicKey: ****** # pod's wireguard public key
```
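At this point, you can also verify end to end that egress traffic carries the expected source IP by calling an IP echo service from the pod and comparing the result against `.status.egressIpPrefix` (ifconfig.me below is only an example endpoint, and the pod image must have curl):

```
$ kubectl exec -n <pod namespace> <pod name> -- curl -s ifconfig.me
1.2.3.4
```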
One important step in troubleshooting pod egress connectivity is to make sure traffic can be routed to one of the gateway VMSS instances by the gateway ILB. For this, you need to check the Azure LoadBalancer health probe status and see whether the backends are available.
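One way to read probe health (a sketch, assuming Azure CLI; replace the placeholders with your gateway ILB's name and the cluster's node resource group) is the load balancer's `DipAvailability` (Health Probe Status) metric:

```
$ LB_ID=$(az network lb show -g <node resource group> -n <gateway ilb name> --query id -o tsv)
$ az monitor metrics list --resource $LB_ID --metric DipAvailability --output table
```

You can also inspect the health probe status from the load balancer's Metrics blade in the Azure portal.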
If all of the above configurations look correct, the last step is to take a packet capture. You can run `tcpdump` to trace the egress packets inside a network namespace:

```
$ ip netns exec <network ns name> tcpdump -i <interface> -vvv
```
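For example, to follow a single egress flow from the pod through the gateway (the destination IP 1.1.1.1 is only an example; run the two captures on the pod's node and the gateway node respectively):

```
$ ip netns exec cni-***** tcpdump -i wg0 -nn host 1.1.1.1                     # pod side: packets entering the tunnel
$ ip netns exec ns-static-egress-gateway tcpdump -i wg-6000 -nn host 1.1.1.1  # gateway side: packets leaving the tunnel
```

On the gateway, you can also capture on `host0` to confirm the packets leave with the SNAT'ed source IP.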
Finally, note the following limitations:
- Due to the lack of native Wireguard support on Windows, pods in Windows node pools cannot use this feature, and the gateway node pool itself is limited to Linux as well.
- Due to an IPv6 secondary IP configuration limitation, this feature is currently not supported in dual-stack clusters.
- Because the CNI plugin sets up the pod-side network, existing pods must be restarted to use this feature.