While MetalLB has long been the standard and many CNIs now support BGP advertisement, issues still remain:
- MetalLB L2:
  - Does not offer any load balancing between service replicas, and throughput is limited to a single node
  - Slow failover
- BGP solutions, including MetalLB, Calico, Cilium and kube-router, have other limitations:
  - Forward all non-peer traffic through a default gateway. This limits your bandwidth to the cluster and adds an extra hop
  - Can suffer from asymmetric routing issues on LANs and generally require disabling ICMP redirects
  - Require a BGP-capable router at all times, which can limit flexibility
  - Nodes generally get a static subnet, so BGP accomplishes very little: neither Cilium nor Flannel actually uses it to "distribute" routes between nodes, since the routes are readily available from the APIServer
Furthermore, other load-balancing solutions tend to be much heavier, requiring daemonsets that used between 15-100m CPU and 35-150Mi of RAM in my tests. This amounts to undue energy usage and less room for your actual applications. flannel is particularly well suited: in host-gw mode it performs native routing similar to the other CNIs, with no VXLAN penalty, while using only 1m/10Mi per node.
Lastly, all other solutions rely on CRDs, which makes bootstrapping a cluster that much more difficult.
At startup minilb looks up all routes to nodes and prints them out for you, so you can set them on your default gateway or even directly on devices. The manual step is similar to adding each node as a BGP peer, except you just add a static route to the node. The podCIDRs are normally assigned by kube-controller-manager and are static once the node is provisioned.
On startup minilb prints:
Add the following routes to your default gateway (router):
ip route add 10.244.0.0/24 via 192.168.1.30
ip route add 10.244.1.0/24 via 192.168.1.31
ip route add 10.244.2.0/24 via 192.168.1.32
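If you want to cross-check these against what the cluster actually allocated, the podCIDRs live on the node objects and can be listed with plain kubectl:
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR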
Example queries that minilb handles:
2024/05/11 13:11:06 DNS server started on :53
;; opcode: QUERY, status: NOERROR, id: 10290
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;mosquitto.automation.minilb. IN A
;; ANSWER SECTION:
mosquitto.automation.minilb. 5 IN A 10.244.19.168
mosquitto.automation.minilb. 5 IN A 10.244.1.103
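To issue a query like this by hand, point dig at wherever the minilb resolver is reachable from your machine (the address below is a placeholder):
$ dig @<minilb-address> mosquitto.automation.minilb A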
The idea is that the router has static routes for the podCIDRs of each node (based on the node spec), and we run a resolver which resolves the service "hostname" to pod IPs. One of the benefits is that you can advertise the static routes over DHCP to remove the hop through the router for traffic local to the LAN (see the dnsmasq sketch below). This also means you don't need BGP and can use any router that supports static routes. To make ingresses work, the controller sets the status.loadBalancer.Hostname of each service to the hostname that resolves to the pods; that way external-dns and k8s-gateway will CNAME your defined Ingress hosts to the associated .minilb record.
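To sketch the DHCP idea: if your DHCP server happens to be dnsmasq, classless static routes (DHCP option 121) can push the same per-node routes shown earlier to LAN clients. The podCIDRs and node addresses below are the example values from above, so adapt them to your cluster:
# dnsmasq.conf: advertise the per-node podCIDR routes to DHCP clients
dhcp-option=option:classless-static-route,10.244.0.0/24,192.168.1.30,10.244.1.0/24,192.168.1.31,10.244.2.0/24,192.168.1.32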
minilb updates the external IPs of LoadBalancer services to the configured domain:
$ k get svc -n haproxy internal-kubernetes-ingress
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
internal-kubernetes-ingress LoadBalancer 10.110.115.188 internal-kubernetes-ingress.haproxy.minilb 80:...
These resolve directly to pod IPs, which your network knows how to route:
$ nslookup internal-kubernetes-ingress.haproxy.minilb
Server: 192.168.1.1
Address: 192.168.1.1#53
Name: internal-kubernetes-ingress.haproxy.minilb
Address: 10.244.19.176
Name: internal-kubernetes-ingress.haproxy.minilb
Address: 10.244.1.104
When k8s-gateway or external-dns is present, it will CNAME any Ingress hosts to the minilb service hostname.
$ k get ingress paperless-ngx
NAME CLASS HOSTS ADDRESS PORTS AGE
paperless-ngx haproxy-internal paperless.sko.ai internal-kubernetes-ingress.haproxy.minilb 80 22d
$ curl -I https://paperless.sko.ai:8443
HTTP/2 302
content-length: 0
location: https://gate.sko.ai/?rd=https://paperless.sko.ai:8443/
cache-control: no-cache
minilb expects your default gateway to have static routes for the nodes' podCIDRs. To help set that up, it prints the podCIDRs assigned by kube-controller-manager on startup. Typically this allocation is achieved by running kube-controller-manager with the --allocate-node-cidrs flag.
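For reference, a minimal sketch of the relevant kube-controller-manager flags; the cluster CIDR below is just the example range used throughout this page, and most distributions (for example kubeadm when given --pod-network-cidr) already configure this for you:
# excerpt: flags relevant to podCIDR allocation (keep your other flags as-is)
kube-controller-manager \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.244.0.0/16 \
  --node-cidr-mask-size=24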
Both flanneld and kube-router should require no additional configuration, as they use podCIDRs by default.
For Cilium, the Kubernetes Host Scope IPAM mode should be used; the default is Cluster Scope.
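If you install Cilium with Helm, this corresponds to the ipam.mode value; a minimal sketch of the relevant setting (equivalent to passing --set ipam.mode=kubernetes):
# Cilium Helm values: use Kubernetes Host Scope IPAM instead of the default Cluster Scope
ipam:
  mode: kubernetes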
Calico does not use the CIDRs assigned by kube-controller-manager but instead assigns blocks of /28 dynamically, which makes it unsuitable for use with minilb.
Reference the example HA deployment leveraging bjw-s' app-template. You must run only a single replica with -controller=true, but can otherwise run as many replicas as you like. Your network should then be configured to use minilb as the resolver for the .minilb (or any other chosen) domain. The suggested way to do this is to expose minilb itself as a NodePort (see the sketch below), after which you can use type=LoadBalancer for everything else.
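A rough sketch of that NodePort exposure; the namespace, label and nodePort below are assumptions rather than anything minilb prescribes, so match them to your actual deployment:
apiVersion: v1
kind: Service
metadata:
  name: minilb
  namespace: minilb                  # assumed namespace; use wherever minilb runs
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: minilb   # assumed pod label; match your deployment
  ports:
    - name: dns-udp
      protocol: UDP
      port: 53                       # minilb's DNS listener (see the ":53" startup log above)
      targetPort: 53
      nodePort: 30053                # example port; point your router's resolver at <node-ip>:30053
Your router then forwards queries for the .minilb domain to any node IP on that NodePort; with dnsmasq this can be a one-liner such as server=/minilb/192.168.1.30#30053.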
By far the biggest limitation is that, because we completely bypass the service IP and kube-proxy, the service port to targetPort mapping is also bypassed. This means your containers need to listen on the same ports you want to reach them by. Traditionally this was a problem for ports below 1024, which required root, but since Kubernetes 1.22 this is easily handled with a sysctl:
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
      # allow unprivileged processes in this pod to bind ports >= 80
      - name: net.ipv4.ip_unprivileged_port_start
        value: "80"
  containers:
    - name: web
      image: nginx   # example container listening on port 80
There are a few other things which you should consider:
- Users need to respect the short TTLs of the minilb responses: some apps do DNS lookups only once and cache the results indefinitely.
No, it's still very new and experimental, but you may use it for small setups such as in your homelab.