Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ECMP for unsafe_routes #1332

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

dioss-Machiel
Copy link

@dioss-Machiel dioss-Machiel commented Feb 19, 2025

This PR implements ECMP support in Nebula. The implementation uses hash-threshold mapping (like in the Linux kernel) which allows you to define weights per gateway.

This can be used for example to aggregate multiple links and provide redundancy to an external location where nebula cannot be installed.

The change is backwards compatible and should not impact normal operation.
ECMP routes can be defined in the config file (example below) and also via the use_system_route_table feature, which means that on Linux you can use a BGP daemon to add / remove multipath routes.

Example config:

tun:
  unsafe_routes:
	# Backwards compatibility
	- route: 192.168.86.0/24
	  via: 192.168.100.10
	
	# Single gateway, weight optional
	- route: 192.168.87.0/24
	  via: 
		- gateway: 10.0.0.1

	# Multipath, weight optional (defaults to 1), will balance equally
	- route: 192.168.87.0/24
	  via: 
		- gateway: 10.0.0.1
		- gateway: 10.0.0.2

	# Multipath, weight defined, will balance according to weights
	- route: 192.168.87.0/24
	  via: 
		- gateway: 10.0.0.1
		  weight: 10
		  
		- gateway: 10.0.0.2
		  weight: 5

Implementation note: if the gateway where the packet should be routed to is not reachable the first available gateway is chosen, so when one of the gateways is down the traffic is no longer properly being balanced, but there is still connectivity.

Copy link

Thanks for the contribution! Before we can merge this, we need @dioss-Machiel to sign the Salesforce Inc. Contributor License Agreement.

@johnmaguire
Copy link
Collaborator

johnmaguire commented Mar 7, 2025

I think this PR achieves the goals of #353 It doesn't look like this PR supports failover.

@johnmaguire johnmaguire linked an issue Mar 7, 2025 that may be closed by this pull request
@johnmaguire
Copy link
Collaborator

Hi @dioss-Machiel, this looks very interesting. One note:

This can be used for example to aggregate multiple links and provide redundancy to an external location where nebula cannot be installed.

It does look like this PR may offer effective load balancing between gateways, but I'm not sure if it provides redundancy. If a given gateway is down, will this code detect it, and route on the other gateway instead?

@dioss-Machiel
Copy link
Author

dioss-Machiel commented Mar 10, 2025

This implementation does indeed provide redundancy in the following network setup:

image

How this technically works:
For each gateway that serves a route a bucket is calculated, according to the weight.
A packet that needs to be routed is hashed, and the gateway is determined by the bucket it falls in.
If that gateway is not reachable then the entire list of possible gateways for that route is tried until exhausted.
This means when one of the gateways is down the traffic is no longer properly being balanced if you have more than two gateways, but there is still connectivity.

For example: if you have three gateways with weight "1" then each of them handles 1/3 of the traffic.
If a gateway goes down then one will handle 2/3 and another will handle 1/3

If you only have two gateways then each of them handles 1/2 of the traffic, if one goes down the other gateway will handle all the traffic.

This is the link to the relevant code in this PR: (inside.go, line 187)
https://github.com/slackhq/nebula/pull/1332/files#diff-47da64e8ca988394a12810c9ba96e3e170b9be19e0a51049b74ed2a27d4a4addR187

A more advanced implementation could recalculate the buckets when nodes go up or down to keep balancing the traffic on the remaining gateways but that comes with more complexity and maybe a slight penalty to routing speed.

@nbrownus
Copy link
Collaborator

@dioss-Machiel we merged a pretty large change to support ipv6 on the overlay, mind rebasing to clear up the conflicts?

@wadey wadey added this to the v2.0.0 milestone Mar 11, 2025
Still always use the first route found,
this should not change any routing behaviour in nebula.
Prefer first route found, if gatway unavailable then
keep trying untill all options are exhausted.
WIP Multipath is working but routing table updates are still broken
@dioss-Machiel
Copy link
Author

@nbrownus Conflicts have been resolved

@dioss-Machiel
Copy link
Author

I fixed the testify errors

@nbrownus
Copy link
Collaborator

nbrownus commented Mar 13, 2025

Did a quick benchmark, xxh3 has a really nice distribution but it's quite slow compared to some other options.

BenchmarkHashTest/cespare/xxhash/v2-10         	74028885	        15.99 ns/op	       0 B/op	       0 allocs/op
BenchmarkHashTest/maphash-10                   	65325824	        18.25 ns/op	       0 B/op	       0 allocs/op
BenchmarkHashTest/zeebo/xxh3-10                	27690496	        43.31 ns/op	       0 B/op	       0 allocs/op
BenchmarkHashTest/OneOfOne/xxhash-10           	79155019	        15.29 ns/op	       0 B/op	       0 allocs/op

This benchmark includes 2 other xxhash implementations but they have slightly worse distribution while being much faster than xxh3.

I included maphash since its baked into golang, it has inherent randomness though so the result of the balance tests can change between runs.

As a reference point, the firewall takes ~40ns/op on my machine to pass a packet on a group or name match. Worst case performance in the firewall is ~95ns/op to evaluate and fail a full rule evaluation.

If we went with OneOfOne/xxhash or cespare/xxhash/v2 the current tests would fail:

--- FAIL: TestPacketsAreBalancedEqually (0.02s)
    balance_test.go:60: 
        	Error Trace:	routing/balance_test.go:60
        	Error:      	Max difference between 21845 and 21651 allowed is 100, but difference was 194
        	Test:       	TestPacketsAreBalancedEqually
        	Messages:   	Expected 21845 +/- 100, but got 21651
    balance_test.go:61: 
        	Error Trace:	routing/balance_test.go:61
        	Error:      	Max difference between 21845 and 22042 allowed is 100, but difference was -197
        	Test:       	TestPacketsAreBalancedEqually
        	Messages:   	Expected 21845 +/- 100, but got 21651

Which doesn't seem too terrible to me.

We can take this a bit further as well by removing the local and remote addr, cespare/xxhash turns into 7.5ns/op.

If we can drop down to just evaluating the local and remote ports then we can use a much faster hash described here (taken from #768) which has really nice distribution (does not require modifying your tests) and incredible speed, 0.3117 ns/op on my machine.

func hashPacket(p *firewall.Packet) int {
	x := (uint32(p.LocalPort) << 16) | uint32(p.RemotePort)
	x ^= x >> 16
	x *= 0x21f0aaad
	x ^= x >> 15
	x *= 0xd35a2d97
	x ^= x >> 15

	return int(x) & 0x7FFFFFFF
}

overlay/route.go Outdated
}

if gatewayWeight < 1 || gatewayWeight > math.MaxInt32 {
return nil, fmt.Errorf("entry .weight in tun.unsafe_routes[%v].via[%v] is not in range (1-%d) : %v", i+1, ig+1, math.MaxInt32, metric)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metric should be gatewayWeight

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed!, Fixed now

t.l.WithField("route", r).Debug("Ignoring route update, not in our network")
} else {
// p.Hops+1 = weight of the route
gateways = append(gateways, routing.NewGateway(gwAddr, p.Hops+1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When doing sudo ip route add 3.3.3.4/32 nexthop via 192.168.5.13 weight 5 nexthop via 192.168.5.14 weight 10 I do see p.Weight set correctly on my machine, is there a reason to use r.Hops+1 instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be overlooking something but I don't see a Weight field on NexthopInfo in netlink v1.3.0, there is only a "Weight" in the String() method for NexthopInfo

https://github.com/vishvananda/netlink/blob/6f5713947556a0288c5cb71f036f9e91924ebcaa/route.go#L169

type NexthopInfo struct {
	LinkIndex int
	Hops      int
	Gw        net.IP
	Flags     int
	NewDst    Destination
	Encap     Encap
	Via       Destination
}

func (n *NexthopInfo) String() string {
	elems := []string{}
	elems = append(elems, fmt.Sprintf("Ifindex: %d", n.LinkIndex))
	if n.NewDst != nil {
		elems = append(elems, fmt.Sprintf("NewDst: %s", n.NewDst))
	}
	if n.Encap != nil {
		elems = append(elems, fmt.Sprintf("Encap: %s", n.Encap))
	}
	if n.Via != nil {
		elems = append(elems, fmt.Sprintf("Via: %s", n.Via))
	}
	elems = append(elems, fmt.Sprintf("Weight: %d", n.Hops+1))
	elems = append(elems, fmt.Sprintf("Gw: %s", n.Gw))
	elems = append(elems, fmt.Sprintf("Flags: %s", n.ListFlags()))
	return fmt.Sprintf("{%s}", strings.Join(elems, " "))
}

@@ -589,12 +636,12 @@ func (t *tun) updateRoutes(r netlink.RouteUpdate) {
newTree := t.routeTree.Load().Clone()

if r.Type == unix.RTM_NEWROUTE {
t.l.WithField("destination", r.Dst).WithField("via", r.Gw).Info("Adding route")
newTree.Insert(dst, gwAddr)
t.l.WithField("destination", dst).WithField("via", gateways).Info("Adding route")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to transform gateways to a human readable string here and below.

Currently this is logging:

via="[{{{0 281473913980173} {0xc00014a150}} 5 715827882} {{{0 281473913980174} {0xc00014a150}} 10 2147483647}]"

I haven't audited all log sites for []routing.Gateway yet but it would probably be best to deal with a concrete type Gateways than the slice to avoid future foot guns when trying to log this structure.

Doing the easy thing and casting for logging is a bit easier but the end implementation looks mostly the same

type viaGateways []routing.Gateway
func (g viaGateways) String() string {
	str := ""
	for i, gw := range g {
		str += gw.String()
		if i < len(g)-1 {
			str += ", "
		}
	}
	return str
}

...

    t.l.WithField("destination", dst).WithField("via", viaGateways(gateways)).Info("Adding route")

This gives a slightly better log line:

via="192.168.5.13:5/715827882, 192.168.5.14:10/2147483647"

To which I think we should adjust the Gateway.String() to look something along the lines of

func (g *Gateway) String() string {
	return fmt.Sprintf("%s weight %d", g.addr, g.weight)
}

I would like to avoid changing via from a string to an [] when logging in json to avoid the pitfalls of changing the field type for systems like elasticsearch.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the idea, does this look like a reasonable log line to you, or does it look to much like json?

level=info msg="Adding route" destination=10.152.5.22/32 via="{addr: 10.159.0.1, weight: 1}, {addr: 10.159.0.2, weight: 1}"

@dioss-Machiel
Copy link
Author

I'm not married to the hashing algorithm, I did some quick research and other L4 balancing implementations also only hash by source and destination port. Using the hash-prospector implementation looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add unsafe_route metric / hop count
4 participants