Skip to content

Spring Cleaning#648

Open
taspelund wants to merge 32 commits intomainfrom
trey/random-fix-fun-time
Open

Spring Cleaning#648
taspelund wants to merge 32 commits intomainfrom
trey/random-fix-fun-time

Conversation

@taspelund
Copy link
Contributor

@taspelund taspelund commented Feb 27, 2026

Each commit has a meaningful scope and message, and can be read in order when reviewing to help chunk up the PR into digestible chunks. Summary of the changes below.

BGP Protocol Correctness

  • treat-as-withdraw bug fixes — Updates flagged treat-as-withdraw by the parser were never actually withdrawn; fixed along with enforce-first-AS scoping (eBGP only) and FSM-layer error handling
  • AFI/SAFI negotiation — Send NOTIFICATION when no common address families exist; legacy peers without MP-BGP are treated as IPv4 Unicast only
  • Validate OPEN in Established — Handle OPEN messages received while in Established state (deterministic collision resolution)
  • Disallow wildcard match arms in FSM — Clippy lints ensure exhaustive matching so future events can't be silently dropped
  • Skip RIB walks for non-negotiated AFIs — Avoid locking/walking RIBs for address families the peer doesn't support
  • BGP Bestpath is updated to better align with RFC 4271 (BGP bestpath algorithm should more closely follow RFC 4271 #527): adds comparison of origin path attribute and internal vs external (ibgp vs ebgp), adds checks after each step to see if there is only a single path left (grab single bestpath early if present), adds an explicit multipath check before progressing into new tie-breakers (router-id + peer-ip)
  • Stop unconditionally filtering locally originated prefixes from received advertisements/withdrawals (bgp: consider removing unconditional rx filtering of originated routes #583)

QoS & Socket Management

  • DSCP / Traffic Class support — Configurable DSCP marking for BGP packets (fixes BGP should set DSCP bits #565), with graceful in-flight socket updates (no session reset)
  • Fix IPv6 TTL/hop-limit sockopts — Use IPV6_UNICAST_HOPS (not IP_TTL) and IPV6_MINHOPCOUNT (not IPV6_HOPLIMIT) for IPv6 peers (fixes TTL settings are broken for IPv6 peers #647)
  • Default TTL of 255 — Explicit BGP_DEFAULT_TTL = 255 instead of relying on kernel defaults
  • TCP_NODELAY — Set on all BGP TCP sessions

Observability

  • Per-AFI prefix counters, last reset reason/timestamp, last NOTIFICATION tracking
  • Message history split — Major vs all buffers (keepalives stay in "all" only)
  • More FSM events classified as major

RIB & Static Routing

Graceful Config Changes

Other

Issues Fixed

#527, #528, #565, #567, #581, #583, #614, #620, #647

Three tests failed under cargo test (thread-per-test) but passed
under cargo nextest (process-per-test) due to shared global state
in the channel-based test network simulation:

1. unnumbered_peering_helper() leaked dispatcher threads by not
   returning dispatcher handles to callers. Leaked threads would
   periodically re-bind to addresses in the global NET endpoint
   map, stealing entries from concurrently running tests.

2. The simulated Network had no unbind mechanism, so endpoint
   entries persisted after a Listener was dropped, leaving stale
   state for subsequent tests.

3. Several tests used hardcoded scope_ids (2, 3) for link-local
   addresses, causing collisions in the shared NET endpoint map
   when tests ran as concurrent threads in the same process.

Fix by:
- Adding Network::unbind() and Listener::Drop to clean up NET
  endpoint entries when a listener goes out of scope
- Returning dispatchers from unnumbered_peering_helper() so all
  callers can shut them down during test cleanup
- Switching hardcoded scope_ids to next_scope_id() for unique
  per-test addresses, matching the newer test infrastructure
This adds logic to send_update() to be able to split a large amount of
prefixes that would cause the Update to exceed the max BGP packet size.
Now, if we have more originated prefixes than can fit into a single
Update, we will split the prefixes into chunks that fit the remaining
available space (max size - header - fixed length fields - attributes).

This also updates the basic_update* tests to exchange and validate 4k
routes instead of just 1. This not only exercises the Update chunking,
but also recreates the conditions where #634 was observed.

Fixes: #614
Allow static routes to use cross-address-family next-hops
(e.g. IPv4 prefix via IPv6 next-hop and vice versa).

Introduce API version 7 (EXTENDED_NH_STATIC) with StaticRoute4 and
StaticRoute6 types that accept IpAddr for the nexthop field
instead of the AF-specific Ipv4Addr/Ipv6Addr. The previous
types are retired as V1 variants with version-gated endpoints.

The internal layers (StaticRouteKey, Path, RIB) already handle
cross-AF next-hops; this closes the gap at the API boundary.

Closes: #581
When stdout is piped into a program that closes early (e.g.
`mgadm bgp status neighbors 47 | head`), Rust's println!
macro panics on the resulting EPIPE, causing a SIGABRT.

Add println_nopipe!/print_nopipe! macros to mg-common that
silently exit on BrokenPipe instead of panicking. Use them
throughout mgadm's non-test print calls. Also catch EPIPE
errors propagated via ? from TabWriter writeln/flush paths
at both mgadm and ddmadm main() boundaries.

Fixes #567
The Ord impl we have for Path was only checking nexthop_interface and
vlan_id for an identity check which returns Ordering::Equal if bgp is
None and the (nexthop, nexthop_interface, vlan_id) are the same.
However, when all these fields are not the same that impl moved onto a
tie-breaker which compared the nexthop before moving onto the bgp cmp,
without ever doing a cmp for nexthop_interface or vlan_id. This meant
that any two Paths with the same nexthop but different nexthop_interface
and/or vlan_id would be treated as identical paths.

BTreeSet relies on this Ord impl to determine whether an insertion
should replace or add an element. The above issue in the Ord impl would
result in elements being replaced instead of added, which was observed
in #528 when we static routes are pulled out of the persistent sled DB
and into a BTreeSet before being returned to the API client. The impact
of this is not limited to just `mgadm get-v{4,6}-routes` returning
misleading information, but also missing static routes if mgd were to
restart.

However, this bug does not affect ECMP paths with separate next-hops
since the nexthop field was compared as part of the tie-breaker when
the composite identity check did not match.

Fixes: #528
Add handling for changes to enforce-first-as that doesn't involve a hard
reset of the session. When enabled, walk the RIB and remove paths from
the peer that fail the first-as test. When disabled, send a
route-refresh to the peer so we can re-learn any routes we may have
filtered out.

Fixes: #620
Consolidate paired address-family-specific methods into generic
or AF-parameterized equivalents throughout the RIB database layer.
Origin methods now use the Prefix enum for consistency with existing
static routing methods. Rename BGP removal methods for clarity and fix
over-approximate PCN notifications.
There's a lot of code duplication across {create,get,set}_origin{4,6}
methods, which can be simplified to just use Prefix + AddressFamily.
This does exactly that -- unifies each method to just a single version
that accepts Prefix (AF-agnostic) and uses AddressFamily to distinguish
what should be installed where.
When an Update was marked treat-as-withdraw as a result of an error
(either by the parser or by checks done by the FSM), it was never read
or handled by the FSM. So no treat-as-withdraw error would ever actually
result in routes being withdrawn... even worse, routes that should have
been withdrawn due to errors caught by the parser would get installed if
there wasn't a duplicated check in the FSM.

There were also a couple other bugs that needed fixing too:
1) EnforceFirstAs was unconditionally applied to all peers, but should
   only be applied to eBGP peers.
2) FSM layer failures in check_update() would result in an early return
   rather than a withdrawal, which could lead to routes staying
   installed indefinitely with outdated attributes.

For example, we would see issues as a result of (2) in this scenario:
  A) We receive an UPDATE for Route X with valid attrs
  B) We receive an UPDATE for Route X with invalid attrs

If we return early after (B) then the second UPDATE is ignored, leaving
us with Route X installed with the earlier (now outdated) valid attrs.
The correct response to (B) is to withdraw Route X. Let's not stick our
fingers in our ears and pretend the second UPDATE never happened.

EnforceFirstAs was also explicitly checking for a missing AS_PATH path
attribute, however this was already checked by the parser as a protocol
level error (mandatory attribute missing == treat-as-withdraw). So this
check has been removed, since we now catch treat-as-withdraw flagged by
the parser.
There were multiple places where we kicked off a RIB walker as a result
of an event pertaining to a BGP peer which unconditionally locked and
walked the RIBs for both IPv4 and IPv6. Update these spots to only walk
the RIBs for address-families that have been negotiated with the peer.
Add logic to send a NOTIFICATION if we can't find common ground with our
BGP peer. If the peer sends us MP capabilities, then we treat the AFIs
they send us as an exhaustive list of what they're willing to exchange.
If the peer doesn't send us MP capabilities, then we treat them as a
legacy peer that supports only IPv4 Unicast and only with traditional
encoding. If we haven't enabled IPv4 Unicast and our peer doesn't
support MP-BGP, then we do not have any AFIs in common and we'll reset
the session.

Also adds a basic integration test to ensure that two peers with non
overlapping AFI/SAFIs never move into Established.
Adds a missing call to handle_open() in fsm_established(). This covers
the case where we've enabled deterministic_collision_resolution and we
get an OPEN while we're in Established. The OPEN still needs to be
validated here too.
Updates is_major_event() to categorize more FSM events as major so they
don't live just in the "all" buffer. Now the only events that are not
major are Keepalives, ConnectRetryTimerExpires, IdleHoldTimerExpires,
and KeepaliveTimerExpires. Every other message or event is significant
since they're either low frequency or high importance (triggering an
FSM state change, changing a connection state, updating the RIB, etc.)
Plumb major/all buffers through the BGP Message History, mirroring the
major/all buffer split of the BGP FSM History. In this case, the only
message type that doesn't live in the "major" buffer is keepalives.
Expands the advertised/imported prefix counters to be per AFI.
Adds logic to track the last reset timestamp/reason.
Adds logic to track the last sent/received NOTIFICATION and timestamp.
Adds unit tests to validate the prefix counters and the last reset and
last NOTIFICATION.
Adds clippy lints to deny wildcard match arms within the BGP FSM. This
ensures that any future work will always perform an exhaustive match so
there's no chance of an FSM event handler getting missed. The existing
match arms were already exhaustive in almost all cases, but this adds an
extra level of confidence that it won't change in the future.
Adds a configurable DSCP value for IP packets carrying BGP sessions.

Fixes: #565
Adds logic to update the DSCP and TTL values on TCP sockets in-flight
for all active connections of a SessionRunner. This means that we update
the sockets and allow BGP to manage the aftermath.
e.g.
We won't explicitly close the connection when setting the TTL to 1, but
that doesn't mean the connection isn't doomed. The connection may drop
due to the practical implications of changing the TTL on the socket.

Manual testing confirms (via snoop) that the QoS marking is properly
updated (added/removed/changed) along with the BGP neighbor config,
without causing a reset of the session.
The outgoing hop-limit and incoming min-hop filter for IPv6 BGP
peers were broken in two ways:

1. apply_min_ttl() called TcpStream::set_ttl() unconditionally,
   which sets IP_TTL — an IPv4-only option. For IPv6 sockets this
   is silently ignored. Branch on address family and use
   setsockopt(IPV6_UNICAST_HOPS) for IPv6 peers.

2. set_ip_minttl() used IPV6_HOPLIMIT as the IPv6 equivalent of
   IP_MINTTL. IPV6_HOPLIMIT is a recvmsg ancillary-data option
   that delivers the received hop-limit as a cmsg — it does not
   filter on it. Replace with IPV6_MINHOPCOUNT, which is the
   actual IPv6 counterpart to IP_MINTTL.

Define IPV6_MINHOPCOUNT as a local constant (value 73 for Linux, 47 for
Illumos) since it is not exposed by the libc crate for either platform.

Fixes: #647
Instead of asking the kernel to reset the TTL back to its own default,
let's just manage this explicitly ourselves. This introduces a const
BGP_DEFAULT_TTL which is set to 255 -- we are not a 40 year old routing
platform that needs to have 100 different nerd knobs to make TTL work
in a sane fashion, so let's avoid all the hubbub with making things
different for eBGP/iBGP and just set our TTL to 255 from the get-go.
@taspelund taspelund self-assigned this Feb 27, 2026
@taspelund taspelund added Bug bgp Border Gateway Protocol labels Feb 27, 2026
@taspelund taspelund added mgd Maghemite daemon customer For any bug reports or feature requests tied to customer requests static Static Routing rust Pull requests that update rust code labels Feb 27, 2026
@taspelund
Copy link
Contributor Author

I do not intend for this PR to be considered for R19, but I did want to get it out for review so I can start working on feedback as it comes.

Stop flattening BgpParameters in Neighbor/UnnumberedNeighbor structs.
This allows client to de-duplicate all the common parameters and only
implement logic for it once.

Fixes: #635
Extend BgpPathProperties with origin, peer_ip, and internal
fields and rewrite the bestpath algorithm to follow the complete
RFC 4271 Section 9.1.2.2 decision process: origin preference,
per-AS MED, eBGP over iBGP, multipath decision point, router-id,
and peer-ip tie-breaking.

Move PathOrigin from bgp::messages to rdb-types so it can be
shared across the routing stack. Add BgpPathPropertiesV2/PathV2
for API backward compatibility and new v3 RIB endpoints that
expose the additional fields.

Add comprehensive bestpath tests covering each tie-breaking step
individually and a full-pipeline integration test.

Fixes: #527
Adds a display mode for mgadm rib status imported/selected, which allows
mgadm users to choose whether they want the output in a detailed or
summarized (default) format.
Instead of fetching the full RIB and filtering client-side, add a
prefix query parameter to the RIB imported/selected status endpoints.
This allows a client to query a specific prefix from the RIB instead of
unconditionally pulling the full (potentially filtered) RIB every time.
Removes the filtering of originated routes from a received UPDATE.

Scenario:
1. Peer advertises us a route that we are originating and we discard it.
2. We stop advertising that route.
3. Us withdrawing our advertisement doesn't cause the peer to choose a
   different bestpath for that route.
4. We no longer have a path for that route from this peer.

This could also be addressed by:
A) Sending a route-refresh to the peer when we stop originating routes.
B) Inserting a Local path into the RIB for routes we are originating
   which is preferred over BGP paths learned from peers during bestpath.

(A) is not a good idea because we don't know how large the peer's RIB
will be and we'd have to re-process their entire RIB every time we stop
advertising even a single route.

(B) is fine conceptually, as most routing stacks use this approach:
When using a "network" statement to pull routes from other protocols
into the BGP topology or creating an aggregate, a locally originated
path generally goes into the BGP LOC-RIB that is preferred over any
peer-learned path. If an existing routing table entry doesn't exist,
then this generally creates a blackhole route for the originated prefix
(in order to prevent routing loops from occurring as a result of missing
entries for the component routes).
In the case of maghemite, well, really in the case of dendrite, this
routing loop prevention is done without using blackhole routes: packets
arriving on a front panel port that enocounter a NAT "miss" are dropped
rather than submitted to an LPM lookup in the External routing table
(avoiding us becoming a transit router + hairpinning packets sent to us
from outside the rack). All that to say: we don't need local routes for
any kind of forwarding (or blackholing) data plane behavior.

This brings us to the crux of the issue: Inter-VPC routing.
Since we don't currently have a native inter-VPC data plane within the
rack, then we need to have installed an External route that covers our
originated prefix regardless of what mask is used. If we do an exact
match check and filtering, then we just make it harder for our peer to
advertise us a route we need. We effectively require the peer to
advertise us reachability to our own prefix, so long as the mask doesn't
match what we're using (e.g. if we send a /24, they have to send us any
mask other than a /24 that covers the original IP range).

So while removing this filter doesn't address the issues inherent to
us needing the peer to provide us a route back to our own address space,
it does make it simpler for us to learn that route (no longer needing
a route with anything but our own mask) and to avoid extra logic to
ensure we can recover routes that we filtered.

Fixes: #583
Adds a justfile for running simple build, test and verification commands
with some logic in place for platform-specific exclusions. Updates dev
docs to mention justfile for folks who want to use it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bgp Border Gateway Protocol Bug customer For any bug reports or feature requests tied to customer requests mgd Maghemite daemon rust Pull requests that update rust code static Static Routing

Projects

None yet

1 participant