Spring Cleaning by taspelund · Pull Request #648 · oxidecomputer/maghemite

taspelund · 2026-02-27T03:42:48Z

Each commit has a meaningful scope and message, and can be read in order when reviewing to help chunk up the PR into digestible chunks. Summary of the changes below.

BGP Protocol Correctness

treat-as-withdraw bug fixes — Updates flagged treat-as-withdraw by the parser were never actually withdrawn; fixed along with enforce-first-AS scoping (eBGP only) and FSM-layer error handling
AFI/SAFI negotiation — Send NOTIFICATION when no common address families exist; legacy peers without MP-BGP are treated as IPv4 Unicast only
Validate OPEN in Established — Handle OPEN messages received while in Established state (deterministic collision resolution)
Disallow wildcard match arms in FSM — Clippy lints ensure exhaustive matching so future events can't be silently dropped
Skip RIB walks for non-negotiated AFIs — Avoid locking/walking RIBs for address families the peer doesn't support
BGP Bestpath is updated to better align with RFC 4271 (BGP bestpath algorithm should more closely follow RFC 4271 #527): adds comparison of origin path attribute and internal vs external (ibgp vs ebgp), adds checks after each step to see if there is only a single path left (grab single bestpath early if present), adds an explicit multipath check before progressing into new tie-breakers (router-id + peer-ip)
Stop unconditionally filtering locally originated prefixes from received advertisements/withdrawals (bgp: consider removing unconditional rx filtering of originated routes #583)

QoS & Socket Management

DSCP / Traffic Class support — Configurable DSCP marking for BGP packets (fixes BGP should set DSCP bits #565), with graceful in-flight socket updates (no session reset)
Fix IPv6 TTL/hop-limit sockopts — Use IPV6_UNICAST_HOPS (not IP_TTL) and IPV6_MINHOPCOUNT (not IPV6_HOPLIMIT) for IPv6 peers (fixes TTL settings are broken for IPv6 peers #647)
Default TTL of 255 — Explicit BGP_DEFAULT_TTL = 255 instead of relying on kernel defaults
TCP_NODELAY — Set on all BGP TCP sessions

Observability

Per-AFI prefix counters, last reset reason/timestamp, last NOTIFICATION tracking
Message history split — Major vs all buffers (keepalives stay in "all" only)
More FSM events classified as major

RIB & Static Routing

RDB deduplication — Consolidated v4/v6 dispatch into AF-parameterized methods
AF-agnostic origin methods — Unified create/get/set_origin{4,6} into single Prefix-based methods
Fix Ord for static paths — Paths with same nexthop but different nexthop_interface/vlan_id were incorrectly treated as identical (fixes Static route funkiness when adding the same route via the same next-hop but with different vlans #528)
Extended next-hop support — Cross-AF static route next-hops via API v7 (fixes static: want extended next-hop support #581)
Plumb nexthop_interface through static routing

Graceful Config Changes

enforce-first-AS — Walk RIB and withdraw non-compliant paths on enable; send route-refresh on disable (fixes bgp: making enforce_first_as config handler more graceful #620)
DSCP/TTL updates — Modify sockets in-flight without resetting sessions

Other

Update chunking — Split originated BGP Updates that exceed max packet size (fixes bgp: need handling for originated updates that exceed the max update size #614)
Unnumbered peers — Ensure at least one AFI is enabled
Broken pipe handling — println_nopipe! macro for mgadm/ddmadm (fixes mgadm aborted with rust panic #567)
Test isolation fixes — Proper cleanup for channel-based tests under cargo test
Dev docs added

Issues Fixed

#527, #528, #565, #567, #581, #583, #614, #620, #647

Three tests failed under cargo test (thread-per-test) but passed under cargo nextest (process-per-test) due to shared global state in the channel-based test network simulation: 1. unnumbered_peering_helper() leaked dispatcher threads by not returning dispatcher handles to callers. Leaked threads would periodically re-bind to addresses in the global NET endpoint map, stealing entries from concurrently running tests. 2. The simulated Network had no unbind mechanism, so endpoint entries persisted after a Listener was dropped, leaving stale state for subsequent tests. 3. Several tests used hardcoded scope_ids (2, 3) for link-local addresses, causing collisions in the shared NET endpoint map when tests ran as concurrent threads in the same process. Fix by: - Adding Network::unbind() and Listener::Drop to clean up NET endpoint entries when a listener goes out of scope - Returning dispatchers from unnumbered_peering_helper() so all callers can shut them down during test cleanup - Switching hardcoded scope_ids to next_scope_id() for unique per-test addresses, matching the newer test infrastructure

This adds logic to send_update() to be able to split a large amount of prefixes that would cause the Update to exceed the max BGP packet size. Now, if we have more originated prefixes than can fit into a single Update, we will split the prefixes into chunks that fit the remaining available space (max size - header - fixed length fields - attributes). This also updates the basic_update* tests to exchange and validate 4k routes instead of just 1. This not only exercises the Update chunking, but also recreates the conditions where #634 was observed. Fixes: #614

Allow static routes to use cross-address-family next-hops (e.g. IPv4 prefix via IPv6 next-hop and vice versa). Introduce API version 7 (EXTENDED_NH_STATIC) with StaticRoute4 and StaticRoute6 types that accept IpAddr for the nexthop field instead of the AF-specific Ipv4Addr/Ipv6Addr. The previous types are retired as V1 variants with version-gated endpoints. The internal layers (StaticRouteKey, Path, RIB) already handle cross-AF next-hops; this closes the gap at the API boundary. Closes: #581

When stdout is piped into a program that closes early (e.g. `mgadm bgp status neighbors 47 | head`), Rust's println! macro panics on the resulting EPIPE, causing a SIGABRT. Add println_nopipe!/print_nopipe! macros to mg-common that silently exit on BrokenPipe instead of panicking. Use them throughout mgadm's non-test print calls. Also catch EPIPE errors propagated via ? from TabWriter writeln/flush paths at both mgadm and ddmadm main() boundaries. Fixes #567

The Ord impl we have for Path was only checking nexthop_interface and vlan_id for an identity check which returns Ordering::Equal if bgp is None and the (nexthop, nexthop_interface, vlan_id) are the same. However, when all these fields are not the same that impl moved onto a tie-breaker which compared the nexthop before moving onto the bgp cmp, without ever doing a cmp for nexthop_interface or vlan_id. This meant that any two Paths with the same nexthop but different nexthop_interface and/or vlan_id would be treated as identical paths. BTreeSet relies on this Ord impl to determine whether an insertion should replace or add an element. The above issue in the Ord impl would result in elements being replaced instead of added, which was observed in #528 when we static routes are pulled out of the persistent sled DB and into a BTreeSet before being returned to the API client. The impact of this is not limited to just `mgadm get-v{4,6}-routes` returning misleading information, but also missing static routes if mgd were to restart. However, this bug does not affect ECMP paths with separate next-hops since the nexthop field was compared as part of the tie-breaker when the composite identity check did not match. Fixes: #528

Add handling for changes to enforce-first-as that doesn't involve a hard reset of the session. When enabled, walk the RIB and remove paths from the peer that fail the first-as test. When disabled, send a route-refresh to the peer so we can re-learn any routes we may have filtered out. Fixes: #620

Consolidate paired address-family-specific methods into generic or AF-parameterized equivalents throughout the RIB database layer. Origin methods now use the Prefix enum for consistency with existing static routing methods. Rename BGP removal methods for clarity and fix over-approximate PCN notifications.

There's a lot of code duplication across {create,get,set}_origin{4,6} methods, which can be simplified to just use Prefix + AddressFamily. This does exactly that -- unifies each method to just a single version that accepts Prefix (AF-agnostic) and uses AddressFamily to distinguish what should be installed where.

When an Update was marked treat-as-withdraw as a result of an error (either by the parser or by checks done by the FSM), it was never read or handled by the FSM. So no treat-as-withdraw error would ever actually result in routes being withdrawn... even worse, routes that should have been withdrawn due to errors caught by the parser would get installed if there wasn't a duplicated check in the FSM. There were also a couple other bugs that needed fixing too: 1) EnforceFirstAs was unconditionally applied to all peers, but should only be applied to eBGP peers. 2) FSM layer failures in check_update() would result in an early return rather than a withdrawal, which could lead to routes staying installed indefinitely with outdated attributes. For example, we would see issues as a result of (2) in this scenario: A) We receive an UPDATE for Route X with valid attrs B) We receive an UPDATE for Route X with invalid attrs If we return early after (B) then the second UPDATE is ignored, leaving us with Route X installed with the earlier (now outdated) valid attrs. The correct response to (B) is to withdraw Route X. Let's not stick our fingers in our ears and pretend the second UPDATE never happened. EnforceFirstAs was also explicitly checking for a missing AS_PATH path attribute, however this was already checked by the parser as a protocol level error (mandatory attribute missing == treat-as-withdraw). So this check has been removed, since we now catch treat-as-withdraw flagged by the parser.

There were multiple places where we kicked off a RIB walker as a result of an event pertaining to a BGP peer which unconditionally locked and walked the RIBs for both IPv4 and IPv6. Update these spots to only walk the RIBs for address-families that have been negotiated with the peer.

Add logic to send a NOTIFICATION if we can't find common ground with our BGP peer. If the peer sends us MP capabilities, then we treat the AFIs they send us as an exhaustive list of what they're willing to exchange. If the peer doesn't send us MP capabilities, then we treat them as a legacy peer that supports only IPv4 Unicast and only with traditional encoding. If we haven't enabled IPv4 Unicast and our peer doesn't support MP-BGP, then we do not have any AFIs in common and we'll reset the session. Also adds a basic integration test to ensure that two peers with non overlapping AFI/SAFIs never move into Established.

Adds a missing call to handle_open() in fsm_established(). This covers the case where we've enabled deterministic_collision_resolution and we get an OPEN while we're in Established. The OPEN still needs to be validated here too.

Updates is_major_event() to categorize more FSM events as major so they don't live just in the "all" buffer. Now the only events that are not major are Keepalives, ConnectRetryTimerExpires, IdleHoldTimerExpires, and KeepaliveTimerExpires. Every other message or event is significant since they're either low frequency or high importance (triggering an FSM state change, changing a connection state, updating the RIB, etc.)

Plumb major/all buffers through the BGP Message History, mirroring the major/all buffer split of the BGP FSM History. In this case, the only message type that doesn't live in the "major" buffer is keepalives.

Expands the advertised/imported prefix counters to be per AFI. Adds logic to track the last reset timestamp/reason. Adds logic to track the last sent/received NOTIFICATION and timestamp. Adds unit tests to validate the prefix counters and the last reset and last NOTIFICATION.

Adds clippy lints to deny wildcard match arms within the BGP FSM. This ensures that any future work will always perform an exhaustive match so there's no chance of an FSM event handler getting missed. The existing match arms were already exhaustive in almost all cases, but this adds an extra level of confidence that it won't change in the future.

Adds a configurable DSCP value for IP packets carrying BGP sessions. Fixes: #565

Adds logic to update the DSCP and TTL values on TCP sockets in-flight for all active connections of a SessionRunner. This means that we update the sockets and allow BGP to manage the aftermath. e.g. We won't explicitly close the connection when setting the TTL to 1, but that doesn't mean the connection isn't doomed. The connection may drop due to the practical implications of changing the TTL on the socket. Manual testing confirms (via snoop) that the QoS marking is properly updated (added/removed/changed) along with the BGP neighbor config, without causing a reset of the session.

The outgoing hop-limit and incoming min-hop filter for IPv6 BGP peers were broken in two ways: 1. apply_min_ttl() called TcpStream::set_ttl() unconditionally, which sets IP_TTL — an IPv4-only option. For IPv6 sockets this is silently ignored. Branch on address family and use setsockopt(IPV6_UNICAST_HOPS) for IPv6 peers. 2. set_ip_minttl() used IPV6_HOPLIMIT as the IPv6 equivalent of IP_MINTTL. IPV6_HOPLIMIT is a recvmsg ancillary-data option that delivers the received hop-limit as a cmsg — it does not filter on it. Replace with IPV6_MINHOPCOUNT, which is the actual IPv6 counterpart to IP_MINTTL. Define IPV6_MINHOPCOUNT as a local constant (value 73 for Linux, 47 for Illumos) since it is not exposed by the libc crate for either platform. Fixes: #647

Instead of asking the kernel to reset the TTL back to its own default, let's just manage this explicitly ourselves. This introduces a const BGP_DEFAULT_TTL which is set to 255 -- we are not a 40 year old routing platform that needs to have 100 different nerd knobs to make TTL work in a sane fashion, so let's avoid all the hubbub with making things different for eBGP/iBGP and just set our TTL to 255 from the get-go.

taspelund · 2026-02-27T03:46:43Z

I do not intend for this PR to be considered for R19, but I did want to get it out for review so I can start working on feedback as it comes.

Stop flattening BgpParameters in Neighbor/UnnumberedNeighbor structs. This allows client to de-duplicate all the common parameters and only implement logic for it once. Fixes: #635

Extend BgpPathProperties with origin, peer_ip, and internal fields and rewrite the bestpath algorithm to follow the complete RFC 4271 Section 9.1.2.2 decision process: origin preference, per-AS MED, eBGP over iBGP, multipath decision point, router-id, and peer-ip tie-breaking. Move PathOrigin from bgp::messages to rdb-types so it can be shared across the routing stack. Add BgpPathPropertiesV2/PathV2 for API backward compatibility and new v3 RIB endpoints that expose the additional fields. Add comprehensive bestpath tests covering each tie-breaking step individually and a full-pipeline integration test. Fixes: #527

Adds a display mode for mgadm rib status imported/selected, which allows mgadm users to choose whether they want the output in a detailed or summarized (default) format.

Instead of fetching the full RIB and filtering client-side, add a prefix query parameter to the RIB imported/selected status endpoints. This allows a client to query a specific prefix from the RIB instead of unconditionally pulling the full (potentially filtered) RIB every time.

Removes the filtering of originated routes from a received UPDATE. Scenario: 1. Peer advertises us a route that we are originating and we discard it. 2. We stop advertising that route. 3. Us withdrawing our advertisement doesn't cause the peer to choose a different bestpath for that route. 4. We no longer have a path for that route from this peer. This could also be addressed by: A) Sending a route-refresh to the peer when we stop originating routes. B) Inserting a Local path into the RIB for routes we are originating which is preferred over BGP paths learned from peers during bestpath. (A) is not a good idea because we don't know how large the peer's RIB will be and we'd have to re-process their entire RIB every time we stop advertising even a single route. (B) is fine conceptually, as most routing stacks use this approach: When using a "network" statement to pull routes from other protocols into the BGP topology or creating an aggregate, a locally originated path generally goes into the BGP LOC-RIB that is preferred over any peer-learned path. If an existing routing table entry doesn't exist, then this generally creates a blackhole route for the originated prefix (in order to prevent routing loops from occurring as a result of missing entries for the component routes). In the case of maghemite, well, really in the case of dendrite, this routing loop prevention is done without using blackhole routes: packets arriving on a front panel port that enocounter a NAT "miss" are dropped rather than submitted to an LPM lookup in the External routing table (avoiding us becoming a transit router + hairpinning packets sent to us from outside the rack). All that to say: we don't need local routes for any kind of forwarding (or blackholing) data plane behavior. This brings us to the crux of the issue: Inter-VPC routing. Since we don't currently have a native inter-VPC data plane within the rack, then we need to have installed an External route that covers our originated prefix regardless of what mask is used. If we do an exact match check and filtering, then we just make it harder for our peer to advertise us a route we need. We effectively require the peer to advertise us reachability to our own prefix, so long as the mask doesn't match what we're using (e.g. if we send a /24, they have to send us any mask other than a /24 that covers the original IP range). So while removing this filter doesn't address the issues inherent to us needing the peer to provide us a route back to our own address space, it does make it simpler for us to learn that route (no longer needing a route with anything but our own mask) and to avoid extra logic to ensure we can recover routes that we filtered. Fixes: #583

Adds a justfile for running simple build, test and verification commands with some logic in place for platform-specific exclusions. Updates dev docs to mention justfile for folks who want to use it.

taspelund added 24 commits February 19, 2026 12:01

bgp: set TCP_NODELAY on all TCP sessions

58813e7

Ensure an AFI is enabled for unnumbered peers too

ff9383d

Plumb nexthop_interface through static routing

3850edd

bgp: validate OPEN received in Established

8daf8ab

Adds a missing call to handle_open() in fsm_established(). This covers the case where we've enabled deterministic_collision_resolution and we get an OPEN while we're in Established. The OPEN still needs to be validated here too.

Split BGP Message History into major/all buffers

634dc33

Plumb major/all buffers through the BGP Message History, mirroring the major/all buffer split of the BGP FSM History. In this case, the only message type that doesn't live in the "major" buffer is keepalives.

Add dev docs

a146209

Enable IP DSCP / IPv6 Traffic Class for BGP peers

f1497e3

Adds a configurable DSCP value for IP packets carrying BGP sessions. Fixes: #565

taspelund requested review from internet-diglett, rcgoodfellow and zeeshanlakhani February 27, 2026 03:42

taspelund self-assigned this Feb 27, 2026

taspelund added Bug bgp Border Gateway Protocol labels Feb 27, 2026

taspelund added mgd Maghemite daemon customer For any bug reports or feature requests tied to customer requests static Static Routing rust Pull requests that update rust code labels Feb 27, 2026

taspelund added 8 commits February 27, 2026 07:19

cargo fmt gooder

053af1a

un-flatten BgpParameters

542ef51

Stop flattening BgpParameters in Neighbor/UnnumberedNeighbor structs. This allows client to de-duplicate all the common parameters and only implement logic for it once. Fixes: #635

Add detail option for mgadm rib status

412462e

Adds a display mode for mgadm rib status imported/selected, which allows mgadm users to choose whether they want the output in a detailed or summarized (default) format.

bfd: add ipv6 peers to bfd tests

c9bde3b

Add justfile

c16a85b

Adds a justfile for running simple build, test and verification commands with some logic in place for platform-specific exclusions. Updates dev docs to mention justfile for folks who want to use it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spring Cleaning#648

Spring Cleaning#648
taspelund wants to merge 32 commits intomainfrom
trey/random-fix-fun-time

taspelund commented Feb 27, 2026 •

edited

Loading

Uh oh!

taspelund commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

taspelund commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taspelund commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

taspelund commented Feb 27, 2026 •

edited

Loading