Open
Conversation
Three tests failed under cargo test (thread-per-test) but passed under cargo nextest (process-per-test) due to shared global state in the channel-based test network simulation: 1. unnumbered_peering_helper() leaked dispatcher threads by not returning dispatcher handles to callers. Leaked threads would periodically re-bind to addresses in the global NET endpoint map, stealing entries from concurrently running tests. 2. The simulated Network had no unbind mechanism, so endpoint entries persisted after a Listener was dropped, leaving stale state for subsequent tests. 3. Several tests used hardcoded scope_ids (2, 3) for link-local addresses, causing collisions in the shared NET endpoint map when tests ran as concurrent threads in the same process. Fix by: - Adding Network::unbind() and Listener::Drop to clean up NET endpoint entries when a listener goes out of scope - Returning dispatchers from unnumbered_peering_helper() so all callers can shut them down during test cleanup - Switching hardcoded scope_ids to next_scope_id() for unique per-test addresses, matching the newer test infrastructure
This adds logic to send_update() to be able to split a large amount of prefixes that would cause the Update to exceed the max BGP packet size. Now, if we have more originated prefixes than can fit into a single Update, we will split the prefixes into chunks that fit the remaining available space (max size - header - fixed length fields - attributes). This also updates the basic_update* tests to exchange and validate 4k routes instead of just 1. This not only exercises the Update chunking, but also recreates the conditions where #634 was observed. Fixes: #614
Allow static routes to use cross-address-family next-hops (e.g. IPv4 prefix via IPv6 next-hop and vice versa). Introduce API version 7 (EXTENDED_NH_STATIC) with StaticRoute4 and StaticRoute6 types that accept IpAddr for the nexthop field instead of the AF-specific Ipv4Addr/Ipv6Addr. The previous types are retired as V1 variants with version-gated endpoints. The internal layers (StaticRouteKey, Path, RIB) already handle cross-AF next-hops; this closes the gap at the API boundary. Closes: #581
When stdout is piped into a program that closes early (e.g. `mgadm bgp status neighbors 47 | head`), Rust's println! macro panics on the resulting EPIPE, causing a SIGABRT. Add println_nopipe!/print_nopipe! macros to mg-common that silently exit on BrokenPipe instead of panicking. Use them throughout mgadm's non-test print calls. Also catch EPIPE errors propagated via ? from TabWriter writeln/flush paths at both mgadm and ddmadm main() boundaries. Fixes #567
The Ord impl we have for Path was only checking nexthop_interface and vlan_id for an identity check which returns Ordering::Equal if bgp is None and the (nexthop, nexthop_interface, vlan_id) are the same. However, when all these fields are not the same that impl moved onto a tie-breaker which compared the nexthop before moving onto the bgp cmp, without ever doing a cmp for nexthop_interface or vlan_id. This meant that any two Paths with the same nexthop but different nexthop_interface and/or vlan_id would be treated as identical paths. BTreeSet relies on this Ord impl to determine whether an insertion should replace or add an element. The above issue in the Ord impl would result in elements being replaced instead of added, which was observed in #528 when we static routes are pulled out of the persistent sled DB and into a BTreeSet before being returned to the API client. The impact of this is not limited to just `mgadm get-v{4,6}-routes` returning misleading information, but also missing static routes if mgd were to restart. However, this bug does not affect ECMP paths with separate next-hops since the nexthop field was compared as part of the tie-breaker when the composite identity check did not match. Fixes: #528
Add handling for changes to enforce-first-as that doesn't involve a hard reset of the session. When enabled, walk the RIB and remove paths from the peer that fail the first-as test. When disabled, send a route-refresh to the peer so we can re-learn any routes we may have filtered out. Fixes: #620
Consolidate paired address-family-specific methods into generic or AF-parameterized equivalents throughout the RIB database layer. Origin methods now use the Prefix enum for consistency with existing static routing methods. Rename BGP removal methods for clarity and fix over-approximate PCN notifications.
There's a lot of code duplication across {create,get,set}_origin{4,6}
methods, which can be simplified to just use Prefix + AddressFamily.
This does exactly that -- unifies each method to just a single version
that accepts Prefix (AF-agnostic) and uses AddressFamily to distinguish
what should be installed where.
When an Update was marked treat-as-withdraw as a result of an error (either by the parser or by checks done by the FSM), it was never read or handled by the FSM. So no treat-as-withdraw error would ever actually result in routes being withdrawn... even worse, routes that should have been withdrawn due to errors caught by the parser would get installed if there wasn't a duplicated check in the FSM. There were also a couple other bugs that needed fixing too: 1) EnforceFirstAs was unconditionally applied to all peers, but should only be applied to eBGP peers. 2) FSM layer failures in check_update() would result in an early return rather than a withdrawal, which could lead to routes staying installed indefinitely with outdated attributes. For example, we would see issues as a result of (2) in this scenario: A) We receive an UPDATE for Route X with valid attrs B) We receive an UPDATE for Route X with invalid attrs If we return early after (B) then the second UPDATE is ignored, leaving us with Route X installed with the earlier (now outdated) valid attrs. The correct response to (B) is to withdraw Route X. Let's not stick our fingers in our ears and pretend the second UPDATE never happened. EnforceFirstAs was also explicitly checking for a missing AS_PATH path attribute, however this was already checked by the parser as a protocol level error (mandatory attribute missing == treat-as-withdraw). So this check has been removed, since we now catch treat-as-withdraw flagged by the parser.
There were multiple places where we kicked off a RIB walker as a result of an event pertaining to a BGP peer which unconditionally locked and walked the RIBs for both IPv4 and IPv6. Update these spots to only walk the RIBs for address-families that have been negotiated with the peer.
Add logic to send a NOTIFICATION if we can't find common ground with our BGP peer. If the peer sends us MP capabilities, then we treat the AFIs they send us as an exhaustive list of what they're willing to exchange. If the peer doesn't send us MP capabilities, then we treat them as a legacy peer that supports only IPv4 Unicast and only with traditional encoding. If we haven't enabled IPv4 Unicast and our peer doesn't support MP-BGP, then we do not have any AFIs in common and we'll reset the session. Also adds a basic integration test to ensure that two peers with non overlapping AFI/SAFIs never move into Established.
Adds a missing call to handle_open() in fsm_established(). This covers the case where we've enabled deterministic_collision_resolution and we get an OPEN while we're in Established. The OPEN still needs to be validated here too.
Updates is_major_event() to categorize more FSM events as major so they don't live just in the "all" buffer. Now the only events that are not major are Keepalives, ConnectRetryTimerExpires, IdleHoldTimerExpires, and KeepaliveTimerExpires. Every other message or event is significant since they're either low frequency or high importance (triggering an FSM state change, changing a connection state, updating the RIB, etc.)
Plumb major/all buffers through the BGP Message History, mirroring the major/all buffer split of the BGP FSM History. In this case, the only message type that doesn't live in the "major" buffer is keepalives.
Expands the advertised/imported prefix counters to be per AFI. Adds logic to track the last reset timestamp/reason. Adds logic to track the last sent/received NOTIFICATION and timestamp. Adds unit tests to validate the prefix counters and the last reset and last NOTIFICATION.
Adds clippy lints to deny wildcard match arms within the BGP FSM. This ensures that any future work will always perform an exhaustive match so there's no chance of an FSM event handler getting missed. The existing match arms were already exhaustive in almost all cases, but this adds an extra level of confidence that it won't change in the future.
Adds a configurable DSCP value for IP packets carrying BGP sessions. Fixes: #565
Adds logic to update the DSCP and TTL values on TCP sockets in-flight for all active connections of a SessionRunner. This means that we update the sockets and allow BGP to manage the aftermath. e.g. We won't explicitly close the connection when setting the TTL to 1, but that doesn't mean the connection isn't doomed. The connection may drop due to the practical implications of changing the TTL on the socket. Manual testing confirms (via snoop) that the QoS marking is properly updated (added/removed/changed) along with the BGP neighbor config, without causing a reset of the session.
The outgoing hop-limit and incoming min-hop filter for IPv6 BGP peers were broken in two ways: 1. apply_min_ttl() called TcpStream::set_ttl() unconditionally, which sets IP_TTL — an IPv4-only option. For IPv6 sockets this is silently ignored. Branch on address family and use setsockopt(IPV6_UNICAST_HOPS) for IPv6 peers. 2. set_ip_minttl() used IPV6_HOPLIMIT as the IPv6 equivalent of IP_MINTTL. IPV6_HOPLIMIT is a recvmsg ancillary-data option that delivers the received hop-limit as a cmsg — it does not filter on it. Replace with IPV6_MINHOPCOUNT, which is the actual IPv6 counterpart to IP_MINTTL. Define IPV6_MINHOPCOUNT as a local constant (value 73 for Linux, 47 for Illumos) since it is not exposed by the libc crate for either platform. Fixes: #647
Instead of asking the kernel to reset the TTL back to its own default, let's just manage this explicitly ourselves. This introduces a const BGP_DEFAULT_TTL which is set to 255 -- we are not a 40 year old routing platform that needs to have 100 different nerd knobs to make TTL work in a sane fashion, so let's avoid all the hubbub with making things different for eBGP/iBGP and just set our TTL to 255 from the get-go.
Contributor
Author
|
I do not intend for this PR to be considered for R19, but I did want to get it out for review so I can start working on feedback as it comes. |
Stop flattening BgpParameters in Neighbor/UnnumberedNeighbor structs. This allows client to de-duplicate all the common parameters and only implement logic for it once. Fixes: #635
Extend BgpPathProperties with origin, peer_ip, and internal fields and rewrite the bestpath algorithm to follow the complete RFC 4271 Section 9.1.2.2 decision process: origin preference, per-AS MED, eBGP over iBGP, multipath decision point, router-id, and peer-ip tie-breaking. Move PathOrigin from bgp::messages to rdb-types so it can be shared across the routing stack. Add BgpPathPropertiesV2/PathV2 for API backward compatibility and new v3 RIB endpoints that expose the additional fields. Add comprehensive bestpath tests covering each tie-breaking step individually and a full-pipeline integration test. Fixes: #527
Adds a display mode for mgadm rib status imported/selected, which allows mgadm users to choose whether they want the output in a detailed or summarized (default) format.
Instead of fetching the full RIB and filtering client-side, add a prefix query parameter to the RIB imported/selected status endpoints. This allows a client to query a specific prefix from the RIB instead of unconditionally pulling the full (potentially filtered) RIB every time.
Removes the filtering of originated routes from a received UPDATE. Scenario: 1. Peer advertises us a route that we are originating and we discard it. 2. We stop advertising that route. 3. Us withdrawing our advertisement doesn't cause the peer to choose a different bestpath for that route. 4. We no longer have a path for that route from this peer. This could also be addressed by: A) Sending a route-refresh to the peer when we stop originating routes. B) Inserting a Local path into the RIB for routes we are originating which is preferred over BGP paths learned from peers during bestpath. (A) is not a good idea because we don't know how large the peer's RIB will be and we'd have to re-process their entire RIB every time we stop advertising even a single route. (B) is fine conceptually, as most routing stacks use this approach: When using a "network" statement to pull routes from other protocols into the BGP topology or creating an aggregate, a locally originated path generally goes into the BGP LOC-RIB that is preferred over any peer-learned path. If an existing routing table entry doesn't exist, then this generally creates a blackhole route for the originated prefix (in order to prevent routing loops from occurring as a result of missing entries for the component routes). In the case of maghemite, well, really in the case of dendrite, this routing loop prevention is done without using blackhole routes: packets arriving on a front panel port that enocounter a NAT "miss" are dropped rather than submitted to an LPM lookup in the External routing table (avoiding us becoming a transit router + hairpinning packets sent to us from outside the rack). All that to say: we don't need local routes for any kind of forwarding (or blackholing) data plane behavior. This brings us to the crux of the issue: Inter-VPC routing. Since we don't currently have a native inter-VPC data plane within the rack, then we need to have installed an External route that covers our originated prefix regardless of what mask is used. If we do an exact match check and filtering, then we just make it harder for our peer to advertise us a route we need. We effectively require the peer to advertise us reachability to our own prefix, so long as the mask doesn't match what we're using (e.g. if we send a /24, they have to send us any mask other than a /24 that covers the original IP range). So while removing this filter doesn't address the issues inherent to us needing the peer to provide us a route back to our own address space, it does make it simpler for us to learn that route (no longer needing a route with anything but our own mask) and to avoid extra logic to ensure we can recover routes that we filtered. Fixes: #583
Adds a justfile for running simple build, test and verification commands with some logic in place for platform-specific exclusions. Updates dev docs to mention justfile for folks who want to use it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Each commit has a meaningful scope and message, and can be read in order when reviewing to help chunk up the PR into digestible chunks. Summary of the changes below.
BGP Protocol Correctness
QoS & Socket Management
Observability
RIB & Static Routing
Graceful Config Changes
Other
Issues Fixed
#527, #528, #565, #567, #581, #583, #614, #620, #647