Skip to content

Conversation

@qmonnet
Copy link
Member

@qmonnet qmonnet commented Dec 13, 2025

Stateless Port and Address Translation (PAT) for the dataplane.

Best reviewed per-commit, but hold on tight, this PR is dense.

Follow-ups include:

  • Obviously, passing the port ranges from the user config.
  • Adding port ranges support in stateful NAT (we already support NAT-ing ports in stateful NAT, the update should include allocating only from allowed target ranges, dropping packets for packets not coming from allowed ranges).
  • ICMP. How do we do ICMP with PAT?? Still to be determined.
  • Probably some more optimisations or code clarifications here and there.

@qmonnet qmonnet added this to the GW R2 milestone Dec 13, 2025
@qmonnet qmonnet self-assigned this Dec 13, 2025
@qmonnet qmonnet added area/nat Related to Network Address Translation (NAT) ci:-upgrade Disable VLAB upgrade tests labels Dec 13, 2025
@qmonnet qmonnet force-pushed the pr/qmonnet/pat branch 7 times, most recently from 6b5ba67 to a3c771f Compare December 17, 2025 00:18
@qmonnet qmonnet requested a review from mvachhar December 17, 2025 00:22
@qmonnet
Copy link
Member Author

qmonnet commented Dec 17, 2025

I think it's ready for a first pass

@qmonnet qmonnet marked this pull request as ready for review December 17, 2025 00:23
@qmonnet qmonnet requested a review from a team as a code owner December 17, 2025 00:23
@qmonnet qmonnet requested a review from Copilot December 17, 2025 00:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements stateless Port Address Translation (PAT) for the dataplane, adding support for port ranges in NAT operations. The implementation extends the existing prefix-based NAT system to handle combinations of IP prefixes and port ranges, enabling fine-grained control over which ports are translated.

Key Changes:

  • Introduced PrefixWithOptionalPorts and PortRange types to represent IP prefixes with associated port ranges
  • Extended NAT tables to support both address-only translation (NAT) and port+address translation (PAT)
  • Updated lookup and translation logic throughout the dataplane to consider port ranges

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated no comments.

Show a summary per file
File Description
lpm/src/prefix/with_ports.rs New module defining port range types and IP prefix/port combinations
lpm/src/prefix/range_map.rs New DisjointRangesBTreeMap for efficient range-based lookups
nat/src/stateless/setup/tables.rs Extended NAT tables with PAT support, including PortAddrTranslationValue
nat/src/stateless/setup/range_builder.rs Updated range builder to handle port ranges in offset calculations
nat/src/stateless/mod.rs Added port translation methods and integrated port lookup
pkt-meta/src/dst_vpcd_lookup/mod.rs Updated destination VPC lookup to consider port ranges
config/src/utils/collapse.rs Simplified prefix collapsing to use new port range subtraction logic
config/src/external/overlay/vpcpeering.rs Updated VpcExpose validation for port-aware prefix overlapping
nat/src/stateless/test.rs Added comprehensive tests for PAT functionality
lpm/src/trie/*.rs Added matching_entries method for finding all matching prefixes

/// If the port range is empty, this method should return the number of addresses in the address
/// range: so this will produce a different value than an object with the same IP range, but
/// with a port range covering all existing ports.
fn size(&self) -> PrefixWithPortsSize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this semantic distinction? is it just to allow compatibility between an ip range and an ip range with ports?

In other words, if this trait is called IpRangeWithPorts, then should it always count (Ip, port) pairs available, so no port ranges means the size is the same as all port ranges?

If we what this to mean something like IpRangeOrIpRangeWithPorts (terrible name, we should choose something shorter), then the better way to say this is that it returns the available size of items. For an IpRange that would be the number of IPs for an IPPortRange, that would be a different thing entirely.

Perhaps a discussion on the correct design here is in order.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is in fact outdated. I had a previous version of the code that made a distinction, but it's been changed a while ago. I'll remove the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment removed

@@ -0,0 +1,299 @@
// SPDX-License-Identifier: Apache-2.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the tests for this new functionality? I prefer fuzz tests if they are easy, but there should at least be some tests right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see what functionality you refer to, in particular. There are some unit tests in follow-up commits after the introduction of these structs and methods.

Fuzzing is tricky, most of the interesting methods to fuzz require re-implementing the logic in the fuzzer to validate correctness. I don't have any fuzzing tests in this PR, we could look into it but it would probably be for a follow-up.

#[derive(Debug, Clone)]
enum VpcDiscriminantTableValue {
PortRangeTable(DisjointRangesBTreeMap<PortRange, Option<VpcDiscriminant>>),
NoPorts(Option<VpcDiscriminant>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like this enumeration approach better for the IpRangeWithPorts trait as well instead of using an empty port range to mean IPs only. It might clean up my earlier comment about the size method as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does the order here matter? Do we need a BTreeMap or would a HashMap suffice? For small overlaps though BTreeMap might be better. Just thought I'd point out the tradeoff. What do we expect for size here if people are mapping port ranges from one ISP provided IP to many internal hosts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like this enumeration approach better for the IpRangeWithPorts trait as well instead of using an empty port range to mean IPs only. It might clean up my earlier comment about the size method as well.

I'll see if I can improve it.

Also, does the order here matter? Do we need a BTreeMap or would a HashMap suffice? For small overlaps though BTreeMap might be better. Just thought I'd point out the tradeoff.

I don't think a HashMap would do. For a given IP address, I don't know how to find the matching prefix used as a key in the HashMap without doing a linear search on the entries; for an ordered BTreeMap, I can do that.

What do we expect for size here if people are mapping port ranges from one ISP provided IP to many internal hosts?

One BTreeMap per port range, with the associated IPs. You need to store these mappings one way or another anyway, unless you find a way to “compress” it the difference is the overhead related to the data structure in use, I think

Copy link
Member Author

@qmonnet qmonnet Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a commit at the end to use an enum for PrefixWithOptionalPorts, I think this is what you meant (but I'm not 100% sure). Please take a look at it.

refactor(lpm): Use enum for PrefixWithOptionalPorts


#[cfg(test)]
mod tests {
use super::*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so now we have some tests, but still missing the basic tests for overlap.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What tests are you referring to, exactly?

As discussed during our call, I'll add some tests to see how excluding a range of ports for the full range of IPs behaves. Does this address the comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added commit test(config): Add tests for VpcExpose validation with port ranges, I think it should address your concern (except for fuzzing).

@mvachhar
Copy link
Contributor

Great initial cut, but I think it still needs some work.

In preparation for Port Address Translation in stateless NAT, add new
structs alongside the Prefix defined in crate lpm. Namely, we add:

- PortRange: a port range
- PrefixWithPorts: a Prefix, with an associated PortRange
- PrefixWithOptionalPorts: a Prefix, with an optional PortRange
- trait IpRangeWithPorts: a trait for common methods for structs
  implementing ranges of IPs (such as a Prefix) and ports. It makes the
  interface for such structs more consistent, and avoids duplicating the
  size() method.

These structs will be used in a follow-up commit.

Signed-off-by: Quentin Monnet <[email protected]>
Prepare for Port Address Translation in NAT. This commit is based on a
simple change. The following struct:

     pub struct VpcExpose {
         pub ips: BTreeSet<Prefix>,
         pub nots: BTreeSet<Prefix>,
         pub nat: Option<VpcExposeNat>,
     }

becomes:

     pub struct VpcExpose {
         pub ips: BTreeSet<PrefixWithOptionalPorts>,
         pub nots: BTreeSet<PrefixWithOptionalPorts>,
         pub nat: Option<VpcExposeNat>,
     }

In other words, we attach an optional port range to each prefix in the
various prefix lists of a VpcExpose object. Such objects are used in
several locations in the code, and as a consequence, we need to update
all of these. This includes:

- The gRPC/K8s conversion methods in crate config
- The validation steps for VpcExpose objects in crate config
- The "prefix collapse" utilities in crate config
- Building route advertisement in crate mgmt
- The stateful NAT allocator in crate nat
- The stateless NAT context build in crate nat
- The destination VPC lookup stage in crate pkt-meta

WARNING: These locations get trivial updates to make the code compile.
The surrounding logic is not updated to account for port ranges in this
commit. It will be updated, where relevant, in follow-up commits.

Signed-off-by: Quentin Monnet <[email protected]>
As part of this, we refactor validation to some extent, in particular
for prefixes overlapping detection, making it easier to follow at the
expense of more lines of code.

Signed-off-by: Quentin Monnet <[email protected]>
Now that VpcExpose accepts optional port ranges associated with IP
prefixes, and that we've updated the VpcExpose validation steps to
account for these port ranges, let's add some unit tests to make sure
validation behaves as we expect.

Signed-off-by: Quentin Monnet <[email protected]>
Introduce a new struct DisjointRangesBTreeMap, containing a generic
BTreeMap.

This struct is meant to be used with map keys representing
non-overlapping ranges, for example: disjoint port ranges or disjoint IP
prefixes.

The idea is the following: we can store IP ranges, for example, in a
DisjointRangesBTreeMap object, and look up for a single IP address, to
find the corresponding map key (the IP range) and the associated value,
whatever it is. The struct implements this lookup, so that we don't have
to re-implement it each time we want to use a similar BTreeMap with
different element types. The way we perform this lookup is by retrieving
the closest lower map key (IP range) matching the IP address, using
methods provided by the BTreeMap.

We already use similar structs in several locations in the code, and we
used to have it for stateless NAT in the past (before we switched to a
LPM). We also plan on adding a new one for the destination VPC lookup
discriminant stage, in a future commit. Introducing a
DisjointRangesBTreeMap should help harmonise these, although we don't
convert the existing similar structs at the moment.

Notes:

- As for the location of this struct: should it be in crate lpm? It's
  not directly related to the LPM implementation we have, but I didn't
  find a location that would really be better suited, and I don't think
  it deserves a dedicated crate. The crate also hosts the implementation
  for the Prefix structs, which are not completely tied to the LPM,
  either.

- Maybe re-implementing many methods (iter(), range(), ...) for the
  DisjointRangesBTreeMap, just piggybacking on the methods for the inner
  BTreeMap, might be overly verbose. If it's too much we may change in
  the future to expose the inner BTreeMap, and use the methods from
  there, although I'd rather keep it encapsulated if possible.

Signed-off-by: Quentin Monnet <[email protected]>
In preparation for supporting port ranges in "expose" objects, for Port
Address Translation, adjust the destination VPC discriminant lookup
stage.

So far, for a given packet, the source VPC and the destination VPC would
usually provide the destination VPC. When we add port ranges to
VpcExpose objects, we may have two packets from the same VPC, with the
same destination IP, but with L4 destination port from different ranges,
maybe associated to distinct peerings corresponding with distinct VPCs.
In this case, the source VPC and destination address are no longer
enough to determine the destination VPC: we must take the destination
port into account, too. This commit adjust the VpcDiscriminantTable and
related lookup accordingly.

As part of this, we re-implement the lookup() and insert() method for
the VpcDiscriminantTable itself, in order to mask the additional
complexity when working with the inner IpPrefixTrie.

Signed-off-by: Quentin Monnet <[email protected]>
"Collapsing" prefixes, in the context of this commit, comes down to
"applying" exclusion prefixes to a set of prefixes, by splitting the
latter set into subsets containing all the initial addresses, minus
those from the exclusion prefixes. This is helpful for NAT, because
after we've collapsed prefixes, we don't need to work with exclusion
prefixes anymore.

In preparation for Port Address Translation, we've updated VpcExpose to
associate optional port ranges to IP prefixes. As a result, we now need
to account for port ranges when "applying" exclusion
prefixes-with-port-ranges to prefixes-with-port-ranges. This is not
trivial to do, but with a bit of refactoring, it is not as scary as it
sounds.

Signed-off-by: Quentin Monnet <[email protected]>
Add unit tests for methods subtract() and intersection() for structs
PrefixWithPorts and PrefixWithOptionalPorts, to validate the logic.

Tests drafted with Claude code, then validated and refined by hand.

Signed-off-by: Quentin Monnet <[email protected]>
To test the updated prefix collapsing implementation, which now accounts
for port ranges alongside IP prefixes, add a bunch of unit tests.

Tests drafted by Claude Code - on the model of the existing tests that
don't take port ranges into account - then adjusted and validated by
yours truly.

Signed-off-by: Quentin Monnet <[email protected]>
This is some simple renaming, in preparation for more changes in the
code for the stateless NAT tables.

Struct TrieRange will get an alternative struct with port ranges, soon.
Make sure we can distinguish the two by renaming the existing one as
what it is: an IP address range.

Once we've renamed it, it no longer makes sense to call it's length
method ip_len(). Instead, just make it len().

Signed-off-by: Quentin Monnet <[email protected]>
Working towards support for Port Address Translation in stateless NAT,
update the structure of the stateless NAT tables to account for port
ranges: we turn NatTableValue into an enum with variants
AddrTranslationValue (IP range only) and PortAddrTranslationValue (IP
and port ranges). We introduce the IpPortRange type, similar to IpRange
but with an associated port range. It differs from PrefixWithPorts from
lpm crate in that the IP range may not be a CIDR prefix.

As for the map lookup, we used to do a simple LPM lookup to find the IP
range (LPM key) corresponding to a given IP address, we'd retrieve
associated target prefixes (LPM value) and do the mapping. IP ranges in
the LPM never overlapped. With ports, it's a bit different: we can have
overlapping IP ranges (as long as their respective associated port
ranges make them disjoint). As a consequence, the longest matching IP
prefix may no longer be the one matching the IP _and the port_ from the
packet. Instead, we need to iterate over _all matching IP prefixes_ to
find the one with the correct associated port. See method
matching_entries() added to the LPM.

Adjustments for building the table will come in a follow-up commit.

Signed-off-by: Quentin Monnet <[email protected]>
Start Bolero tests update for testing the build of the context for
stateless NAT, although we don't go as far as to add actual port ranges
as part of the validation at this time.

Signed-off-by: Quentin Monnet <[email protected]>
Now that stateless NAT tables support IP prefixes with associated port
ranges, update the RangeBuilder struct and its implementation to build
the stateless NAT tables accordingly. It's a bit messy, but in the end
we get there.

Note: we convert back the generated PortAddrTranslationValue to a
simpler AddrTranslationValue when possible. This allows simpler (and
more efficient) processing when port ranges are not required. It may be
more correct to convert back _only_ when port ranges are not specified
at all in the config for the relevant VpcExpose. But we'd need a pass
before building the tables, to determine whether any of the prefixes in
the lists of the VpcExpose uses ports (at the moment, we only have this
information for the current prefix being processed, not for the other
original prefixes). We may also want to optimise the RangeBuilder if we
know there is no port range for a given Expose; although it's not
critical, as this code only runs at config build time.

Signed-off-by: Quentin Monnet <[email protected]>
Add unit tests to validate some of the functions used when building the
context for stateless NAT. There is no test for port range support in
RangeBuilder at this point, only for the functions it uses to build the
ranges.

Signed-off-by: Quentin Monnet <[email protected]>
After the various complex commits that ensued from the addition of port
ranges to VpcExpose objects, this commit looks pretty straightforward.
If we get a port mapping from the stateless NAT tables lookup, update
the source and/or destination L4 port accordingly for the packet. Port
update is a bit messy because we need to process each protocol
separately; we might want to refactor this in the future.

Signed-off-by: Quentin Monnet <[email protected]>
The idea behind the stateless NAT lookup is that, for a given IP and
port, we find the matching covering prefix (and associated port range)
in a LPM trie, and get the list of corresponding target ranges from the
LPM value. Then we retrieve the offset of the IP and port within the
prefix (LPM key), and look for the corresponding IP and port within the
target prefixes (LPM value).

To find the IP and port at the right offset, we need to iterate over all
target prefixes in the list and to compare the accumulated size with the
offset. This is not efficient; instead, we can use another data
structure to perform a faster lookup.

To that end, replace the list of prefixes with a DisjointRangesBTreeMap
that associates a target prefix to each portion of the original prefix.

Suggested-by: Manish Vachharajani <[email protected]>
Signed-off-by: Quentin Monnet <[email protected]>
Make sure translation for stateless PAT works as "expected": we rely on
the implementation using ordered and sorted prefixes to compute the
expected matching IP/port for some IP/port combinations.

Signed-off-by: Quentin Monnet <[email protected]>
Using an Option<PortRange> to make the port range optional may lead to
confusion, in particular when we attempt to get the .size() of the
object. It's cleaner to use an enum instead, which can take either a
Prefix or a PrefixWithPorts (reusing the latter's methods to implement
equivalent methods for PrefixWithOptionalPorts) - even though it's a bit
more verbose overall.

Suggested-by: Manish Vachharajani <[email protected]>
Signed-off-by: Quentin Monnet <[email protected]>
@qmonnet
Copy link
Member Author

qmonnet commented Dec 18, 2025

@mvachhar Your feedback should be mostly addressed, please take another look when you can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/nat Related to Network Address Translation (NAT) ci:-upgrade Disable VLAB upgrade tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stateless NAT, Port Address Translation: implement translation Stateless NAT, Port Address Translation: build context from API

2 participants