-
Notifications
You must be signed in to change notification settings - Fork 6
Port Address Translation (stateless) #1137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a91378a to
7ddd0b8
Compare
6b5ba67 to
a3c771f
Compare
|
I think it's ready for a first pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements stateless Port Address Translation (PAT) for the dataplane, adding support for port ranges in NAT operations. The implementation extends the existing prefix-based NAT system to handle combinations of IP prefixes and port ranges, enabling fine-grained control over which ports are translated.
Key Changes:
- Introduced
PrefixWithOptionalPortsandPortRangetypes to represent IP prefixes with associated port ranges - Extended NAT tables to support both address-only translation (NAT) and port+address translation (PAT)
- Updated lookup and translation logic throughout the dataplane to consider port ranges
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
lpm/src/prefix/with_ports.rs |
New module defining port range types and IP prefix/port combinations |
lpm/src/prefix/range_map.rs |
New DisjointRangesBTreeMap for efficient range-based lookups |
nat/src/stateless/setup/tables.rs |
Extended NAT tables with PAT support, including PortAddrTranslationValue |
nat/src/stateless/setup/range_builder.rs |
Updated range builder to handle port ranges in offset calculations |
nat/src/stateless/mod.rs |
Added port translation methods and integrated port lookup |
pkt-meta/src/dst_vpcd_lookup/mod.rs |
Updated destination VPC lookup to consider port ranges |
config/src/utils/collapse.rs |
Simplified prefix collapsing to use new port range subtraction logic |
config/src/external/overlay/vpcpeering.rs |
Updated VpcExpose validation for port-aware prefix overlapping |
nat/src/stateless/test.rs |
Added comprehensive tests for PAT functionality |
lpm/src/trie/*.rs |
Added matching_entries method for finding all matching prefixes |
| /// If the port range is empty, this method should return the number of addresses in the address | ||
| /// range: so this will produce a different value than an object with the same IP range, but | ||
| /// with a port range covering all existing ports. | ||
| fn size(&self) -> PrefixWithPortsSize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make this semantic distinction? is it just to allow compatibility between an ip range and an ip range with ports?
In other words, if this trait is called IpRangeWithPorts, then should it always count (Ip, port) pairs available, so no port ranges means the size is the same as all port ranges?
If we what this to mean something like IpRangeOrIpRangeWithPorts (terrible name, we should choose something shorter), then the better way to say this is that it returns the available size of items. For an IpRange that would be the number of IPs for an IPPortRange, that would be a different thing entirely.
Perhaps a discussion on the correct design here is in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is in fact outdated. I had a previous version of the code that made a distinction, but it's been changed a while ago. I'll remove the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment removed
| @@ -0,0 +1,299 @@ | |||
| // SPDX-License-Identifier: Apache-2.0 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are the tests for this new functionality? I prefer fuzz tests if they are easy, but there should at least be some tests right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I see what functionality you refer to, in particular. There are some unit tests in follow-up commits after the introduction of these structs and methods.
Fuzzing is tricky, most of the interesting methods to fuzz require re-implementing the logic in the fuzzer to validate correctness. I don't have any fuzzing tests in this PR, we could look into it but it would probably be for a follow-up.
| #[derive(Debug, Clone)] | ||
| enum VpcDiscriminantTableValue { | ||
| PortRangeTable(DisjointRangesBTreeMap<PortRange, Option<VpcDiscriminant>>), | ||
| NoPorts(Option<VpcDiscriminant>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like this enumeration approach better for the IpRangeWithPorts trait as well instead of using an empty port range to mean IPs only. It might clean up my earlier comment about the size method as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, does the order here matter? Do we need a BTreeMap or would a HashMap suffice? For small overlaps though BTreeMap might be better. Just thought I'd point out the tradeoff. What do we expect for size here if people are mapping port ranges from one ISP provided IP to many internal hosts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like this enumeration approach better for the IpRangeWithPorts trait as well instead of using an empty port range to mean IPs only. It might clean up my earlier comment about the size method as well.
I'll see if I can improve it.
Also, does the order here matter? Do we need a BTreeMap or would a HashMap suffice? For small overlaps though BTreeMap might be better. Just thought I'd point out the tradeoff.
I don't think a HashMap would do. For a given IP address, I don't know how to find the matching prefix used as a key in the HashMap without doing a linear search on the entries; for an ordered BTreeMap, I can do that.
What do we expect for size here if people are mapping port ranges from one ISP provided IP to many internal hosts?
One BTreeMap per port range, with the associated IPs. You need to store these mappings one way or another anyway, unless you find a way to “compress” it the difference is the overhead related to the data structure in use, I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a commit at the end to use an enum for PrefixWithOptionalPorts, I think this is what you meant (but I'm not 100% sure). Please take a look at it.
refactor(lpm): Use enum for PrefixWithOptionalPorts
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
| use super::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so now we have some tests, but still missing the basic tests for overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What tests are you referring to, exactly?
As discussed during our call, I'll add some tests to see how excluding a range of ports for the full range of IPs behaves. Does this address the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added commit test(config): Add tests for VpcExpose validation with port ranges, I think it should address your concern (except for fuzzing).
|
Great initial cut, but I think it still needs some work. |
a3c771f to
05d9959
Compare
In preparation for Port Address Translation in stateless NAT, add new structs alongside the Prefix defined in crate lpm. Namely, we add: - PortRange: a port range - PrefixWithPorts: a Prefix, with an associated PortRange - PrefixWithOptionalPorts: a Prefix, with an optional PortRange - trait IpRangeWithPorts: a trait for common methods for structs implementing ranges of IPs (such as a Prefix) and ports. It makes the interface for such structs more consistent, and avoids duplicating the size() method. These structs will be used in a follow-up commit. Signed-off-by: Quentin Monnet <[email protected]>
Prepare for Port Address Translation in NAT. This commit is based on a
simple change. The following struct:
pub struct VpcExpose {
pub ips: BTreeSet<Prefix>,
pub nots: BTreeSet<Prefix>,
pub nat: Option<VpcExposeNat>,
}
becomes:
pub struct VpcExpose {
pub ips: BTreeSet<PrefixWithOptionalPorts>,
pub nots: BTreeSet<PrefixWithOptionalPorts>,
pub nat: Option<VpcExposeNat>,
}
In other words, we attach an optional port range to each prefix in the
various prefix lists of a VpcExpose object. Such objects are used in
several locations in the code, and as a consequence, we need to update
all of these. This includes:
- The gRPC/K8s conversion methods in crate config
- The validation steps for VpcExpose objects in crate config
- The "prefix collapse" utilities in crate config
- Building route advertisement in crate mgmt
- The stateful NAT allocator in crate nat
- The stateless NAT context build in crate nat
- The destination VPC lookup stage in crate pkt-meta
WARNING: These locations get trivial updates to make the code compile.
The surrounding logic is not updated to account for port ranges in this
commit. It will be updated, where relevant, in follow-up commits.
Signed-off-by: Quentin Monnet <[email protected]>
As part of this, we refactor validation to some extent, in particular for prefixes overlapping detection, making it easier to follow at the expense of more lines of code. Signed-off-by: Quentin Monnet <[email protected]>
Now that VpcExpose accepts optional port ranges associated with IP prefixes, and that we've updated the VpcExpose validation steps to account for these port ranges, let's add some unit tests to make sure validation behaves as we expect. Signed-off-by: Quentin Monnet <[email protected]>
Introduce a new struct DisjointRangesBTreeMap, containing a generic BTreeMap. This struct is meant to be used with map keys representing non-overlapping ranges, for example: disjoint port ranges or disjoint IP prefixes. The idea is the following: we can store IP ranges, for example, in a DisjointRangesBTreeMap object, and look up for a single IP address, to find the corresponding map key (the IP range) and the associated value, whatever it is. The struct implements this lookup, so that we don't have to re-implement it each time we want to use a similar BTreeMap with different element types. The way we perform this lookup is by retrieving the closest lower map key (IP range) matching the IP address, using methods provided by the BTreeMap. We already use similar structs in several locations in the code, and we used to have it for stateless NAT in the past (before we switched to a LPM). We also plan on adding a new one for the destination VPC lookup discriminant stage, in a future commit. Introducing a DisjointRangesBTreeMap should help harmonise these, although we don't convert the existing similar structs at the moment. Notes: - As for the location of this struct: should it be in crate lpm? It's not directly related to the LPM implementation we have, but I didn't find a location that would really be better suited, and I don't think it deserves a dedicated crate. The crate also hosts the implementation for the Prefix structs, which are not completely tied to the LPM, either. - Maybe re-implementing many methods (iter(), range(), ...) for the DisjointRangesBTreeMap, just piggybacking on the methods for the inner BTreeMap, might be overly verbose. If it's too much we may change in the future to expose the inner BTreeMap, and use the methods from there, although I'd rather keep it encapsulated if possible. Signed-off-by: Quentin Monnet <[email protected]>
In preparation for supporting port ranges in "expose" objects, for Port Address Translation, adjust the destination VPC discriminant lookup stage. So far, for a given packet, the source VPC and the destination VPC would usually provide the destination VPC. When we add port ranges to VpcExpose objects, we may have two packets from the same VPC, with the same destination IP, but with L4 destination port from different ranges, maybe associated to distinct peerings corresponding with distinct VPCs. In this case, the source VPC and destination address are no longer enough to determine the destination VPC: we must take the destination port into account, too. This commit adjust the VpcDiscriminantTable and related lookup accordingly. As part of this, we re-implement the lookup() and insert() method for the VpcDiscriminantTable itself, in order to mask the additional complexity when working with the inner IpPrefixTrie. Signed-off-by: Quentin Monnet <[email protected]>
"Collapsing" prefixes, in the context of this commit, comes down to "applying" exclusion prefixes to a set of prefixes, by splitting the latter set into subsets containing all the initial addresses, minus those from the exclusion prefixes. This is helpful for NAT, because after we've collapsed prefixes, we don't need to work with exclusion prefixes anymore. In preparation for Port Address Translation, we've updated VpcExpose to associate optional port ranges to IP prefixes. As a result, we now need to account for port ranges when "applying" exclusion prefixes-with-port-ranges to prefixes-with-port-ranges. This is not trivial to do, but with a bit of refactoring, it is not as scary as it sounds. Signed-off-by: Quentin Monnet <[email protected]>
Add unit tests for methods subtract() and intersection() for structs PrefixWithPorts and PrefixWithOptionalPorts, to validate the logic. Tests drafted with Claude code, then validated and refined by hand. Signed-off-by: Quentin Monnet <[email protected]>
To test the updated prefix collapsing implementation, which now accounts for port ranges alongside IP prefixes, add a bunch of unit tests. Tests drafted by Claude Code - on the model of the existing tests that don't take port ranges into account - then adjusted and validated by yours truly. Signed-off-by: Quentin Monnet <[email protected]>
This is some simple renaming, in preparation for more changes in the code for the stateless NAT tables. Struct TrieRange will get an alternative struct with port ranges, soon. Make sure we can distinguish the two by renaming the existing one as what it is: an IP address range. Once we've renamed it, it no longer makes sense to call it's length method ip_len(). Instead, just make it len(). Signed-off-by: Quentin Monnet <[email protected]>
Working towards support for Port Address Translation in stateless NAT, update the structure of the stateless NAT tables to account for port ranges: we turn NatTableValue into an enum with variants AddrTranslationValue (IP range only) and PortAddrTranslationValue (IP and port ranges). We introduce the IpPortRange type, similar to IpRange but with an associated port range. It differs from PrefixWithPorts from lpm crate in that the IP range may not be a CIDR prefix. As for the map lookup, we used to do a simple LPM lookup to find the IP range (LPM key) corresponding to a given IP address, we'd retrieve associated target prefixes (LPM value) and do the mapping. IP ranges in the LPM never overlapped. With ports, it's a bit different: we can have overlapping IP ranges (as long as their respective associated port ranges make them disjoint). As a consequence, the longest matching IP prefix may no longer be the one matching the IP _and the port_ from the packet. Instead, we need to iterate over _all matching IP prefixes_ to find the one with the correct associated port. See method matching_entries() added to the LPM. Adjustments for building the table will come in a follow-up commit. Signed-off-by: Quentin Monnet <[email protected]>
Start Bolero tests update for testing the build of the context for stateless NAT, although we don't go as far as to add actual port ranges as part of the validation at this time. Signed-off-by: Quentin Monnet <[email protected]>
Now that stateless NAT tables support IP prefixes with associated port ranges, update the RangeBuilder struct and its implementation to build the stateless NAT tables accordingly. It's a bit messy, but in the end we get there. Note: we convert back the generated PortAddrTranslationValue to a simpler AddrTranslationValue when possible. This allows simpler (and more efficient) processing when port ranges are not required. It may be more correct to convert back _only_ when port ranges are not specified at all in the config for the relevant VpcExpose. But we'd need a pass before building the tables, to determine whether any of the prefixes in the lists of the VpcExpose uses ports (at the moment, we only have this information for the current prefix being processed, not for the other original prefixes). We may also want to optimise the RangeBuilder if we know there is no port range for a given Expose; although it's not critical, as this code only runs at config build time. Signed-off-by: Quentin Monnet <[email protected]>
Add unit tests to validate some of the functions used when building the context for stateless NAT. There is no test for port range support in RangeBuilder at this point, only for the functions it uses to build the ranges. Signed-off-by: Quentin Monnet <[email protected]>
After the various complex commits that ensued from the addition of port ranges to VpcExpose objects, this commit looks pretty straightforward. If we get a port mapping from the stateless NAT tables lookup, update the source and/or destination L4 port accordingly for the packet. Port update is a bit messy because we need to process each protocol separately; we might want to refactor this in the future. Signed-off-by: Quentin Monnet <[email protected]>
The idea behind the stateless NAT lookup is that, for a given IP and port, we find the matching covering prefix (and associated port range) in a LPM trie, and get the list of corresponding target ranges from the LPM value. Then we retrieve the offset of the IP and port within the prefix (LPM key), and look for the corresponding IP and port within the target prefixes (LPM value). To find the IP and port at the right offset, we need to iterate over all target prefixes in the list and to compare the accumulated size with the offset. This is not efficient; instead, we can use another data structure to perform a faster lookup. To that end, replace the list of prefixes with a DisjointRangesBTreeMap that associates a target prefix to each portion of the original prefix. Suggested-by: Manish Vachharajani <[email protected]> Signed-off-by: Quentin Monnet <[email protected]>
Make sure translation for stateless PAT works as "expected": we rely on the implementation using ordered and sorted prefixes to compute the expected matching IP/port for some IP/port combinations. Signed-off-by: Quentin Monnet <[email protected]>
Using an Option<PortRange> to make the port range optional may lead to confusion, in particular when we attempt to get the .size() of the object. It's cleaner to use an enum instead, which can take either a Prefix or a PrefixWithPorts (reusing the latter's methods to implement equivalent methods for PrefixWithOptionalPorts) - even though it's a bit more verbose overall. Suggested-by: Manish Vachharajani <[email protected]> Signed-off-by: Quentin Monnet <[email protected]>
05d9959 to
c0f0db9
Compare
|
@mvachhar Your feedback should be mostly addressed, please take another look when you can. |
Stateless Port and Address Translation (PAT) for the dataplane.
Best reviewed per-commit, but hold on tight, this PR is dense.
Follow-ups include: