
DNS servers should have NS and SOA records #8047


Open
iximeow wants to merge 45 commits into main from ixi/dns-ns-and-soa

Conversation

@iximeow (Member) commented Apr 24, 2025

this is probably the more exciting part of the issues outlined in #6944. the changes here get us to the point that for both internal and external DNS, we have:

  • A/AAAA records for the DNS servers in the internal/external group (named ns1.<zone>, ns2.<zone>, ...)
  • NS records for those servers at the zone apex, one for each of the ns*.<zone> described above
  • an SOA record synthesized on-demand for the zone apex for each of oxide.internal (for internal DNS) and $delegated_domain (for external DNS)
  • the SOA's serial is updated whenever the zone is changed. serial numbers are effectively the DNS config generation, so they start from 1 and tick upward with each change. this is different from most SOA serial schemes (in particular the ones that would use YYYYMMDDNN numbering schemes) but so far as i can tell this is consistent with RFC 1035 requirements.

we do not support zone transfers here. i believe the SOA record here would be reasonable to guide zone transfers if we did, but obviously that's not something i've tested.

SOA fields

the SOA record's RNAME is hardcoded to admin@<zone_name>. this is out of expediency to provide something, but it's probably wrong most of the time. there's no way to get an MX record installed for <zone_name> in the rack's external DNS servers, so barring DNS hijinks in the deployed environment, this will be a dead address. problems here are:

  • we would want to take in an administrative email at rack setup time, so that would be minor plumbing
  • more importantly, what to backfill this with for deployed systems?

it seems like the best answer here is to allow configuration of the rack's delegated domain and zone after initial setup, and being able to update an administrative email would fit in pretty naturally there. but we don't have that right now, so admin@ it is. configuration of external DNS is probably more important in the context of zone transfers and specifying the list of remote addresses to which we're willing to permit zone transfers. so it feels like this is in the API's future at some point.
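
for what it's worth, RFC 1035 encodes the SOA RNAME as a domain name whose first label is the mailbox's local part, which is why admin@<zone_name> shows up as admin.<zone_name> in query results (as in the dig output below). a minimal sketch of that mapping, with a hypothetical helper name:

fn rname_for_mailbox(local_part: &str, zone_name: &str) -> String {
    // naive mapping: a real encoder would also need to escape any dots inside
    // the local part, since RFC 1035 treats the whole mailbox as one domain name.
    format!("{local_part}.{zone_name}")
}

// rname_for_mailbox("admin", "oxide.test") == "admin.oxide.test"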

bonus

one minorly interesting observation along the way is that external DNS servers in particular are reachable at a few addresses - whichever address they get in the rack's internal address range, and whichever public address they get from the external address range. the public (external) address is what's used for A/AAAA records. so, if you're looking around from inside a DNS zone you can get odd-looking answers like:

# 172.30.1.5 is the internal address that an external DNS server is bound to.
# oxide.test is the delegated domain for this local Omicron deployment.
root@oxz_external_dns_68c5e255:~# dig +short ns2.oxide.test @172.30.1.5
192.168.0.161
root@oxz_external_dns_68c5e255:~# dig +short soa oxide.test @172.30.1.5
ns1.oxide.test. admin.oxide.test. 2 3600 600 18000 150
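# (for reference, dig +short prints the SOA fields in RFC 1035 presentation
# order: MNAME=ns1.oxide.test. RNAME=admin.oxide.test. SERIAL=2 REFRESH=3600
# RETRY=600 EXPIRE=18000 MINIMUM=150.)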
root@oxz_external_dns_68c5e255:~# dig +short ns oxide.test @172.30.1.5
ns1.oxide.test.
ns2.oxide.test.
# 192.168.0.160 is an external address for this same server.
# there are no records referencing 172.30.1.5 here.
root@oxz_external_dns_68c5e255:~# dig +short ns oxide.test @192.168.0.160
ns1.oxide.test.
ns2.oxide.test.
root@oxz_external_dns_68c5e255:~# dig +short ns1.oxide.test @192.168.0.160
192.168.0.160

@iximeow added the "release notes" label (reminder to include this in the release notes) Apr 24, 2025
@iximeow force-pushed the ixi/dns-ns-and-soa branch 2 times, most recently from 842455b to f349290 on April 25, 2025 21:50
@iximeow force-pushed the ixi/dns-ns-and-soa branch from f349290 to fa47ab1 on April 25, 2025 22:08
Comment on lines +174 to +207
impl From<Srv> for DnsRecord {
fn from(srv: Srv) -> Self {
DnsRecord::Srv(srv)
}
}

#[derive(
Clone,
Debug,
Serialize,
Deserialize,
JsonSchema,
PartialEq,
Eq,
PartialOrd,
Ord,
)]
pub struct Srv {
pub prio: u16,
pub weight: u16,
pub port: u16,
pub target: String,
}

impl From<v1::config::Srv> for Srv {
fn from(other: v1::config::Srv) -> Self {
Srv {
prio: other.prio,
weight: other.weight,
port: other.port,
target: other.target,
}
}
}
iximeow (Member Author)

the other option here is to use the v1::config::Srv type directly in v2, because it really has not changed. weaving the V1/V2 types together seems more difficult to think about generally, but i'm very open to the duplication being more confusing if folks feel that way.

Collaborator

I would probably use the v1 types directly but I can see going either way.
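
for concreteness, a minimal sketch of the "use the v1 types directly" option (the module path is assumed from the conversion above, not taken from the PR):

pub use v1::config::Srv;

// the v2-specific piece that remains is just the DnsRecord conversion:
impl From<Srv> for DnsRecord {
    fn from(srv: Srv) -> Self {
        DnsRecord::Srv(srv)
    }
}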

@iximeow marked this pull request as ready for review May 1, 2025 21:58
davepacheco (Collaborator) left a comment

I haven't had a chance to look at this closely.

@@ -284,30 +335,38 @@ async fn handle_dns_message(
(RecordType::A, DnsRecord::A(_)) => true,
(RecordType::AAAA, DnsRecord::Aaaa(_)) => true,
(RecordType::SRV, DnsRecord::Srv(_)) => true,
(RecordType::NS, DnsRecord::Ns(_)) => true,
(RecordType::SOA, DnsRecord::Soa(_)) => true,
iximeow (Member Author), May 5, 2025

one interesting detail here: RFC 1034 describes negative caching as returning an SOA in an authoritative name error, where the minimum field of the SOA is used as the TTL for the negative result. that's probably one of the more common ways you'd run into an SOA record if you're not operating nameservers yourself. this only returns an SOA record if you've queried about an SOA record.

in the negative caching case we'd want to return this SOA record in the soa* list provided to MessageResponseBuilder. somewhat different kind of plumbing, but it'd be relatively straightforward after this.

* technically RFC 1034 says "may add an SOA RR to the additional [...]", but it's wrong. this is clarified in RFC 2181 §7.1.

Comment on lines 363 to 365
// authoritative for zones. SOA records are in the API types here
// because they are included when reporting this server's records (such
// as via `dns_config_get()`).
Collaborator

Does this mean you cannot GET the config and PUT it right back? That doesn't seem right. (edit: more below.)

//
// Assuming one generation bump every minute, this overflow
// would affect operations after 8,171 years.
let soa_serial = config.generation.as_u64() as u32;
Collaborator

Great comment, but why not use a try_from anyway and produce a 500 if this happens?

iximeow (Member Author)

once the generation goes above u32::MAX we would always 500 on any DNS update (until the generation itself wraps u64). there's nothing that would reset the DNS generation after initial rack setup, right? so even attempting to update DNS records as part of an update would fail - it'd be pretty difficult for a customer to get out of that situation.

Collaborator

I read the comment to be arguing "this is never going to happen in practice", which I agree with. If somehow it did happen, I'd rather it fail explicitly and in a way that's debuggable (e.g., an explicit error in a log somewhere) rather than implicitly and non-fatally (e.g., serial number goes backwards; downstream clients fail to see any subsequent changes).

iximeow (Member Author)

so.. i'd think it's very unlikely to happen in practice, but not impossible - if there was haywire automation against a rack that created and destroyed silos in a loop, the threshold here is more like.. "happens in 25 years"? i don't think there's an operation that would increment the generation number in under ~200ms, so (mostly joking) it almost makes sense to return a 500 if the year is lower than 2045 and this condition occurs, roll over otherwise 🙃

i otherwise agree that failing explicitly would be better to discover and debug. logging every time the rollover occurs (e.g. config.generation.as_u64() > 0 && config.generation.as_u64() as u32 == 0) would be a little better, what do you think? am i weighting the haywire automation risk too high?

Collaborator

I'm afraid that a log message would be very easy to miss. What I'm thinking with the 500 is: someone will notice immediately when this happens and they can follow the failing request directly to the problem.

That said: maybe the better thing here is to change the OpenAPI spec so that the generation is a u32. Most importantly:

  • we'd validate that it's in range during PUT/GET
  • code using the Progenitor client couldn't possibly set a value out of range

This should force the range problem to be dealt with much earlier, at the point where we try to bump the DNS generation number in the first place. That code already has to deal with a range error (whether it does or not, I don't know) -- we're just shrinking that range. That'll be in Nexus, too, so we'll be in a better position to surface that as a fault, once we have richer support for fault reporting.

We could do this in a separate PR. That might also need to change the column of the dns_version table -- I'm not sure. Normally I'd be worried about this sort of change breaking deployed systems, but I really don't think any deployed systems have values large enough for this to be a problem and I think we could just make this change without worrying about it.

Collaborator

You also don't need to block this PR on fixing that. For example, you could do whichever of these in this PR and fix the API in a follow-on PR.

If you do go with rollover, though, beware that the DNS server might not see every generation number so if you just check for generation as u32 == 0, you would miss some rollovers.

iximeow (Member Author)

Nexus is definitely where i'd prefer handling the generation number overflowing too, i like that thought a lot. i'll go with making this a 500 in the DNS server in the meantime. that error path will either go away or move to Nexus in the follow-up, so it wouldn't be a 30-to-8,171-year timebomb in the same way.
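
a minimal sketch of that interim approach (the error type here is a stand-in; how it actually surfaces as a 500 depends on the DNS server's request handling, which isn't shown):

// fail loudly instead of silently wrapping once the config generation no
// longer fits in the 32-bit SOA serial.
#[derive(Debug)]
struct SoaSerialError(String);

fn soa_serial_from_generation(generation: u64) -> Result<u32, SoaSerialError> {
    u32::try_from(generation).map_err(|_| {
        SoaSerialError(format!(
            "DNS generation {generation} exceeds u32::MAX; cannot derive an SOA serial"
        ))
    })
}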

// Assuming one generation bump every minute, this overflow
// would affect operations after 8,171 years.
let soa_serial = config.generation.as_u64() as u32;
apex_records.push(DnsRecord::Soa(
Collaborator

Ah, I see -- so, why store this at all in the database? What about synthesizing it when we load the config from the database? That will avoid the problem I mentioned above (round-trip GET + PUT doesn't work) and could also future-proof us if we decide to change anything about this.

iximeow (Member Author)

when you say "store this at all in the database", you mean, why store the SOA record on disk at all? i do like that on the basis of avoiding the "round-trip doesn't work" problem you noted. though in the same way i don't love the SOA record not existing in CRDB, i don't love the SOA record being entirely synthetic. it's invisible to omdb, it'd be invisible to dnsadm. if support bundles included DNS records, SOA would still be absent!

maybe the thing to do here is create an SOA if one isn't present, and if one is present just replace the serial with config.generation. that corrects the round-trip-doesn't-work issue and means we can debug with more custom SOA records if the server's defaults are no good for some environment.

Collaborator

The more I get into the PR, the more strongly I feel like we shouldn't store the SOA record into the DNS database (or, at least, shouldn't expose it back out the API).

though in the same way i don't love the SOA record not existing in CRDB, i don't love the SOA record being entirely synthetic. it's invisible to omdb, it'd be invisible to dnsadm. if support bundles included DNS records, SOA would still be absent!

All things being equal, I agree. But there's no reason omdb and support bundles can't also report the entire DNS contents (fetched via DNS queries) -- that'd be a better representation of what clients see anyway. So I don't think that should be a constraint on the solution here.

maybe the thing to do here is create an SOA if one isn't present, and if one is present just replace the serial with config.generation. that corrects the round-trip-doesn't-work issue and means we can debug with more custom SOA records if server's defaults are no good for some environment.

This has the opposite round-trip problem: Nexus might PUT configuration C1 with generation number g1, but if it GETs it back, it will find a different configuration C2 with generation number g1. That's a problem on its own (it violates the constraint that if the DnsConfig changes, the generation number must change). (And the DNS server can't bump g1 because it doesn't own that -- that would fork the linear history.) This also introduces a discrepancy between what's in CockroachDB and what's in the DNS config. The database wouldn't have the SOA record, but the actual config would. Worse, if Nexus then wants to create configuration C3 (immediately following C1, as far as it's concerned) with generation number g2, how would the DNS server interpret receiving a request that, relative to the previous generation, is missing the SOA? Is Nexus trying to remove it altogether or did it just not know about it?


Stepping back (sorry I'm sure this restates stuff you know but it's been helpful for me to write this down), the way I'm thinking about this is: right now, there are 1-1 relationships between all of these:

  • the contents of the DNS-related database tables
  • the contents written to the DNS PUT API
  • the contents returned by the DNS GET API
  • the contents of the DNS server's database
  • the DNS records served by the DNS server

(Right? It's been a while since I've been in this code.)

and there are two other core design principles:

  • the abstraction provided by the DNS server is that it's configurable with a monotonically-increasing generation number -- it serves exactly what you configure it with; and
  • changes to the DNS config always originate in Nexus, flow to the database, then the DNS servers (via the PUT API), which update their local storage and the external API

All of this keeps things really simple: Nexus is always free to make any changes it likes, and all these flows are pretty trivial because they never do any conflict resolution, they never silently change any values, or ignore any values, etc -- they just take the data they're given and write to the next thing in the flow. This is important because in the end we have a lot of copies of this data across a few different representations, so if any layer is transforming it (semantically), it's a lot harder to reason about and be sure it's all working right.

It's problematic for any point in this flow to create new DNS data that gets exposed to earlier steps in the flow. You can sort of pick your poison but one way or another this gets really messy. We've seen that with the GET/PUT round-trip problem, the different PUT/GET round-trip problem, the problem where the DNS server can't distinguish between an intentional change made by Nexus vs. a case where Nexus just didn't know about some data that it had made up, etc.

I'd claim that it's fine though for any point in this process to make up DNS data that it doesn't expose to earlier in the process. That doesn't introduce any of these problems.


This might be confusing but I imagine us getting to this design like this:

  • we start by saying all of the config is specified by the DNS API client
  • then we say: we want to add SOA records, but we want the serial number to be filled in implicitly by the generation number because otherwise we're creating a huge pain for clients to try to manage this on their own
  • so you could imagine a design where the client does specify an SOA record, but it just doesn't have a serial_number field -- that's implicitly filled in
  • This is what I imagine we would do if we were fully fleshing this out, namely if we ever wanted the contents of the SOA to be configurable by Nexus.
  • But since we always want exactly one SOA and don't need its contents to be configurable, we didn't bother doing the work to plumb it into CockroachDB, the DNS API, or the DNS storage layer.

If that's too weird, I'd almost rather do the work to put the SOA sans serial number into all those places and have Nexus "fully manage it" (which just means making sure there's always exactly one). I think that'd be a sound approach, just more work than we need to do right now.


The last thing I wanted to mention is future-proofing. Right now, we don't allow the SOA contents to be configured at all and that's fine. If it's completely synthetic, then in the future, we could still allow parts of it to be configured by the API client. If not specified by the API client, we still know how to make up synthetic values. At that point, though, if we're also storing our made-up version in the database, then we won't be able to distinguish an SOA configured by the client from one synthesized by the server. That kind of sucks if we decide later to, say, change the synthesized contents. This might all sound contrived but all I mean to say is that by storing this into the DNS database, we're losing the ability to distinguish "this was written by a client who cares about these values" vs. "the client doesn't care and we should use our defaults [which could change over time]".

iximeow (Member Author)

This has the opposite round-trip problem: ...

yeah, that's a lot thornier than i'd initially considered. and

It's problematic for any point in this flow to create new DNS data that gets exposed to earlier steps in the flow

i hadn't really thought about it as a unidirectional flow, but i see the reasoning. so, i'm convinced that the SOA record really should not be returned through the DNS server's GET.

it probably doesn't make sense to synthesize a SOA record when reading the config, because GETing the config also reads the config in the same way, so we'd have to filter the SOA back out there. i think it works out well as a case in query_raw, where if you query for records at the apex we'll create an additional SOA record on top of whatever's defined in the database.
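
a rough sketch of that query-time shape (the Soa field names and this function's signature are assumed for illustration; only the DnsRecord::Soa variant comes from the PR):

fn apex_records(zone_name: &str, generation: u32, stored: Vec<DnsRecord>) -> Vec<DnsRecord> {
    let mut records = stored;
    // synthesized per-query and never written back to the server's stored
    // config, so the HTTP GET/PUT round-trip stays byte-for-byte identical.
    records.push(DnsRecord::Soa(Soa {
        mname: format!("ns1.{zone_name}"),
        rname: format!("admin.{zone_name}"),
        serial: generation,
        // timer values matching the dig output in the PR description.
        refresh: 3600,
        retry: 600,
        expire: 18000,
        minimum: 150,
    }));
    records
}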

This might all sound contrived

less than you might think :) i've made basically these same arguments before, i had just assumed the only values we might want to configure here really are the administrative contact at some point in the future.

iximeow (Member Author)

okay, the SOA is entirely a query-time construct now. it's no longer in the DNS server's HTTP API.

i still want to digest the overall information flow you described into a comment somewhere closer to the DNS server, because it's a good way to think about the layering and probably will inform me again in the future. but otherwise, i think the API and record data is in better shape. thank you!

Comment on lines +1052 to +1058
// We'll only have external DNS nameserver records - the A/AAAA records
// for servers themselves, and NS records at the apex.
let baseline_external_dns_names = external_dns_count + 1;
assert_eq!(
external_dns_zone.records.len(),
baseline_external_dns_names
);
Collaborator

If I'm following this right, this is asserting that there are 4 external DNS zone records. I don't follow why that's right, though -- don't we have an A record and NS record for each of the 3 servers, so 6 records altogether?

Edit: Oh, is records.len() the number of different names that have any records at all? So it's: the apex (which has the 3 NS records all at one name) + each of the three servers (each with a different name)? If that's it maybe the comment could be a little more explicit. "Although there will be 2 records for each external DNS server, all the NS records will be under one name, so there are only N + 1 entries in the records map."

iximeow (Member Author)

i was pretty confused at first here too, for the same reason. i'd tried writing exactly that assert_eq!(...len(), 6) even! i'll change .records to .names in the v2 type and i think that will make it more legible.

@@ -206,20 +206,38 @@ impl super::Nexus {
);

let silo_name = &request.recovery_silo.silo_name;
let dns_records = request
// Records that should be present at the rack-internal zone apex -
Collaborator

It's important to know that this code path only affects the contents of the first internal DNS zone for newly-initialized racks. None of this will run for deployed racks. I assume that's fine?

(We probably should have an end-to-end test that the internal DNS contents doesn't change after the initial blueprint is executed. We have a similar end-to-end test that generating a new blueprint from the initial one makes no changes.)

Relatedly: should we be using blueprint_internal_dns_config() like you're doing with the external zone below?

iximeow (Member Author)

None of this will run for deployed racks. I assume that's fine?

yeah, the changes in rack.rs are really just in service of minimizing the delta between state just after RSS and what the initial blueprint describes. except that as you noted below, the records don't actually get added to the internal update, which is a bug.

Relatedly: should we be using blueprint_internal_dns_config() like you're doing with the external zone below?

probably! i didn't make that change here because the initial internal DNS records are created by sled-agent, and i'd want to either compare those more closely, or have an end-to-end test like you mention that checks that the records as created by sled-agent are the same as the initial blueprint.

iximeow (Member Author)

should we be using blueprint_internal_dns_config() like you're doing with the external zone below?

in the intervening time i looked a bit more at sled-agent. this seems pretty doable, since sled-agent does produce a blueprint it sends off to Nexus in the handoff request, and it seems reasonable to lift that blueprint generation to earlier in the process and produce DNS records from it. unless you feel strongly about it, though, i'd like to do that as a separate change because this already has grown pretty substantially from what i'd expected to change at once!

.into_iter()
.map(|addr| match addr {
IpAddr::V4(addr) => DnsRecord::A(addr),
IpAddr::V6(addr) => DnsRecord::Aaaa(addr),
})
.collect();

let records = silos
let mut zone_records: Vec<DnsRecord> = Vec::new();
let external_dns_records: Vec<(String, Vec<DnsRecord>)> = dns_external_ips
Collaborator

Should we sort these for consistency?

iximeow (Member Author)

we don't need to, so i'm not sure we'd want to. it seems a bit awkward that lower-numbered ns{} records would correlate with lower-numbered IP addresses, and that would mean that SOA records are biased towards the DNS server with the lowest IP.

maybe that's fine though?

Collaborator

The only thing I'm trying to avoid is having the contents of NS records change because of some non-determinism here, causing either blueprint or DNS generation bumps under normal conditions even when there have been no changes. I think that's important to avoid, but any deterministic order seems fine.
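
one deterministic option, sketched here with assumed names rather than the PR's actual code: sort the nameserver addresses before assigning the ns1/ns2/... labels, so repeated planning runs produce identical A/AAAA and NS records and never bump a generation without a real change.

use std::net::IpAddr;

fn nameserver_names(mut dns_external_ips: Vec<IpAddr>) -> Vec<(String, IpAddr)> {
    // any stable order works; sorting the addresses is just the simplest one.
    dns_external_ips.sort();
    dns_external_ips
        .into_iter()
        .enumerate()
        .map(|(i, addr)| (format!("ns{}", i + 1), addr))
        .collect()
}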

@@ -20,6 +20,17 @@ pub const DNS_ZONE: &str = "control-plane.oxide.internal";
/// development
pub const DNS_ZONE_EXTERNAL_TESTING: &str = "oxide-dev.test";

/// Label for records associated with a zone itself, rather than any names
Collaborator

Is this only used as a key into the records map?

iximeow (Member Author)

yes, though that map makes it out to the DNS server's HTTP API, so @ shows up in a few places internally. i'd meant to ask: do you have opinions about @ in particular, or a special label for these records in general?

@@ -4,9 +4,12 @@ load-example --seed test_expunge_newly_added_external_dns

blueprint-show 3f00b694-1b16-4aaa-8f78-e6b3a527b434
blueprint-edit 3f00b694-1b16-4aaa-8f78-e6b3a527b434 expunge-zone 9995de32-dd52-4eb1-b0eb-141eb84bc739
blueprint-diff 3f00b694-1b16-4aaa-8f78-e6b3a527b434 366b0b68-d80e-4bc1-abd3-dc69837847e0
iximeow (Member Author)

unfortunately, between the diff size and having conflicting changes on main, i had a hard time keeping the output as the more legible "file moved and now has some additional lines". instead, git shows the diff as a fully new file even though it's mostly the prior content.

blueprint-diff includes the DNS output though, which is of course what i actually care about here. if this is a bear to review (and i'm pretty empathetic to it being a lot) i'm open to moving the DNS checking over to a new test and leaving this unchanged, or moving the internal DNS testing to live in this test as well.


blueprint-show 62422356-97cd-4e0f-bd17-f946c25193c1
blueprint-edit 62422356-97cd-4e0f-bd17-f946c25193c1 expunge-zone 3fc76516-d258-48bc-b25e-9fca5e37c888
blueprint-diff 62422356-97cd-4e0f-bd17-f946c25193c1 14b8ff1c-91ff-4ab7-bb64-3c0f5f642e09
iximeow (Member Author)

this one surprised me and i've added this diff to reiterate that for testing: internal DNS zones are not replaced simply as a result of being expunged, since we might need to reuse the IP that server was listening on. for internal DNS in particular, the expunged zone must be ready_for_cleanup. i don't know concretely what that means (sled-agent did a collection and saw the zone is gone?), but that's a critical step in actually seeing DNS changes in the diff below.

Contributor

i don't know concretely what that means (sled-agent did a collection and saw the zone is gone?)

Almost! Reconfigurator will mark a zone ready for cleanup during planning if in the most recent inventory collection, sled-agent reported:

  • the zone is gone
  • the generation of the sled's config is >= the generation in which the zone was expunged (to avoid a race where the zone is gone because it hasn't even started yet)

iximeow added 5 commits May 22, 2025 20:33
on one hand: now that DNS servers are referenced by potentially two
different AAAA records, both of those records are potentially the target
of a SRV record. though, we don't have SRV records for the DNS
interface. this test had failed at first because we'd find a DNS
server's IP via the `ns1.` record, which means we'd miss that the same
zone was referenced by an AAAA record for the illumos zone UUID.

on the other hand: #[nexus_test] environments involve a mock of the
initial RSS environment construction? so now that the first blueprint
adds NS records, this mock RSS environment was out of date, and a test
that the first blueprint after "RSS" makes no change failed because the
"RSS" environment was wrong.
Labels
release notes (reminder to include this in the release notes)
3 participants