Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddm-admin-client wraps calls with tokio::spawn-ed tasks that are never awaited #7378

Open
smklein opened this issue Jan 21, 2025 · 3 comments · May be fixed by #7507
Open

ddm-admin-client wraps calls with tokio::spawn-ed tasks that are never awaited #7378

smklein opened this issue Jan 21, 2025 · 3 comments · May be fixed by #7507
Labels
bug Something that isn't working. networking Related to the networking. Sled Agent Related to the Per-Sled Configuration and Management

Comments

@smklein
Copy link
Collaborator

smklein commented Jan 21, 2025

Within https://github.com/oxidecomputer/omicron/blob/main/clients/ddm-admin-client/src/lib.rs , we have a handful of tasks which are tokio::spawn-ed, and attempt to contact maghemite to configure state.

In a world where this state is purely additive, this may not be an issue - it requests that the configuration "eventually get propagated" to maghemite. However, in a world where we actually try to remove routes (e.g., after expunging a zone - see #7377), this code introduces problems.

It's possible that we do the following:

  • (tokio task) Start advertising a prefix
  • (tokio task) Fail, wait in retry loop
  • (elsewhere) Stop advertising prefix
  • (tokio task) Start advertising the same prefix, which we should not be advertising anymore

(aside: Due to the scheduling of tokio tasks, we actually don't even need to "fail" here, the scheduling of the tokio task could just be delayed, resulting in a call to DELETE before the call to CREATE the prefixes)

@smklein smklein added bug Something that isn't working. networking Related to the networking. Sled Agent Related to the Per-Sled Configuration and Management labels Jan 21, 2025
@davepacheco
Copy link
Collaborator

In discussion today we decided this is unlikely to affect sleds' bootstrap or underlay IPs (because we don't ever change those today) but could affect internal DNS IPs. Still, it would require a sequence where we decide to advertise a prefix, that fails, then we decide not to advertise it any more. For example, if we:

  • add an internal DNS zone on a sled
  • we fail to advertise that prefix
  • then we expunge that internal DNS zone
  • then we add an internal DNS zone with the same IP to a different sled

I think that's the sequence that would be required to create a problem (and if that happened, you'd see a similar impact to #7377).

@smklein
Copy link
Collaborator Author

smklein commented Jan 21, 2025

I think it's possible this happens even if we don't fail to advertise the prefix, but we are just slow to schedule the new spawned tokio task.

However, I admit, the timing is still tight for this to be possible

@davepacheco
Copy link
Collaborator

You're right -- the sequence is more like:

  • we start booting an internal DNS zone on a sled for the first time since sled boot (either because it was added or because the sled just rebooted)
  • before we successfully advertise the prefix, we receive a request to expunge the zone and, assuming we've fixed [sled agent] Zone initialization causes maghemite advertisement, but nothing stops this on zone teardown #7377, we've gone and deleted the prefix
  • then this tokio task runs again and re-creates the prefix
  • meanwhile, an internal DNS zone with the same IP is placed elsewhere, comes up, and clients try to talk to it, but reach this one and potentially get a wrong answer

If we do #7377 with a reconciler loop, then I think this problem should correct itself, but it remains a problem and could cause weird transient behavior.

@jgallagher jgallagher linked a pull request Feb 7, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working. networking Related to the networking. Sled Agent Related to the Per-Sled Configuration and Management
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants