Conversation

@jmpesp (Contributor) commented Dec 10, 2025

A local storage allocation is defined as a slice of the local storage dataset for a U2. This commit adds the logic to create local storage allocations during instance allocation. This is done by modifying the sled reservation query to:

  • find all the local storage disks attached to an instance
  • determine which of those do not have an allocation yet
  • match the requests for local storage allocation to zpools with available space, and find a valid configuration of allocations (a rough sketch of this matching step follows the list)
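
For intuition, here is a rough, standalone sketch of the matching step from the last bullet. The types and values are stand-ins rather than the Nexus schema, and the real logic runs inside the modified sled reservation query itself:

```rust
use std::cmp::Reverse;

#[derive(Clone, Copy)]
struct AllocationRequest {
    disk_id: u64,
    size: u64,
}

#[derive(Clone, Copy)]
struct CandidateZpool {
    pool_id: u64,
    // Space left after existing Crucible dataset usage and the control
    // plane storage buffer are accounted for.
    available: u64,
}

/// Decide whether every request can be placed on its own zpool with
/// enough space, returning (disk_id, pool_id) pairs if so.
fn match_requests(
    mut requests: Vec<AllocationRequest>,
    mut pools: Vec<CandidateZpool>,
) -> Option<Vec<(u64, u64)>> {
    if pools.len() < requests.len() {
        return None;
    }
    requests.sort_by_key(|r| Reverse(r.size));
    pools.sort_by_key(|p| Reverse(p.available));

    // Pair the i-th largest request with the i-th largest pool: if any
    // such pair doesn't fit, no assignment onto distinct pools fits.
    requests
        .iter()
        .zip(&pools)
        .map(|(r, p)| (r.size <= p.available).then_some((r.disk_id, p.pool_id)))
        .collect()
}
```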

These query changes take into account existing Crucible dataset usage, the control plane storage buffer, and the tombstone status of the local storage dataset.

`instance_ensure_registered` now fills in the `delegated_zvols` entries for local storage disks.

During the instance start saga, after the instance's state has been changed to `starting`, list all the local storage disks, grab their allocations, and send ensure requests to sled agents for those local storage allocations. This occurs after the VMM record has been created, which means that either each of those local storage disks now has an allocation or the reservation failed.
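
A minimal sketch of the shape of that step, assuming made-up trait and type names (`Datastore`, `SledAgent`, etc. are illustrations here, not the real Nexus interfaces):

```rust
// Stand-in types and traits, invented for illustration; the real saga
// node goes through the Nexus datastore and the sled-agent client.
#[derive(Clone, Copy)]
struct DiskId(u64);

struct LocalStorageAllocation {
    // dataset id, zvol name, and target sled in the real record
}

trait Datastore {
    fn local_storage_disks(&self, instance_id: u64) -> Vec<DiskId>;
    fn local_storage_allocation(&self, disk: DiskId) -> Option<LocalStorageAllocation>;
}

trait SledAgent {
    fn ensure_local_storage(&self, alloc: &LocalStorageAllocation) -> Result<(), String>;
}

// Runs after the VMM record exists and the instance is `starting`, so a
// missing allocation here means the reservation should already have failed.
fn ensure_local_storage_sketch(
    db: &impl Datastore,
    sled: &impl SledAgent,
    instance_id: u64,
) -> Result<(), String> {
    for disk in db.local_storage_disks(instance_id) {
        let alloc = db
            .local_storage_allocation(disk)
            .ok_or_else(|| "missing local storage allocation".to_string())?;
        sled.ensure_local_storage(&alloc)?;
    }
    Ok(())
}
```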

One of the things that this PR also introduces is a repetition syntax for creating a batch of declared saga actions:

```
ENSURE_LOCAL_STORAGE (0, 1, 2, 3, 4, 5, 6, 7) -> "ensure_local_storage" {
    + sis_ensure_local_storage
    // No undo action for this, that is handled in the disk delete saga!
}
```

During the disk delete saga, local storage disks will go through a new path to clean up those allocations, involving a deallocation in the database and a delete request to the appropriate sled agent.
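
A minimal sketch of that cleanup path, with invented helper names and the ordering taken from the description above (deallocate in the database, then tell the sled agent); the real saga nodes handle idempotency and unwinding:

```rust
// Stand-in types and helpers, invented for illustration only.
struct LocalStorageAllocation {
    // sled id plus dataset/zvol details in the real record
}

// Step one: deallocate in the database, returning what (if anything) had
// been allocated for this disk.
fn deallocate_local_storage(_disk_id: u64) -> Option<LocalStorageAllocation> {
    unimplemented!("datastore call in the real saga node")
}

// Step two: ask the sled agent that held the allocation to delete the
// delegated dataset/zvol.
fn delete_on_sled_agent(_alloc: &LocalStorageAllocation) -> Result<(), String> {
    unimplemented!("sled-agent request in the real saga node")
}

fn disk_delete_local_storage_cleanup(disk_id: u64) -> Result<(), String> {
    if let Some(alloc) = deallocate_local_storage(disk_id) {
        delete_on_sled_agent(&alloc)?;
    }
    Ok(())
}
```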

@jmpesp requested a review from hawkw December 10, 2025 19:45

@ahl (Contributor) left a comment

Is it fair to say that instance allocation may be a place where we focus optimization in the future? It seems like layering in constraints may make this code quadratically more complex. That said, I think it's perfect for where we are now.

good stuff

```
};

// If any local storage disks have been allocated already, then this
// constraints VMM placement and where other unallocated local storage
```

Contributor:

Suggested change:

```diff
- // constraints VMM placement and where other unallocated local storage
+ // constrains VMM placement and where other unallocated local storage
```

Comment on lines 540 to 541
```
// It's an error for multiple sleds to host local storage
// disks, that makes no sense!
```

Contributor:

Would this reflect a failure to uphold some constraint? Or is this a scenario we could encounter... somehow?

Contributor Author:

Yes, but it's also a scenario that client requests can place us in if a user detaches a local storage disk with an allocation that's on sled A, then attaches it to an instance that has attached local storage with existing allocations on sled B. I've put a comment in c67d8e9 and changed this error to a conflict instead. There's some follow up work to do here.
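
To make the scenario concrete, a rough sketch of the resulting check, assuming `local_storage_allocation_sleds` is the set of sleds that already hold allocations for this instance's local storage disks (the error type and wording are placeholders):

```rust
// Illustrative only, with a plain integer standing in for the sled id
// type: if the local storage disks attached to one instance already carry
// allocations on more than one sled, no single sled can host the VMM, so
// the query reports a conflict to the client rather than an internal error.
use std::collections::BTreeSet;

fn check_single_allocation_sled(
    local_storage_allocation_sleds: &BTreeSet<u64>,
) -> Result<(), String> {
    if local_storage_allocation_sleds.len() > 1 {
        return Err(
            "local storage allocations for this instance span multiple sleds"
                .to_string(),
        );
    }
    Ok(())
}
```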

Comment on lines +551 to +565
```
// If there's already a list of sleds to use, that must be
// filtered further. Otherwise return the one sled that local
// storage disks have already been allocated on.
match maybe_must_use_sleds {
    Some(must_use_sleds) => Some(
        must_use_sleds
            .into_iter()
            .filter(|sled| {
                local_storage_allocation_sleds.contains(&sled)
            })
            .collect(),
    ),

    None => Some(local_storage_allocation_sleds),
}
```

Contributor:

I think I may have gotten confused. I expected this to basically say "does `maybe_must_use_sleds` contain the (exactly one) item from `local_storage_allocation_sleds`? -- if it doesn't, that's going to be an error."

```
    maybe_must_use_sleds
};

// If filtering has removed all the sleds from the constraint, then we
```

Contributor:

can this only occur after the filter at 558 or is there another path that might encounter this condition?

Contributor Author:

It only occurs after the filter; prior to this commit, nothing filtered the `must_use_sleds`.

Comment on lines +740 to +742
```
// If there's an existing local storage allocation on a zpool,
// remove that from the list of candidates. Local storage for
// the same Instance should not share any zpools.
```

Contributor:

Can you explain your thinking here? So if someone asks for an instance with, say, 10 100GB local disks, distributing those allocations does maximize IOPS--it also maximizes noisy neighbor opportunities! Perhaps this is theoretical for the current demand, but we might want to think this over.

Contributor Author:

Yeah, there's more work to be done here. For the first iteration of this we could suggest users use the existing instance affinity mechanics to constrain placement.


```
builder.append(associate_ssh_keys_action());

// Helper function for appending subsagas to our parent saga.
```

Contributor:

moving this for hygiene?

Contributor Author:

Yep!

@hawkw (Member) commented Dec 10, 2025

@jmpesp and I pair-reviewed a majority of this on a call today, and (among other smaller things) we determined that it will be necessary to change the disk attach operation to ensure that local storage has not been allocated to an instance (during a start saga) when additional local disks are attached; otherwise, it may be possible to create an instance that can never be started (because it has local disks on a sled that lacks the space to support an additional local disk).

@jmpesp (Contributor Author) commented Dec 11, 2025

> @jmpesp and I pair-reviewed a majority of this on a call today, and (among other smaller things) we determined that it will be necessary to change the disk attach operation to ensure that local storage has not been allocated to an instance (during a start saga) when additional local disks are attached; otherwise, it may be possible to create an instance that can never be started (because it has local disks on a sled that lacks the space to support an additional local disk).

this is done in e9e9f57

@hawkw (Member) left a comment

Thus far, I have mostly looked at the sled allocation query changes. Overall, these make sense to me, especially based on our discussion yesterday. I have a lot more files that I also need to review, but wanted to leave my minor suggestions on just the bit I've read thoroughly for now.

```
// Attaching a local storage disk to the instance has to be blocked
// if sled reservation has occured for this instance: local storage
// allocation records are only created during sled reservation, and
// importantly a particular configuration of local storage
```

Member:

nit, sorry:

Suggested change:

```diff
- // importantly a particular configuration of local storage
+ // importantly, a particular configuration of local storage
```

```
// allocations for an instance are only _validated_ during the sled
// reservation.
//
// The instance start saga perform sled reservation, creates the
```

Member:

typo:

Suggested change:

```diff
- // The instance start saga perform sled reservation, creates the
+ // The instance start saga performs sled reservation, creates the
```

Comment on lines +802 to +804
```
// The instance start saga perform sled reservation, creates the
// corresponding VMM record, and changes the instance's runtime
// state all in _separate saga nodes_. This means that there could
```

Member:

take it or leave it: it may perhaps be worth noting something here about why this has to happen like that, so the future reader doesn't just think "well why not just change that part?". i recall that we went down that path briefly yesterday.

```
// reservation query and changing the instance's state.
//
// If a client attaches a local storage disk to an instance after
// sled reservation occurs but before the instance's start moves to
```

Member:

nit, perhaps:

Suggested change:

```diff
- // sled reservation occurs but before the instance's start moves to
+ // sled reservation occurs but before the instance's start saga moves to
```

```
//
// - if an allocation does not already exist for the local storage
// disk, the instance_start saga will fail (and unwind) when
// trying to ensure that the allocation's dataset and zvol exist
```

Member:

Suggested change:

```diff
- // trying to ensure that the allocation's dataset and zvol exist
+ // trying to ensure that the allocation's dataset and zvol exist,
```

Comment on lines +820 to +821
```
// Not enough zpools to satisfy the number of allocations
// required. Find another sled!
```

Member:

i feel like this is a decision that ought to be logged (perhaps importanter than printing the list of zpools, tbh)
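
For what it's worth, a log line in the slog style this query already uses elsewhere (see the `info!` call quoted further down) might look something like the following; the message and field name are only a suggestion, and `log` / `sled_target` are assumed to be the bindings from the surrounding query code:

```rust
// Hypothetical wording and field name; relies on the surrounding
// `log` and `sled_target` bindings and the slog `info!` macro.
info!(
    &log,
    "not enough zpools to satisfy local storage allocations; trying another sled";
    "sled" => ?sled_target,
);
```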

Comment on lines +882 to +887
```
// if there's no local storage datasets on this sled for
// this request's size, then try another sled.
sled_targets.remove(&sled_target);
banned.remove(&sled_target);
unpreferred.remove(&sled_target);
preferred.remove(&sled_target);
```

Member:

this also feels like a place where i would expect us to log "skipping this sled because it has no candidate datasets" or some such?

Comment on lines +927 to +933
```
.map(|zpool_get_result| CandidateDataset {
    rendezvous_local_storage_dataset_id:
        zpool_get_result
            .rendezvous_local_storage_dataset_id,
    pool_id: zpool_get_result.pool.id(),
    sled_id: zpool_get_result.pool.sled_id(),
})
```

Member:

teensy nit, feel free to disregard me: there are a few `ZpoolGetResult` to `CandidateDataset` conversions in here, mayhaps we should have a function that does this?
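
One possible shape for that helper, hedged on the actual names in the query (this assumes the result type is spelled `ZpoolGetResult` and is converted by value, and that `CandidateDataset` has exactly the fields visible in the snippet above):

```rust
// Sketch only; call sites could then write `.map(CandidateDataset::from)`.
impl From<ZpoolGetResult> for CandidateDataset {
    fn from(r: ZpoolGetResult) -> Self {
        CandidateDataset {
            rendezvous_local_storage_dataset_id: r.rendezvous_local_storage_dataset_id,
            pool_id: r.pool.id(),
            sled_id: r.pool.sled_id(),
        }
    }
}
```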

Comment on lines +1057 to +1061
```
info!(
    &log,
    "calling insert with local storage allocations";
    "allocations" => ?allocations,
);
```

Member:

does this also say what sled we are trying to allocate on? does allocations include that?

Comment on lines +1076 to +1079
```
    info!(&log, "reservation succeeded!");
    return Ok(resource);
}
info!(&log, "reservation failed");
```

Member:

it's too bad that we don't get to say the allocations again here but i see why we don't
