Conversation

@jmpesp (Contributor) commented Dec 10, 2025

A local storage allocation is defined as a slice of the local storage dataset for a U2. This commit adds the logic to create local storage allocations during instance allocation. This is done by modifying the sled reservation query to:

  • find all the local storage disks attached to an instance
  • determine which of those do not have an allocation yet
  • match the requests for local storage allocation to zpools with available space, and find a valid configuration of allocations (a rough sketch of this matching step follows the list)
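
For intuition, here is a rough, standalone sketch of the matching step from the last bullet. The types and values are stand-ins rather than the Nexus schema, and the real logic runs inside the modified sled reservation query itself:

```rust
use std::cmp::Reverse;

#[derive(Clone, Copy)]
struct AllocationRequest {
    disk_id: u64,
    size: u64,
}

#[derive(Clone, Copy)]
struct CandidateZpool {
    pool_id: u64,
    // Space left after existing Crucible dataset usage and the control
    // plane storage buffer are accounted for.
    available: u64,
}

/// Decide whether every request can be placed on its own zpool with
/// enough space, returning (disk_id, pool_id) pairs if so.
fn match_requests(
    mut requests: Vec<AllocationRequest>,
    mut pools: Vec<CandidateZpool>,
) -> Option<Vec<(u64, u64)>> {
    if pools.len() < requests.len() {
        return None;
    }
    requests.sort_by_key(|r| Reverse(r.size));
    pools.sort_by_key(|p| Reverse(p.available));

    // Pair the i-th largest request with the i-th largest pool: if any
    // such pair doesn't fit, no assignment onto distinct pools fits.
    requests
        .iter()
        .zip(&pools)
        .map(|(r, p)| (r.size <= p.available).then_some((r.disk_id, p.pool_id)))
        .collect()
}
```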

These query changes take into account existing Crucible dataset usage, the control plane storage buffer, and the tombstone status of the local storage dataset.

`instance_ensure_registered` now fills in the `delegated_zvols` entries for local storage disks.

During the instance start saga, after the instance's state has been changed to `starting`, list all the local storage disks, grab their allocations, and send ensure requests to sled agents for those local storage allocations. This occurs after the VMM record has been created, which means that either each of those local storage disks now has an allocation or the reservation failed.
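
A minimal sketch of the shape of that step, assuming made-up trait and type names (`Datastore`, `SledAgent`, etc. are illustrations here, not the real Nexus interfaces):

```rust
// Stand-in types and traits, invented for illustration; the real saga
// node goes through the Nexus datastore and the sled-agent client.
#[derive(Clone, Copy)]
struct DiskId(u64);

struct LocalStorageAllocation {
    // dataset id, zvol name, and target sled in the real record
}

trait Datastore {
    fn local_storage_disks(&self, instance_id: u64) -> Vec<DiskId>;
    fn local_storage_allocation(&self, disk: DiskId) -> Option<LocalStorageAllocation>;
}

trait SledAgent {
    fn ensure_local_storage(&self, alloc: &LocalStorageAllocation) -> Result<(), String>;
}

// Runs after the VMM record exists and the instance is `starting`, so a
// missing allocation here means the reservation should already have failed.
fn ensure_local_storage_sketch(
    db: &impl Datastore,
    sled: &impl SledAgent,
    instance_id: u64,
) -> Result<(), String> {
    for disk in db.local_storage_disks(instance_id) {
        let alloc = db
            .local_storage_allocation(disk)
            .ok_or_else(|| "missing local storage allocation".to_string())?;
        sled.ensure_local_storage(&alloc)?;
    }
    Ok(())
}
```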

One of the things that this PR also introduces is a repetition syntax for creating a batch of declared saga actions:

```
ENSURE_LOCAL_STORAGE (0, 1, 2, 3, 4, 5, 6, 7) -> "ensure_local_storage" {
    + sis_ensure_local_storage
    // No undo action for this, that is handled in the disk delete saga!
}
```

During the disk delete saga, local storage disks will go through a new path to clean up those allocations, involving a deallocation in the database and a delete request to the appropriate sled agent.
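
A minimal sketch of that cleanup path, with invented helper names and the ordering taken from the description above (deallocate in the database, then tell the sled agent); the real saga nodes handle idempotency and unwinding:

```rust
// Stand-in types and helpers, invented for illustration only.
struct LocalStorageAllocation {
    // sled id plus dataset/zvol details in the real record
}

// Step one: deallocate in the database, returning what (if anything) had
// been allocated for this disk.
fn deallocate_local_storage(_disk_id: u64) -> Option<LocalStorageAllocation> {
    unimplemented!("datastore call in the real saga node")
}

// Step two: ask the sled agent that held the allocation to delete the
// delegated dataset/zvol.
fn delete_on_sled_agent(_alloc: &LocalStorageAllocation) -> Result<(), String> {
    unimplemented!("sled-agent request in the real saga node")
}

fn disk_delete_local_storage_cleanup(disk_id: u64) -> Result<(), String> {
    if let Some(alloc) = deallocate_local_storage(disk_id) {
        delete_on_sled_agent(&alloc)?;
    }
    Ok(())
}
```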

@jmpesp requested a review from hawkw December 10, 2025 19:45

@ahl (Contributor) left a comment

Is it fair to say that instance allocation may be a place where we focus optimization in the future? It seems like layering in constraints may make this code quadratically more complex. That said, I think it's perfect for where we are now.

good stuff

```
};

// If any local storage disks have been allocated already, then this
// constraints VMM placement and where other unallocated local storage
```

Contributor:

Suggested change:

```diff
- // constraints VMM placement and where other unallocated local storage
+ // constrains VMM placement and where other unallocated local storage
```

Comment on lines 540 to 541
```
// It's an error for multiple sleds to host local storage
// disks, that makes no sense!
```

Contributor:

Would this reflect a failure to uphold some constraint? Or is this a scenario we could encounter... somehow?

Contributor Author:

Yes, but it's also a scenario that client requests can place us in if a user detaches a local storage disk with an allocation that's on sled A, then attaches it to an instance that has attached local storage with existing allocations on sled B. I've put a comment in c67d8e9 and changed this error to a conflict instead. There's some follow up work to do here.
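
To make the scenario concrete, a rough sketch of the resulting check, assuming `local_storage_allocation_sleds` is the set of sleds that already hold allocations for this instance's local storage disks (the error type and wording are placeholders):

```rust
// Illustrative only, with a plain integer standing in for the sled id
// type: if the local storage disks attached to one instance already carry
// allocations on more than one sled, no single sled can host the VMM, so
// the query reports a conflict to the client rather than an internal error.
use std::collections::BTreeSet;

fn check_single_allocation_sled(
    local_storage_allocation_sleds: &BTreeSet<u64>,
) -> Result<(), String> {
    if local_storage_allocation_sleds.len() > 1 {
        return Err(
            "local storage allocations for this instance span multiple sleds"
                .to_string(),
        );
    }
    Ok(())
}
```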

Comment on lines +551 to +565
```
// If there's already a list of sleds to use, that must be
// filtered further. Otherwise return the one sled that local
// storage disks have already been allocated on.
match maybe_must_use_sleds {
    Some(must_use_sleds) => Some(
        must_use_sleds
            .into_iter()
            .filter(|sled| {
                local_storage_allocation_sleds.contains(&sled)
            })
            .collect(),
    ),

    None => Some(local_storage_allocation_sleds),
}
```

Contributor:

I think I may have gotten confused. I expected this to basically say "does `maybe_must_use_sleds` contain the (exactly one) item from `local_storage_allocation_sleds`? -- if it doesn't, that's going to be an error."

```
    maybe_must_use_sleds
};

// If filtering has removed all the sleds from the constraint, then we
```

Contributor:

can this only occur after the filter at 558 or is there another path that might encounter this condition?

Contributor Author:

It only occurs after the filter; prior to this commit, nothing filtered the `must_use_sleds`.

Comment on lines +740 to +742
```
// If there's an existing local storage allocation on a zpool,
// remove that from the list of candidates. Local storage for
// the same Instance should not share any zpools.
```

Contributor:

Can you explain your thinking here? So if someone asks for an instance with, say, 10 100GB local disks, distributing those allocations does maximize IOPS--it also maximizes noisy neighbor opportunities! Perhaps this is theoretical for the current demand, but we might want to think this over.

Contributor Author:

Yeah, there's more work to be done here. For the first iteration of this we could suggest users use the existing instance affinity mechanics to constrain placement.


```
builder.append(associate_ssh_keys_action());

// Helper function for appending subsagas to our parent saga.
```

Contributor:

moving this for hygiene?

Contributor Author:

Yep!

@hawkw (Member) commented Dec 10, 2025

@jmpesp and I pair-reviewed a majority of this on a call today, and (among other smaller things) we determined that it will be necessary to change the disk attach operation to ensure that local storage has not been allocated to an instance (during a start saga) when additional local disks are attached; otherwise, it may be possible to create an instance that can never be started (because it has local disks on a sled that lacks the space to support an additional local disk).

@jmpesp (Contributor Author) commented Dec 11, 2025

> @jmpesp and I pair-reviewed a majority of this on a call today, and (among other smaller things) we determined that it will be necessary to change the disk attach operation to ensure that local storage has not been allocated to an instance (during a start saga) when additional local disks are attached; otherwise, it may be possible to create an instance that can never be started (because it has local disks on a sled that lacks the space to support an additional local disk).

this is done in e9e9f57

@hawkw (Member) left a comment

Thus far, I have mostly looked at the sled allocation query changes. Overall, these make sense to me, especially based on our discussion yesterday. I have a lot more files that I also need to review, but wanted to leave my minor suggestions on just the bit I've read thoroughly for now.

```
// Attaching a local storage disk to the instance has to be blocked
// if sled reservation has occured for this instance: local storage
// allocation records are only created during sled reservation, and
// importantly a particular configuration of local storage
```

Member:

nit, sorry:

Suggested change:

```diff
- // importantly a particular configuration of local storage
+ // importantly, a particular configuration of local storage
```

```
// allocations for an instance are only _validated_ during the sled
// reservation.
//
// The instance start saga perform sled reservation, creates the
```

Member:

typo:

Suggested change:

```diff
- // The instance start saga perform sled reservation, creates the
+ // The instance start saga performs sled reservation, creates the
```

Comment on lines +802 to +804
```
// The instance start saga perform sled reservation, creates the
// corresponding VMM record, and changes the instance's runtime
// state all in _separate saga nodes_. This means that there could
```

Member:

take it or leave it: it may perhaps be worth noting something here about why this has to happen like that, so the future reader doesn't just think "well why not just change that part?". i recall that we went down that path briefly yesterday.

```
// reservation query and changing the instance's state.
//
// If a client attaches a local storage disk to an instance after
// sled reservation occurs but before the instance's start moves to
```

Member:

nit, perhaps:

Suggested change:

```diff
- // sled reservation occurs but before the instance's start moves to
+ // sled reservation occurs but before the instance's start saga moves to
```

```
//
// - if an allocation does not already exist for the local storage
// disk, the instance_start saga will fail (and unwind) when
// trying to ensure that the allocation's dataset and zvol exist
```

Member:

Suggested change:

```diff
- // trying to ensure that the allocation's dataset and zvol exist
+ // trying to ensure that the allocation's dataset and zvol exist,
```

Comment on lines +820 to +821
```
// Not enough zpools to satisfy the number of allocations
// required. Find another sled!
```

Member:

i feel like this is a decision that ought to be logged (perhaps importanter than printing the list of zpools, tbh)
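
For what it's worth, a log line in the slog style this query already uses elsewhere (see the `info!` call quoted further down) might look something like the following; the message and field name are only a suggestion, and `log` / `sled_target` are assumed to be the bindings from the surrounding query code:

```rust
// Hypothetical wording and field name; relies on the surrounding
// `log` and `sled_target` bindings and the slog `info!` macro.
info!(
    &log,
    "not enough zpools to satisfy local storage allocations; trying another sled";
    "sled" => ?sled_target,
);
```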

Comment on lines +882 to +887
```
// if there's no local storage datasets on this sled for
// this request's size, then try another sled.
sled_targets.remove(&sled_target);
banned.remove(&sled_target);
unpreferred.remove(&sled_target);
preferred.remove(&sled_target);
```

Member:

this also feels like a place where i would expect us to log "skipping this sled because it has no candidate datasets" or some such?

Comment on lines +927 to +933
```
.map(|zpool_get_result| CandidateDataset {
    rendezvous_local_storage_dataset_id:
        zpool_get_result
            .rendezvous_local_storage_dataset_id,
    pool_id: zpool_get_result.pool.id(),
    sled_id: zpool_get_result.pool.sled_id(),
})
```

Member:

teensy nit, feel free to disregard me: there are a few `ZpoolGetResult` to `CandidateDataset` conversions in here, mayhaps we should have a function that does this?
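
One possible shape for that helper, hedged on the actual names in the query (this assumes the result type is spelled `ZpoolGetResult` and is converted by value, and that `CandidateDataset` has exactly the fields visible in the snippet above):

```rust
// Sketch only; call sites could then write `.map(CandidateDataset::from)`.
impl From<ZpoolGetResult> for CandidateDataset {
    fn from(r: ZpoolGetResult) -> Self {
        CandidateDataset {
            rendezvous_local_storage_dataset_id: r.rendezvous_local_storage_dataset_id,
            pool_id: r.pool.id(),
            sled_id: r.pool.sled_id(),
        }
    }
}
```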

Comment on lines +1057 to +1061
```
info!(
    &log,
    "calling insert with local storage allocations";
    "allocations" => ?allocations,
);
```

Member:

does this also say what sled we are trying to allocate on? does allocations include that?

Comment on lines +1076 to +1079
```
    info!(&log, "reservation succeeded!");
    return Ok(resource);
}
info!(&log, "reservation failed");
```

Member:

it's too bad that we don't get to say the allocations again here but i see why we don't
