Local storage 3/4: Allocate and deallocate #9499
Conversation
A local storage allocation is defined as a slice of the local storage
dataset for a U2. This commit adds the logic to create local storage
dataset allocations during instance allocation. This is done by
modifying the sled reservation query to:
- find all the local storage disks attached to an instance
- determine which of those do not have an allocation yet
- match the requests for local storage allocation to zpools with
available space, and find a valid configuration of allocations
These query changes take into account existing Crucible dataset usage,
the control plane storage buffer, and the tombstone status of the local
storage dataset.
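For intuition only, here is a rough sketch of the matching step using hypothetical types and a simple first-fit strategy; the actual selection happens inside the sled reservation query itself and may be smarter than first-fit:

```
use std::collections::HashSet;

// Hypothetical stand-ins for what the reservation query works with.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ZpoolId(u64);

struct CandidateZpool {
    id: ZpoolId,
    total_size: u64,
    crucible_dataset_usage: u64,
    control_plane_storage_buffer: u64,
    local_storage_dataset_tombstoned: bool,
}

impl CandidateZpool {
    // Space left for local storage after existing Crucible usage and the
    // control plane storage buffer are subtracted.
    fn available(&self) -> u64 {
        self.total_size
            .saturating_sub(self.crucible_dataset_usage)
            .saturating_sub(self.control_plane_storage_buffer)
    }
}

// First-fit assignment of local storage requests (sizes in bytes) to
// candidate zpools; returns None if no valid configuration exists.
fn assign_local_storage(
    requests: &[u64],
    candidates: &[CandidateZpool],
    zpools_already_used_by_instance: &HashSet<ZpoolId>,
) -> Option<Vec<ZpoolId>> {
    let mut used = zpools_already_used_by_instance.clone();
    let mut remaining: Vec<(ZpoolId, u64)> = candidates
        .iter()
        .filter(|c| !c.local_storage_dataset_tombstoned)
        .map(|c| (c.id, c.available()))
        .collect();

    let mut assignment = Vec::with_capacity(requests.len());
    for &request in requests {
        // Local storage for the same instance should not share a zpool.
        let slot = remaining
            .iter()
            .position(|(id, avail)| !used.contains(id) && *avail >= request)?;
        let (id, _) = remaining.swap_remove(slot);
        used.insert(id);
        assignment.push(id);
    }
    Some(assignment)
}
```

Skipping zpools the instance already uses reflects the rule, discussed later in the review, that local storage for the same instance should not share any zpools.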
`instance_ensure_registered` now fills in the `delegated_zvols` entries
for local storage disks.
During the instance start saga, _after_ the instance's state has been
changed to `starting`, list all the local storage disks, grab their
allocations, and send ensure requests to sled agents for those local
storage allocations. This occurs after the VMM record has been created,
which means that either each of those local storage disks now has an
allocation or the reservation failed.
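Schematically, and using hypothetical stand-in types (the real saga node is `sis_ensure_local_storage` below; the actual datastore and sled agent client APIs aren't shown in this summary), that step looks something like:

```
// All type and method names here are hypothetical stand-ins.
type SledId = u128;
type DatasetId = u128;
type InstanceId = u128;

struct LocalStorageAllocation {
    sled_id: SledId,
    dataset_id: DatasetId,
    size_bytes: u64,
}

struct Datastore;
impl Datastore {
    async fn local_storage_allocations(
        &self,
        _instance: InstanceId,
    ) -> Result<Vec<LocalStorageAllocation>, String> {
        unimplemented!("stand-in for the real datastore query")
    }
}

struct SledAgentClients;
impl SledAgentClients {
    async fn ensure_local_storage(
        &self,
        _sled: SledId,
        _dataset: DatasetId,
        _size_bytes: u64,
    ) -> Result<(), String> {
        unimplemented!("stand-in for the real sled agent ensure call")
    }
}

async fn ensure_local_storage_for_instance(
    datastore: &Datastore,
    sled_agents: &SledAgentClients,
    instance: InstanceId,
) -> Result<(), String> {
    // By this point the instance is `starting` and the VMM record exists,
    // so either every attached local storage disk has an allocation or the
    // reservation already failed and the saga unwound.
    let allocations = datastore.local_storage_allocations(instance).await?;

    for alloc in &allocations {
        // Send an ensure request to the sled agent hosting the allocation.
        sled_agents
            .ensure_local_storage(alloc.sled_id, alloc.dataset_id, alloc.size_bytes)
            .await?;
    }
    Ok(())
}
```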
One of the things that this PR also introduces is a repetition syntax
for creating a batch of declared saga actions:
```
ENSURE_LOCAL_STORAGE (0, 1, 2, 3, 4, 5, 6, 7) -> "ensure_local_storage" {
+ sis_ensure_local_storage
// No undo action for this, that is handled in the disk delete saga!
}
```
During the disk delete saga, local storage disks will go through a new
path to clean up those allocations, involving a deallocation in the
database and a delete request to the appropriate sled agent.
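Again with hypothetical stand-ins rather than the real Nexus and sled agent API names, the new delete path is roughly:

```
// Hypothetical stand-ins for the disk delete saga's local storage path.
type SledId = u128;
type DatasetId = u128;
type DiskId = u128;

struct LocalStorageAllocation {
    sled_id: SledId,
    dataset_id: DatasetId,
}

struct Datastore;
impl Datastore {
    // Remove the allocation record and return it, if one existed.
    async fn deallocate_local_storage(
        &self,
        _disk: DiskId,
    ) -> Result<Option<LocalStorageAllocation>, String> {
        unimplemented!("stand-in for the real datastore deallocation")
    }
}

struct SledAgentClients;
impl SledAgentClients {
    async fn delete_local_storage(
        &self,
        _sled: SledId,
        _dataset: DatasetId,
    ) -> Result<(), String> {
        unimplemented!("stand-in for the real sled agent delete call")
    }
}

async fn delete_local_storage_disk(
    datastore: &Datastore,
    sled_agents: &SledAgentClients,
    disk: DiskId,
) -> Result<(), String> {
    // Deallocate in the database and send a delete request to the sled
    // agent that hosted the allocation (the exact ordering and unwind
    // behavior are not spelled out in this summary).
    if let Some(alloc) = datastore.deallocate_local_storage(disk).await? {
        sled_agents
            .delete_local_storage(alloc.sled_id, alloc.dataset_id)
            .await?;
    }
    Ok(())
}
```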
ahl left a comment
Is it fair to say that instance allocation may be a place where we focus optimization in the future? It seems like layering in constraints may make this code quadratically more complex. That said, I think it's perfect for where we are now.
good stuff
```
};

// If any local storage disks have been allocated already, then this
// constraints VMM placement and where other unallocated local storage
```
Suggested change:
```
- // constraints VMM placement and where other unallocated local storage
+ // constrains VMM placement and where other unallocated local storage
```
```
// It's an error for multiple sleds to host local storage
// disks, that makes no sense!
```
Would this reflect a failure to uphold some constraint? Or is this a scenario we could encounter... somehow?
Yes, but it's also a scenario that client requests can place us in if a user detaches a local storage disk with an allocation that's on sled A, then attaches it to an instance that has attached local storage with existing allocations on sled B. I've put a comment in c67d8e9 and changed this error to a conflict instead. There's some follow up work to do here.
```
// If there's already a list of sleds to use, that must be
// filtered further. Otherwise return the one sled that local
// storage disks have already been allocated on.
match maybe_must_use_sleds {
    Some(must_use_sleds) => Some(
        must_use_sleds
            .into_iter()
            .filter(|sled| {
                local_storage_allocation_sleds.contains(&sled)
            })
            .collect(),
    ),

    None => Some(local_storage_allocation_sleds),
}
```
I think I may have gotten confused. I expected this to basically say "does maybe_must_use_sleds contain the (exactly one) item from local_storage_allocation_sleds--if it doesn't, that's going to be an error.
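For illustration, a minimal sketch of the check described above, using hypothetical types and error strings (this is not the code the PR contains):

```
use std::collections::HashSet;

type SledId = u128;

// Hypothetical sketch of the check described in the comment above.
fn check_local_storage_sled_constraint(
    maybe_must_use_sleds: Option<&HashSet<SledId>>,
    local_storage_allocation_sleds: &HashSet<SledId>,
) -> Result<SledId, String> {
    // Existing local storage allocations are expected to live on exactly
    // one sled.
    let mut sleds = local_storage_allocation_sleds.iter();
    let sled = match (sleds.next(), sleds.next()) {
        (Some(sled), None) => *sled,
        _ => return Err("local storage allocated on multiple sleds".to_string()),
    };

    // If there's a must-use constraint that doesn't include that sled,
    // the reservation can't succeed.
    if let Some(must_use) = maybe_must_use_sleds {
        if !must_use.contains(&sled) {
            return Err(
                "existing allocations conflict with sled constraints".to_string(),
            );
        }
    }
    Ok(sled)
}
```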
```
maybe_must_use_sleds
};

// If filtering has removed all the sleds from the constraint, then we
```
can this only occur after the filter at 558 or is there another path that might encounter this condition?
It only occurs after the filter - previous to this commit nothing filtered the must_use_sleds.
```
// If there's an existing local storage allocation on a zpool,
// remove that from the list of candidates. Local storage for
// the same Instance should not share any zpools.
```
Can you explain your thinking here? So if someone asks for an instance with, say, 10 100GB local disks, distributing those allocations does maximize IOPS--it also maximizes noisy neighbor opportunities! Perhaps this is theoretical for the current demand, but we might want to think this over.
Yeah, there's more work to be done here. For the first iteration of this we could suggest users use the existing instance affinity mechanics to constrain placement.
```
builder.append(associate_ssh_keys_action());

// Helper function for appending subsagas to our parent saga.
```
moving this for hygiene?
Yep!
@jmpesp and I pair-reviewed a majority of this on a call today, and (among other smaller things) we determined that it will be necessary to change the disk attach operation to ensure that local storage has not been allocated to an instance (during a start saga) when additional local disks are attached; otherwise, it may be possible to create an instance that can never be started (because it has local disks on a sled that lacks the space to support an additional local disk).
this is done in e9e9f57
hawkw left a comment
Thus far, I have mostly looked at the sled allocation query changes. Overall, these make sense to me, especially based on our discussion yesterday. I have a lot more files that I also need to review, but wanted to leave my minor suggestions on just the bit I've read thoroughly for now.
```
// Attaching a local storage disk to the instance has to be blocked
// if sled reservation has occured for this instance: local storage
// allocation records are only created during sled reservation, and
// importantly a particular configuration of local storage
```
nit, sorry:
Suggested change:
```
- // importantly a particular configuration of local storage
+ // importantly, a particular configuration of local storage
```
```
// allocations for an instance are only _validated_ during the sled
// reservation.
//
// The instance start saga perform sled reservation, creates the
```
typo:
Suggested change:
```
- // The instance start saga perform sled reservation, creates the
+ // The instance start saga performs sled reservation, creates the
```
```
// The instance start saga perform sled reservation, creates the
// corresponding VMM record, and changes the instance's runtime
// state all in _separate saga nodes_. This means that there could
```
take it or leave it: it may perhaps be worth noting something here about why this has to happen like that, so the future reader doesn't just think "well why not just change that part?". i recall that we went down that path briefly yesterday.
```
// reservation query and changing the instance's state.
//
// If a client attaches a local storage disk to an instance after
// sled reservation occurs but before the instance's start moves to
```
nit, perhaps:
Suggested change:
```
- // sled reservation occurs but before the instance's start moves to
+ // sled reservation occurs but before the instance's start saga moves to
```
```
//
// - if an allocation does not already exist for the local storage
// disk, the instance_start saga will fail (and unwind) when
// trying to ensure that the allocation's dataset and zvol exist
```
Suggested change:
```
- // trying to ensure that the allocation's dataset and zvol exist
+ // trying to ensure that the allocation's dataset and zvol exist,
```
```
// Not enough zpools to satisfy the number of allocations
// required. Find another sled!
```
i feel like this is a decision that ought to be logged (perhaps importanter than printing the list of zpools, tbh)
```
// if there's no local storage datasets on this sled for
// this request's size, then try another sled.
sled_targets.remove(&sled_target);
banned.remove(&sled_target);
unpreferred.remove(&sled_target);
preferred.remove(&sled_target);
```
this also feels like a place where i would expect us to log "skipping this sled because it has no candidate datasets" or some such?
```
.map(|zpool_get_result| CandidateDataset {
    rendezvous_local_storage_dataset_id:
        zpool_get_result
            .rendezvous_local_storage_dataset_id,
    pool_id: zpool_get_result.pool.id(),
    sled_id: zpool_get_result.pool.sled_id(),
})
```
teensy nit, feel free to disregard me: there are a few ZpoolGetResult to CandidateDataset conversions in here, mayhaps we should have a function that does this?
```
info!(
    &log,
    "calling insert with local storage allocations";
    "allocations" => ?allocations,
);
```
does this also say what sled we are trying to allocate on? does allocations include that?
```
    info!(&log, "reservation succeeded!");
    return Ok(resource);
}
info!(&log, "reservation failed");
```
it's too bad that we don't get to say the allocations again here but i see why we don't