Skip to content

Conversation

simongdavies
Copy link
Contributor

Fixes a race condition where a sandbox kill arrives after a sandbox has successfully exited causing the subsequent run to fail.

There is a breaking change in this PR, previously if kill was called on an InterruptHandle before or while a guest call was not in progress the next guest call made on the Sandbox would be cancelled , now this scenario is a no-op. kill only takes effect if there is a guest call running.

/// retrying until either:
/// - The signal is successfully delivered (VCPU transitions from running to not running)
/// - The VCPU stops running for another reason (e.g., call completes normally)
/// - No call is active (call_active=false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it block? I thought it would just return false?

// The virtualization stack can use this function to return the control
// of a virtual processor back to the virtualization stack in case it
// needs to change the state of a VM or to inject an event into the processor
debug!("Internal cancellation detected, returning Retry error");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// This is acceptable because the generation tracking provides an additional
/// safety layer. Even if a stale kill somehow stamped cancel_requested, the
/// generation mismatch would cause it to be ignored.
call_active: AtomicBool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this flag is redundant. We will already not send any signals when calling kill if the vcpu is not running

simongdavies and others added 13 commits October 23, 2025 19:07
Signed-off-by: James Sturtevant <[email protected]>
Signed-off-by: James Sturtevant <[email protected]>
Signed-off-by: James Sturtevant <[email protected]>
Signed-off-by: James Sturtevant <[email protected]>
Signed-off-by: James Sturtevant <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
Signed-off-by: Simon Davies <[email protected]>
///
/// This is only called on Linux (KVM/MSHV). Windows uses a different interrupt mechanism.
#[cfg(any(kvm, mshv))]
fn increment_call_generation(&self) -> u64;
Copy link
Contributor

@ludfjig ludfjig Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these all method are public right now. I recommend making a new private trait InterruptHandleImpl like so pub(crate) trait InterruptHandleImpl: InterruptHandle and store a dyn of this trait, but return it as a dyn InterruptHandle to the user

}

// Default implementations for common methods
impl dyn InterruptHandle {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you can put default method directly in the trait definition, no need to impl dyn InterruptHandle

) -> Result<ReturnValue> {
// Mark that a guest function call is now active
// (This also increments the generation counter internally)
self.vm.interrupt_handle().set_call_active();
Copy link
Contributor

@ludfjig ludfjig Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could consider making this a function which takes closure like

start_call(||{
...rest of existing code
})

where the function calls set_call_active and clear_call_active for you (and calls your closure in between).

Similar to existing maybe_time_and_emit_host_call

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bugfix For PRs that fix bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants