Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to disable container spec validation when restoring checkpoints #11307

Open
cweld510 opened this issue Dec 18, 2024 · 3 comments
Open
Labels
type: enhancement New feature or request

Comments

@cweld510
Copy link
Contributor

cweld510 commented Dec 18, 2024

Description

We recently upgraded to a gvisor version containing extra validation of the container spec used when restoring checkpoints (i.e. all the logic here). Unfortunately, at the same time, we also discovered that we were inadvertently varying the container spec between checkpoint and restore in lots of subtle ways that, although technically incorrect, weren't causing any obvious bugs in our application. For example, varying the arguments to the container's init process between checkpoint and restore doesn't actually cause anything to break, because we consume the arguments into memory just once, before the checkpoint occurs. The new validation logic turns those small variances in the container spec into big failures that cause production issues.

Is it reasonable to add an option to disable the container spec validation? I'm fully aware that varying the container spec between checkpoint and restore is technically wrong, and I intend to fix it -- but in the interim, it would be nice to disable the validation to maintain stability.

Is this feature related to a specific bug?

No response

Do you have a specific solution in mind?

No response

@cweld510 cweld510 added the type: enhancement New feature or request label Dec 18, 2024
@EtiennePerot
Copy link
Contributor

Seems fine to me so long as it's flag-gated behind a scary-sounding flag, and that the behavior for the now-not-validated fields is consistent (will they all effectively inherit the pre-checkpoint values?)

@pawalt
Copy link
Contributor

pawalt commented Dec 19, 2024

@EtiennePerot yes my understanding is they'll effectively inherit the pre-checkpoint flag. I think it makes sense to put this behind a scary flag and to still emit a warning or error-level log so we can alert on validation errors.

@cweld510
Copy link
Contributor Author

I've created #11323 for this change - let me know if it's acceptable!

For the now-not-validated fields, I believe two things may happen. For fields that don't have any semantic effect at restore time, gvisor will behave according to the values given at checkpoint time. For other fields, differences in the checkpoint/restore specs will still cause loud restore failures in subcomponents of the kernel (e.g. a missing bind mount will cause filesystem restoration to fail).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants