fix(scheduler): Deal with deleted experiments when restoring from cache #5726

Open
wants to merge 34 commits into
base: v2

Member

@sakoush sakoush commented Jun 27, 2024

This PR fixes a bug where deleted experiments are reloaded after the scheduler restarts. This is further complicated by the fact that these reloaded experiments are visible only in the scheduler state and not in the Kubernetes state.

The underlying cause was that we didn't check each experiment's state (whether it was deleted) when restoring from disk on scheduler restart.

This PR adds this check and the necessary changes.

This PR also skips loading experiments that fail validation. Importantly, a validation failure for a particular experiment will not prevent the scheduler from starting. This issue was exposed by this bug.
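
A minimal sketch of the skip-on-failure behaviour, for illustration only. The `Experiment` type, `restoreAll`, and `validate` below are hypothetical stand-ins, not the scheduler's actual code; the point is that a failing callback is logged and skipped rather than returned to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Experiment is a hypothetical, pared-down stand-in for the scheduler's type.
type Experiment struct {
	Name string
}

// restoreAll calls startCb for every experiment. A failing callback is logged
// as a warning and that experiment is skipped; the error is not returned, so
// one bad experiment cannot stop the caller from starting.
// It returns the names of the experiments that were restored.
func restoreAll(experiments []*Experiment, startCb func(*Experiment) error) []string {
	restored := make([]string, 0, len(experiments))
	for _, e := range experiments {
		if err := startCb(e); err != nil {
			log.Printf("warning: skipping experiment %q on restore: %v", e.Name, err)
			continue
		}
		restored = append(restored, e.Name)
	}
	return restored
}

// validate is a hypothetical callback that rejects experiments with no name.
func validate(e *Experiment) error {
	if e.Name == "" {
		return errors.New("experiment name is empty")
	}
	return nil
}

func main() {
	exps := []*Experiment{{Name: "a"}, {Name: ""}, {Name: "b"}}
	fmt.Println(restoreAll(exps, validate)) // prints [a b]
}
```

The trade-off is that a bad experiment is only visible in the logs, so the warning needs to carry enough context (here, the experiment name and the error) to be actionable.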

Note that for pipelines we do not have a validation step when restoring from disk.

Implementation

Added an ExperimentSnapshot proto in mlops/scheduler/storage.proto. It adds an extra Deleted field to the experiment protos that we persist on disk, so that on restore we can check whether the experiment was deleted.
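
As a rough illustration of the idea (not the generated proto code), the persisted record can be thought of as the experiment plus a deleted flag. The sketch below assumes a hypothetical Go struct and uses `encoding/json` in place of protobuf only to stay self-contained; the field layout and helper names are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Experiment is a hypothetical, pared-down stand-in for the experiment payload.
type Experiment struct {
	Name string `json:"name"`
}

// ExperimentSnapshot is a hypothetical mirror of the persisted record:
// the experiment itself plus a flag recording whether it was deleted.
type ExperimentSnapshot struct {
	Experiment Experiment `json:"experiment"`
	Deleted    bool       `json:"deleted"`
}

// encodeSnapshot serialises a snapshot for storage (JSON here, protobuf in the PR).
func encodeSnapshot(s ExperimentSnapshot) ([]byte, error) {
	return json.Marshal(s)
}

// decodeSnapshot reads a stored snapshot back, including the Deleted flag.
func decodeSnapshot(b []byte) (ExperimentSnapshot, error) {
	var s ExperimentSnapshot
	err := json.Unmarshal(b, &s)
	return s, err
}

func main() {
	raw, err := encodeSnapshot(ExperimentSnapshot{
		Experiment: Experiment{Name: "exp-a"},
		Deleted:    true,
	})
	if err != nil {
		panic(err)
	}
	snap, err := decodeSnapshot(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(snap.Experiment.Name, snap.Deleted) // prints exp-a true
}
```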

Also added a get helper for the DB (badgerdb) so that we can test what's stored on disk. I also increased test coverage while working on this area of the codebase.

fixes: INFRA-1055 (internal)

TODO:

  • Add migration path for experiments DB.

@sakoush sakoush requested a review from lc525 as a code owner June 27, 2024 17:01
@sakoush sakoush added the v2 label Jun 27, 2024
err = startExperimentCb(experiment)
if err != nil {
return err
// skip restoring the experiment if the callback returns an error
Member

May I suggest we slightly alter the comment to avoid confusion: the code doesn't skip anything, it simply swallows the error and logs a warning. If we bubbled the error up, it would end up stopping the scheduler with a Fatal error in main().

I was thinking of something like: "If the callback fails, do not bubble the error up but simply log it as a warning. The experiment restore is skipped instead of the scheduler failing due to the returned error."

Member

@lc525 lc525 left a comment

LGTM, just a minor comment regarding the code comment that describes the change.

@sakoush sakoush marked this pull request as draft June 28, 2024 11:52
@sakoush sakoush marked this pull request as ready for review July 1, 2024 18:09
@sakoush sakoush requested a review from lc525 July 1, 2024 18:09
@@ -73,13 +76,46 @@ func (edb *ExperimentDBManager) restore(startExperimentCb func(*Experiment) erro
return err
}
experiment := CreateExperimentFromRequest(&snapshot)
if experiment.Deleted {
Member Author

This is the crux of the change: we now store the deleted flag, and on restore we just add a deleted experiment to the in-memory store without (re)starting it.
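
A sketch of that restore path, under stated assumptions: the `store` type, its fields, and the `restore` method here are hypothetical and much simpler than `ExperimentDBManager`. Deleted experiments are recorded in memory but the start callback is never invoked for them.

```go
package main

import "fmt"

// Experiment is a hypothetical stand-in carrying only what the sketch needs.
type Experiment struct {
	Name    string
	Deleted bool
}

// store is a hypothetical in-memory experiment store keyed by name.
type store struct {
	experiments map[string]*Experiment
}

func newStore() *store {
	return &store{experiments: make(map[string]*Experiment)}
}

// restore replays persisted experiments. Deleted ones are only recorded in
// memory; live ones are passed to startCb and recorded if it succeeds.
func (s *store) restore(persisted []*Experiment, startCb func(*Experiment) error) {
	for _, e := range persisted {
		if e.Deleted {
			// Keep the record so the deletion is known, but do not start it.
			s.experiments[e.Name] = e
			continue
		}
		if err := startCb(e); err != nil {
			// Skip this experiment instead of aborting the whole restore.
			fmt.Printf("warning: could not restore %q: %v\n", e.Name, err)
			continue
		}
		s.experiments[e.Name] = e
	}
}

func main() {
	s := newStore()
	var started []string
	persisted := []*Experiment{{Name: "live"}, {Name: "gone", Deleted: true}}
	s.restore(persisted, func(e *Experiment) error {
		started = append(started, e.Name)
		return nil
	})
	fmt.Println(started, len(s.experiments)) // prints [live] 2
}
```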

@sakoush sakoush changed the title fix(scheduler): Skip bad experiments when restoring from cache fix(scheduler): Deal with deleted experiments when restoring from cache Jul 2, 2024
2 participants