fix(scheduler): Deal with deleted experiments when restoring from cache #5726
base: v2
scheduler/pkg/store/experiment/db.go
err = startExperimentCb(experiment)
if err != nil {
return err
// skip restoring the experiment if the callback returns an error
May I suggest we slightly alter the comment to avoid confusion: the code doesn't skip anything, it simply swallows the error and logs a warning. Had we bubbled the error up, it would have stopped the scheduler with a Fatal error in main().
I was thinking of something like: "If the callback fails, do not bubble the error up but simply log it as a warning. The experiment restore is skipped instead of the scheduler failing due to the returned error."
lgtm, just a minor comment relative to the comment describing the code change.
@@ -73,13 +76,46 @@ func (edb *ExperimentDBManager) restore(startExperimentCb func(*Experiment) erro
return err
}
experiment := CreateExperimentFromRequest(&snapshot)
if experiment.Deleted {
This is the crux of the change: we now store the Deleted flag, and on restore we just add a deleted experiment to the in-memory store without (re)starting it.
This PR fixes a bug that reloads deleted experiments after the scheduler restarts. This is further complicated by the fact that these reloaded experiments are only visible in the scheduler state and not in the Kubernetes state.
The underlying cause was that we didn't check an experiment's state (whether it is deleted) when restoring from disk on scheduler restart.
This PR adds this check and the necessary changes.
This PR also skips loading experiments that fail validation. Importantly, a validation failure for a particular experiment will not prevent the scheduler from starting. This issue was exposed by this bug.
Note that for pipelines we do not have a validation step when restoring from disk.
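The restore behaviour described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the trimmed-down Experiment struct, the map used as the in-memory store, the restoreAll function, and the errInvalid error are all hypothetical, and the standard log package stands in for the scheduler's logger.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Experiment is a hypothetical, trimmed-down stand-in for the scheduler's type.
type Experiment struct {
	Name    string
	Deleted bool
}

// errInvalid is a hypothetical error used to simulate a failed validation.
var errInvalid = errors.New("validation failed")

// restoreAll loads persisted experiments into an in-memory store.
// Deleted experiments are added to the store but never started.
// If the start callback fails, the error is logged as a warning and that
// experiment is skipped, so one bad entry does not abort the whole restore.
func restoreAll(persisted []*Experiment, start func(*Experiment) error) (map[string]*Experiment, []string) {
	store := make(map[string]*Experiment)
	var started []string
	for _, e := range persisted {
		if e.Deleted {
			store[e.Name] = e
			continue
		}
		if err := start(e); err != nil {
			log.Printf("warning: skipping restore of experiment %q: %v", e.Name, err)
			continue
		}
		store[e.Name] = e
		started = append(started, e.Name)
	}
	return store, started
}

func main() {
	persisted := []*Experiment{
		{Name: "a"},
		{Name: "b", Deleted: true},
		{Name: "c"},
	}
	// Simulate a callback that rejects experiment "c".
	start := func(e *Experiment) error {
		if e.Name == "c" {
			return errInvalid
		}
		return nil
	}
	store, started := restoreAll(persisted, start)
	fmt.Println(len(store), started)
}
```

In this sketch the deleted experiment "b" ends up in the store without being started, and the failing experiment "c" is left out entirely while the restore carries on.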
Implementation
Added an ExperimentSnapshot proto in mlops/scheduler/storage.proto that adds an extra Deleted field to the experiment protos we persist on disk, so that on restore we can check whether the experiment is deleted.
Also added a get helper for the DB (badgerdb) so that we can test what's stored on disk. I also increased test coverage while working on this area of the codebase.
fixes: INFRA-1055 (internal)
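As a rough illustration of why persisting a Deleted flag and having a get helper is useful in tests, here is a minimal sketch. It deliberately swaps in stand-ins: encoding/json replaces the protobuf encoding, and a plain map replaces badgerdb. The kvStore type, its save and get methods, and the Name field are hypothetical; only the idea of a snapshot carrying a Deleted field comes from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExperimentSnapshot is a hypothetical stand-in for the persisted record:
// an experiment identifier plus the Deleted flag checked on restore.
type ExperimentSnapshot struct {
	Name    string `json:"name"`
	Deleted bool   `json:"deleted"`
}

// kvStore is a hypothetical in-memory stand-in for the on-disk key-value DB.
type kvStore map[string][]byte

// save serialises the snapshot and stores it under the experiment name.
func (kv kvStore) save(s ExperimentSnapshot) error {
	b, err := json.Marshal(s)
	if err != nil {
		return err
	}
	kv[s.Name] = b
	return nil
}

// get reads back what is stored for a name, so tests can assert on the
// persisted state rather than only on the in-memory state.
func (kv kvStore) get(name string) (ExperimentSnapshot, error) {
	var s ExperimentSnapshot
	b, ok := kv[name]
	if !ok {
		return s, fmt.Errorf("experiment %q not found", name)
	}
	err := json.Unmarshal(b, &s)
	return s, err
}

func main() {
	kv := kvStore{}
	if err := kv.save(ExperimentSnapshot{Name: "exp1", Deleted: true}); err != nil {
		panic(err)
	}
	s, err := kv.get("exp1")
	fmt.Println(s.Deleted, err)
}
```

With a helper like this, a test can delete an experiment, read the stored record back, and assert that the Deleted flag survived the round trip.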
TODO: