fix(scheduler): Deal with deleted experiments when restoring from cache #5726

sakoush · 2024-06-27T17:01:21Z

This PR fixes a bug where deleted experiments are reloaded after a scheduler restart. The bug is made worse by the fact that these reloaded experiments are visible only in the scheduler state and not in the Kubernetes state.

The underlying cause was that we didn't check the experiments' state (whether they are deleted) when restoring from disk on scheduler restart.

This PR adds this check and the necessary changes.

This PR also skips loading experiments that fail validation. Importantly, a validation failure for a particular experiment no longer prevents the scheduler from starting. This issue was exposed by the bug above.

Note that for pipelines we do not have a validation step when restoring from disk.

Implementation

Adds an ExperimentSnapshot proto in mlops/scheduler/storage.proto that adds an extra Deleted field to the experiment protos we persist on disk, so that on restore we can check whether an experiment is deleted.

Also adds a get helper for the DB (badgerdb) so that tests can check what is stored on disk. I also increased test coverage while working on this area of the codebase.
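
As a rough sketch of how the Deleted flag is used on restore (assumptions: `ExperimentSnapshot` here is a hypothetical plain Go struct rather than the generated proto type, the in-memory store is a map, and `restore` and `start` are illustrative names, not the real signatures):

```go
package main

import "fmt"

// ExperimentSnapshot is a hypothetical plain-Go mirror of a persisted record:
// an experiment identified by name, plus the Deleted flag.
type ExperimentSnapshot struct {
	Name    string
	Deleted bool
}

// restore adds every snapshot to the in-memory store, but only calls start
// for experiments that are not marked as deleted.
func restore(snapshots []ExperimentSnapshot, start func(name string)) map[string]ExperimentSnapshot {
	store := make(map[string]ExperimentSnapshot, len(snapshots))
	for _, s := range snapshots {
		store[s.Name] = s
		if s.Deleted {
			continue // kept in the store, but never (re)started
		}
		start(s.Name)
	}
	return store
}

func main() {
	var started []string
	snapshots := []ExperimentSnapshot{
		{Name: "live"},
		{Name: "gone", Deleted: true},
	}
	store := restore(snapshots, func(name string) { started = append(started, name) })
	fmt.Println(len(store), started)
}
```

Keeping the deleted record in the store, instead of dropping it, means the scheduler still knows about the experiment after a restart without bringing it back to life.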

fixes: INFRA-1055 (internal)

TODO:

Add migration path for experiments DB.

lc525 · 2024-06-27T17:28:17Z

scheduler/pkg/store/experiment/db.go

 				err = startExperimentCb(experiment)
 				if err != nil {
-					return err
+					// skip restoring the experiment if the callback returns an error


May I suggest we slightly alter the comment to avoid confusion: the code doesn't skip anything, it simply swallows the error and logs a warning. If we had bubbled the error up, it would have ended up stopping the scheduler with a Fatal error in main().

I was thinking of something like: "If the callback fails, do not bubble the error up but simply log it as a warning. The experiment restore is skipped instead of the scheduler failing due to the returned error."

lc525

lgtm, just a minor comment relative to the comment describing the code change.

sakoush · 2024-07-01T18:14:15Z

scheduler/pkg/store/experiment/db.go

@@ -73,13 +76,46 @@ func (edb *ExperimentDBManager) restore(startExperimentCb func(*Experiment) erro
 					return err
 				}
 				experiment := CreateExperimentFromRequest(&snapshot)
+				if experiment.Deleted {


This is the crux of the change: we now store the deleted flag, and on restore we just add the experiment to the in-memory store without (re)starting it.

sakoush added 5 commits June 27, 2024 17:28

remove dead code path

53f16ef

skip restoring an experiment if there is an error.

a3b2346

add a note that we do not validate pipelines when we restore them

bf2b506

Add test

c74a958

fix fmt

7a44d69

sakoush requested a review from lc525 as a code owner June 27, 2024 17:01

sakoush added the v2 label Jun 27, 2024

lc525 reviewed Jun 27, 2024

lc525 approved these changes Jun 27, 2024

sakoush added 8 commits June 27, 2024 20:05

update note in code

a20ced5

fix bug in tag

3831568

deal with deleted experiments on restore

bf1c14e

use a call back for deleted experiments

a8d30f1

add test for multiple experiments in db

7145062

update store to mark deleted experiments

2d7269d

add experiment get (for testing)

cbd365a

Add active field in experiment protos

c35f9fc

sakoush marked this pull request as draft June 28, 2024 11:52

sakoush added 9 commits July 1, 2024 14:13

add deleted instead of active

dc0dd34

make deleted field not optional

ca07e42

handle deleted in controller for experiments

2ecfb58

fix restoring of experiments

4a29cb5

add compare for the entire proto

2c69877

add pipeline get from db helper (for testing)

8764d24

fix lint

22ce843

add test for db check after adding pipeline

c3c579d

add nil check from pipeline add

a975b58

sakoush marked this pull request as ready for review July 1, 2024 18:09

sakoush requested a review from lc525 July 1, 2024 18:09

sakoush commented Jul 1, 2024

sakoush added 2 commits July 2, 2024 09:36

introduce ExperimentSnapshot proto

e30fcb3

add testing coverage

55f1e06

sakoush changed the title ~~fix(scheduler): Skip bad experiments when restoring from cache~~ fix(scheduler): Deal with deleted experiments when restoring from cache Jul 2, 2024

sakoush added 10 commits July 2, 2024 11:04

revert changes to operator as they are not required anymore

cebbc8f

add experiment db migration helper

1d71868

reinstate delete helper for dbs

aee3d49

simplify get from DB

5c78e0e

add testing for delete from db

faf43b3

add scafolding to get the version from the (experiment) db

715d009

use dropall helper to clear db

b0a1817

optimize how to migrate to the new version

a4d469e

refactor common code to utils

bc15028

add version to pipelinedb

6c608d6
