Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(scheduler): Scheduler wait on server connect #5930

Merged
merged 41 commits into from
Sep 24, 2024

Conversation

sakoush
Copy link
Member

@sakoush sakoush commented Sep 23, 2024

What this PR does / why we need it:

This change introduces a Synchroniser on the scheduler side to coordinate the startup process and especially to attempt to wait for expected model servers to connect before proceeding with any control plane operations. This is important in the case that the scheduler restarts so that we do no interrupt ongoing dataplane operations that should be valid to be served.

There are 2 implementations for this Synchroniser:

  • SimpleSynchroniser: a timeout based sync for non k8s deployments
  • ServerBasedSynchroniser: this is for k8s deployments that tries to match expected server states (from etcd k8s controller) with actual agents connecting to the scheduler. More description provided here.

Other changes in this PR:

  • Explicitly mark the first ServerNotify event from the controller to indicate the list of expected servers collected from k8s etcd.
  • Add ServerEvent in event hub so that the system propagates events related to servers, which are then consumed by the Synchronise. Thanks to @lc525 for this change.
  • More test coverage for related part of the code base.
  • Added param for helm to specify this value.

This PR builds on top of #5893.

Which issue(s) this PR fixes:

Fixes #
INFRA-747 (internal)

@sakoush sakoush requested a review from lc525 as a code owner September 23, 2024 12:02
@sakoush sakoush added the v2 label Sep 23, 2024
@sakoush sakoush marked this pull request as draft September 23, 2024 12:02
@sakoush
Copy link
Member Author

sakoush commented Sep 23, 2024

The idea of the test with a pipeline is to make sure that we dont have any dropped messages.
Steps:

  • Run k6 load
  • Restart scheduler while k6 load is going

on 2.8.3 we have 15 messages dropped
image

with this change we dont have any dropped messages
image

@sakoush sakoush marked this pull request as ready for review September 24, 2024 08:36
Copy link
Member

@lc525 lc525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, with some really minor comments. I think this addition is quite valuable in making the scheduler behave a lot better on restarts!

@sakoush sakoush merged commit c7aa567 into SeldonIO:v2 Sep 24, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants