More robust validation and error handling for job worker selection #716

Open
robertbartel opened this issue Sep 10, 2024 · 0 comments
Labels
bug (Something isn't working), maas (MaaS Workstream)

Fix the handling of worker selection for requested jobs in the Launcher class of dmod.scheduler (and potentially its subclasses; see #703). The current implementation is quite brittle: it does not account for non-default configuration adjustments in the deployment at large, and it does not properly handle all possible situations (a problem that will be amplified after #662).

Current behavior

In the current, Docker-based implementation, the determine_image_for_job function selects the Docker image for a job's worker. It hard-codes the registry to 127.0.0.1:5000, even though DMOD supports configuring the internal registry differently via the deployment environment config. For example, if DOCKER_INTERNAL_REGISTRY is set to something else in the .env config file, that value is used when worker images are built and pushed, but it is not what the Launcher tries to use.
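To illustrate the mismatch, here is a minimal sketch of deriving the image name from configuration rather than a literal. It is not the existing determine_image_for_job code: the helper name `resolve_worker_image` and its parameters are hypothetical, and it assumes the value of DOCKER_INTERNAL_REGISTRY from the .env file is actually visible in the scheduler's process environment, which may not hold in every deployment.

```python
import os

# The value the Launcher currently hard-codes, kept only as a fallback here.
DEFAULT_REGISTRY = "127.0.0.1:5000"


def resolve_worker_image(image_name="ngen", tag="latest", env=None):
    """Build '<registry>/<image>:<tag>' from configuration (hypothetical helper).

    `env` is any mapping of settings; it defaults to the process environment.
    """
    env = os.environ if env is None else env
    # Treat an unset, empty, or whitespace-only value as "not configured".
    configured = (env.get("DOCKER_INTERNAL_REGISTRY") or "").strip().rstrip("/")
    registry = configured or DEFAULT_REGISTRY
    return f"{registry}/{image_name}:{tag}"


# With no override, this reproduces the currently hard-coded name.
print(resolve_worker_image(env={}))
# With an override, the name follows the deployment configuration.
print(resolve_worker_image(env={"DOCKER_INTERNAL_REGISTRY": "registry.example:5001"}))
```

Passing the settings in as a mapping keeps the lookup testable and leaves open where the Launcher ultimately obtains its configuration.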

A related but separate flaw: there is limited (if any) validation of whether the desired image exists in the referenced registry, and limited (if any) graceful error handling when it does not. Even if the right registry is configured, the desired image may not have been pushed yet. Strictly speaking, this is a potential problem even with the current hard-coded restriction to "127.0.0.1:5000/ngen:latest"; once #662 is complete, the practical situations in which it can happen expand greatly.

Expected behavior

The Launcher class should properly reflect non-default configuration settings for the internal Docker registry, or otherwise be able to keep its behavior in sync with those settings wherever they are defined. It should also be able to gracefully handle (in tandem with other DMOD classes and services) the error condition of an expected or requested job worker image version not being available in the registry configured for the deployment.
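One possible shape for the graceful-handling part is sketched below. Everything here is an assumption made for illustration: `WorkerImageUnavailableError` and `validate_worker_image` are hypothetical names, not existing DMOD APIs, and the registry lookup is injected as a plain callable so the sketch runs without Docker. In a Docker-based deployment, that callable could plausibly wrap the Docker SDK's `images.get_registry_data` call.

```python
from typing import Callable, Optional


class WorkerImageUnavailableError(Exception):
    """Hypothetical error: the requested worker image is not in the registry."""

    def __init__(self, image: str, reason: Optional[str] = None):
        self.image = image
        self.reason = reason
        msg = f"worker image '{image}' is not available"
        super().__init__(msg if reason is None else f"{msg}: {reason}")


def validate_worker_image(image: str, image_exists: Callable[[str], bool]) -> str:
    """Return `image` if the lookup confirms it exists, else raise a clear error.

    `image_exists` is an assumed, injected lookup taking a full image name.
    """
    try:
        found = image_exists(image)
    except Exception as e:
        # A failed lookup (e.g. registry unreachable) is reported the same way,
        # so callers handle a single, well-defined error type.
        raise WorkerImageUnavailableError(image, f"registry lookup failed: {e}") from e
    if not found:
        raise WorkerImageUnavailableError(image, "not found in registry")
    return image


# Usage: a set stands in for the images that have been pushed so far.
pushed = {"127.0.0.1:5000/ngen:latest"}
try:
    validate_worker_image("127.0.0.1:5000/ngen:v2", pushed.__contains__)
except WorkerImageUnavailableError as err:
    print(f"Job rejected: {err}")
```

Raising one dedicated exception type before any service is started would let the Launcher, and the services that call it, report the failure to the requester instead of failing later during launch.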

If other similar subclasses are developed related to #703 before this issue is resolved, they should also properly reflect/align with configuration as applicable for that implementation, and gracefully handle expected worker versions being unavailable.
