This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

How to programmatically determine if a training job has finished using kubectl? #130

Closed
darthsuogles opened this issue Oct 3, 2020 · 2 comments

Comments

@darthsuogles

❓ Questions and Help

How to programmatically determine if a training job has finished using kubectl?
The field `status.replicaStatuses.Worker.succeeded` seems to indicate the number of succeeded pods.
How does one determine whether the job as a whole has succeeded?
This is useful when the training job is part of a workflow (e.g. orchestrated by Argo or Airflow).


@darthsuogles
Author

CC: @kiukchung

@kiukchung
Contributor

Thanks for the question. This is something that can be improved in our current k8s controller. Because we don't rely on any form of gang scheduling and instead deploy each container as its own Pod, there is no scheduler-provided "done" signal to use when an elastic job is part of a workflow. You can fall back on heuristics: for instance, in "non-elastic" mode (min == max), you could use the number of succeeded workers. Alternatively, have your job touch a "COMPLETE" file on S3 (or the like) and kick off the downstream dependency based on that.
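As a rough sketch of the succeeded-workers heuristic (not something the controller provides), a workflow step could read the job as JSON via `kubectl` and compare `status.replicaStatuses.Worker.succeeded` to the worker count you launched with. The function names, the `expected_workers` parameter, and the `elasticjob` resource name are assumptions for illustration; adjust them to your cluster. This only makes sense in "non-elastic" mode, where the worker count is fixed.

```python
import json
import subprocess


def all_workers_succeeded(job: dict, expected_workers: int) -> bool:
    """Heuristic: treat the job as done once the succeeded-worker count
    reaches the number of workers it was launched with (min == max)."""
    status = job.get("status") or {}
    worker = (status.get("replicaStatuses") or {}).get("Worker") or {}
    return int(worker.get("succeeded") or 0) >= expected_workers


def fetch_job(name: str, namespace: str, resource: str = "elasticjob") -> dict:
    """Fetch the job object as a dict using `kubectl get ... -o json`.

    `resource` is an assumed resource name for the custom resource;
    check `kubectl api-resources` for the one registered in your cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", resource, name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

A workflow task could call `all_workers_succeeded(fetch_job(name, namespace), expected_workers)` in a polling loop and exit once it returns `True`. Note that this does not detect failure, so a timeout on the polling step is still advisable.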

FWIW, we will integrate elastic into the existing PyTorch operator in Kubeflow (#117).
