This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
How to programmatically determine if a training job has finished using kubectl?
The field status.replicaStatuses.Worker.succeeded seems to indicate the number of succeeded pods.
How does one determine if the whole job has succeeded?
This is useful when the training job is part of a workflow (e.g. orchestrated by argo or airflow).
Thanks for the question. This is something that can be improved in our current k8s controller. Since we don't rely on any type of gang scheduling and instead deploy each container as a Pod, there is no scheduler-provided "done" signal that we can use when you run an elastic job as part of a workflow.
You can use certain heuristics. For instance, if you are in "non-elastic" mode (min == max), you could use the number of succeeded workers. Alternatively, have your job touch a "COMPLETE" file on S3 (or the like) and kick off the downstream dependency based on that.
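For the succeeded-worker heuristic, a rough sketch of what a workflow step might run is below. It only applies to the "non-elastic" case (min == max), where the expected worker count is known up front. The function names, the `resource` argument (whatever name the job's custom resource is registered under in your cluster), and the `expected_workers` parameter are illustrative assumptions, not part of the controller; the only field taken from this thread is status.replicaStatuses.Worker.succeeded.

```python
import json
import subprocess


def job_succeeded(job: dict, expected_workers: int) -> bool:
    """Heuristic: the job is done once every expected worker has succeeded.

    Only meaningful in "non-elastic" mode (min == max), where
    expected_workers is fixed. `job` is the parsed JSON of the job object.
    """
    status = job.get("status") or {}
    replica_statuses = status.get("replicaStatuses") or {}
    worker = replica_statuses.get("Worker") or {}
    # A missing field is treated as zero succeeded workers.
    return (worker.get("succeeded") or 0) >= expected_workers


def fetch_job(resource: str, name: str, namespace: str) -> dict:
    """Fetch the job object as JSON via kubectl.

    `resource` is an assumption: pass the resource name your cluster
    uses for the training job's custom resource.
    """
    result = subprocess.run(
        ["kubectl", "get", resource, name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

A workflow task could call `fetch_job(...)` in a polling loop with a sleep and a timeout, and pass the result to `job_succeeded(...)`. Note that this says nothing about failed workers, so a timeout is needed to avoid waiting forever on a job that never completes.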
FWIW, we will integrate elastic into the existing PyTorch operator in Kubeflow (#117).