How to programmatically determine if a training job has finished using `kubectl`? #130

darthsuogles · 2020-10-03T06:12:30Z

❓ Questions and Help

How to programmatically determine if a training job has finished using `kubectl`?
The field `status.replicaStatuses.Worker.succeeded` seems to indicate the number of succeeded pods.
How does one determine if the whole job has succeeded?
This is useful when the training job is part of a workflow (e.g. orchestrated by Argo or Airflow).

darthsuogles · 2020-10-05T18:35:43Z

CC: @kiukchung

kiukchung · 2020-10-05T19:03:32Z

Thanks for the question. This is something we can improve in our current k8s controller. We don't rely on any type of gang scheduling; each container is deployed as its own Pod, so there is no scheduler-provided "done" signal to use when an elastic job runs as part of a workflow.

You can use certain heuristics instead. For instance, if you are in "non-elastic" mode (min == max), you could use the number of succeeded workers. Alternatively, have your job touch a "COMPLETE" file on S3 (or the like) and kick off the downstream dependency based on that.
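For illustration, the "number of succeeded workers" heuristic could be sketched roughly as follows. This is a minimal sketch and not something the controller provides: the `status.replicaStatuses.Worker.succeeded` path comes from the question above, while the `spec.minReplicas` / `spec.maxReplicas` field names, the `elasticjob` resource name passed to `kubectl`, and the helper names `job_succeeded` and `fetch_job` are assumptions that should be checked against the CRD installed in your cluster.

```python
import json
import subprocess


def job_succeeded(job: dict) -> bool:
    """Heuristic for non-elastic mode (min == max) only: treat the job as
    succeeded once the succeeded-worker count reaches the replica count."""
    spec = job.get("spec") or {}
    min_replicas = spec.get("minReplicas")  # assumed field name
    max_replicas = spec.get("maxReplicas")  # assumed field name
    if min_replicas is None or min_replicas != max_replicas:
        raise ValueError("heuristic only applies when min == max")
    status = job.get("status") or {}
    worker = (status.get("replicaStatuses") or {}).get("Worker") or {}
    return worker.get("succeeded", 0) >= max_replicas


def fetch_job(name: str, namespace: str) -> dict:
    """Read the job object as JSON via kubectl.
    The resource name "elasticjob" is an assumption."""
    result = subprocess.run(
        ["kubectl", "get", "elasticjob", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

A workflow step could call `job_succeeded(fetch_job(name, namespace))` in a polling loop with a timeout. Note that this check only reports success; a job whose workers fail never satisfies it, so the loop needs its own failure or timeout handling.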

FWIW, we will integrate elastic into the existing PyTorch operator in Kubeflow (#117).

darthsuogles closed this as completed Oct 28, 2020
