https://www.kubeflow.org/docs/components/training/overview/ says: "create a TFJob/PyTorchJob with required number PSs, workers, and GPUs using Training Operator Python SDK." But it doesn't seem to be possible to add workers or PSs from other machines using the Training Operator Python SDK, only replicas within a single machine?
@akrupien Could you please explain what you mean by "add workers, or Ps from other machines"?
When you add more workers, Training Operator creates more Kubernetes pods, and those pods are scheduled onto the appropriate Kubernetes nodes.
You can also specify a Pod node selector if you want Pods to be assigned to a specific Kubernetes node (machine).
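To make the scheduling point concrete, here is a minimal sketch in plain Python that builds a TFJob manifest as a dict, without depending on the SDK. It assumes each worker should land on a different node, which it requests with standard Kubernetes pod anti-affinity on `kubernetes.io/hostname`, plus an optional node selector. The helper name `build_tfjob`, the label `example/tfjob`, the image name, and the GPU count are all hypothetical placeholders, not anything taken from the Kubeflow docs.

```python
import json


def build_tfjob(name, image, workers, gpus_per_worker, node_label=None):
    """Return a TFJob manifest (plain dict) that asks for one Worker pod per node.

    All argument values are illustrative; adjust them to your cluster.
    """
    # Hypothetical label, used only so the anti-affinity rule can find sibling pods.
    pod_labels = {"example/tfjob": name}

    pod_spec = {
        "containers": [
            {
                "name": "tensorflow",  # container name that TFJob expects
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
            }
        ],
        # Refuse to schedule two pods with the same label on the same hostname,
        # so each worker replica ends up on a different machine.
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {"matchLabels": pod_labels},
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        },
    }
    if node_label:
        # Optional: restrict workers to nodes carrying these labels.
        pod_spec["nodeSelector"] = dict(node_label)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "metadata": {"labels": pod_labels},
                        "spec": pod_spec,
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    job = build_tfjob("mwms-example", "example.com/my-train:latest",
                      workers=3, gpus_per_worker=4)
    print(json.dumps(job, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming the Training Operator is installed in the cluster. With three machines of four GPUs each, `workers=3` and `gpus_per_worker=4` would ask for one pod per machine, each claiming all of that machine's GPUs.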
@andreyvelich Thank you, I think you sort of read my mind: I want each pod to be assigned to a single machine/computer. In my case, I want the pod to be assigned to the entire machine/computer.
By "add workers, or Ps from other machines", I mean that I have multiple computers/machines, each with multiple GPUs, and each computer/machine should be its own worker.
When I create a TFJob with the required number of workers using Training Operator, I'd expect it to match the TF config in my TensorFlow distributed training. So I should be able to add my individual computers/machines as workers?
I am using MultiWorkerMirroredStrategy in my TensorFlow distributed training with multiple computers/machines. Each computer/machine is its own worker.
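For reference, MultiWorkerMirroredStrategy reads the cluster layout from the `TF_CONFIG` environment variable, and Training Operator sets that variable in each TFJob pod. The snippet below is only a sketch of how a training script could inspect it to check that the worker list matches what is expected. The helper name `describe_tf_config` and the sample host names are made up for illustration.

```python
import json
import os


def describe_tf_config(raw=None):
    """Summarize TF_CONFIG: the worker addresses and this process's own task."""
    if raw is None:
        # Fall back to an empty config when the variable is not set.
        raw = os.environ.get("TF_CONFIG", "{}")
    cfg = json.loads(raw)
    workers = cfg.get("cluster", {}).get("worker", [])
    task = cfg.get("task", {})
    return {
        "workers": workers,
        "num_workers": len(workers),
        "task_type": task.get("type"),
        "task_index": task.get("index"),
    }


if __name__ == "__main__":
    # Hypothetical two-worker layout, in the shape TensorFlow documents for TF_CONFIG.
    sample = json.dumps({
        "cluster": {"worker": ["worker-0.example:2222", "worker-1.example:2222"]},
        "task": {"type": "worker", "index": 1},
    })
    print(describe_tf_config(sample))
```

Calling `describe_tf_config()` with no argument inside a running worker pod would report what that pod was actually given.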
https://www.kubeflow.org/docs/components/training/tftraining/
The TFReplicaSpec in the Training Operator SDK doesn't seem to provide an option for adding individual machines as workers, only replicas within a single machine? But maybe I'm missing something under the spec, if a TFReplicaSpec is not necessary for a TFJob.
Your link seems to assign a pod to a node. Is it possible in my situation to use pod affinity to add my multiple workers/computers/machines to a TFJob?
In my situation, a node is a machine, which is an individual computer, which is its own single pod.
I am essentially asking how to use Training Operator to add my workers/computers/machines as pods, each on its own node.
Ideally, my entire cluster would be a single pod, but that doesn't seem possible.
How do you add other machines? This example just created replicas within a single machine. Is Kubeflow not capable of adding other machines?