Submit an ETJob as a training job.
arena submit etjob [flags]
-a, --annotation strings the annotations, usage: "--annotation=key=value" or "--annotation key=value"
--config-file strings configuration files to provide when submitting jobs, usage: "--config-file <host_path_file>:<container_path_file>"
--cpu string the cpu resource to use for the training, like 1 for 1 core.
-d, --data strings specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir strings the data dir. If you specify /data, it means mounting hostpath /data into container path /data
--device stringArray specify the chip vendors and counts to use as resources, such as amd.com/gpu=1 or gpu.intel.com/i915=1.
-e, --env strings the environment variables
--gang enable gang scheduling
--gpus int the GPU count of each worker to run the training.
-h, --help help for etjob
--image string the docker image name of training job
--image-pull-secret strings names of the imagePullSecrets to use when pulling from a private registry, usage: "--image-pull-secret <name1>"
--logdir string the training logs dir (default "/training_logs")
--max-workers int the max worker number to run the distributed training. (default 1000)
--memory string the memory resource to use for the training, like 1Gi.
--min-workers int the min worker number to run the distributed training. (default 1)
--name string override name
-p, --priority string priority class name
--rdma enable RDMA
--retry int retry times.
--selector strings assign jobs to particular k8s nodes, usage: "--selector=key=value" or "--selector key=value"
--sync-image string the docker image of syncImage
--sync-mode string syncMode: supports rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--tensorboard enable tensorboard
--tensorboard-image string the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
--toleration strings tolerate k8s nodes with taints, usage: "--toleration taint-key" or "--toleration all"
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory into which the code is extracted. If using syncMode, $workingDir/code contains the code (default "/root")
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job
--pprof enable cpu profile
--trace enable trace
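The flags above can be combined into a single invocation. The following is a minimal sketch, not a verified recipe: it only assembles and prints a command line, so it can be reviewed before anything is submitted. The job name, image, and namespace are hypothetical placeholders, and the resource values are arbitrary. This page does not say whether a training command must follow the flags, so none is shown here.

```shell
#!/bin/sh
# Sketch: assemble an `arena submit etjob` command from flags documented above
# and print it instead of running it.
set -eu

# Hypothetical values; replace with your own job name, image, and namespace.
JOB_NAME="elastic-demo"
IMAGE="example.com/training/elastic:latest"
NAMESPACE="default"

build_etjob_cmd() {
  # --workers is the initial worker count; --min-workers and --max-workers
  # bound it (their defaults are 1 and 1000).
  echo "arena submit etjob" \
    "--name=${JOB_NAME}" \
    "--namespace=${NAMESPACE}" \
    "--image=${IMAGE}" \
    "--gpus=1" \
    "--workers=2" \
    "--min-workers=1" \
    "--max-workers=4" \
    "--cpu=1" \
    "--memory=1Gi" \
    "--loglevel=info"
}

build_etjob_cmd
```

Printing the command first makes it easy to check that `--workers` falls between `--min-workers` and `--max-workers` before submitting the job.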
- arena submit - Submit a training job.