arena submit etjob

Submit an ETJob as a training job.

Synopsis

Submit an ETJob as a training job.

arena submit etjob [flags]
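
When submissions are scripted, it can help to build the argument list in one place. The following is a minimal sketch that only assembles the arguments for `arena submit etjob` and does not run the command. The helper name `etjob_argv`, the job name, and the image are hypothetical; the flags are the ones listed under Options.

```python
import shlex


def etjob_argv(name, image, workers=1, gpus=None):
    """Assemble (but do not run) an `arena submit etjob` argument list."""
    argv = [
        "arena", "submit", "etjob",
        f"--name={name}",
        f"--image={image}",
        f"--workers={workers}",
    ]
    # --gpus is the GPU count of each worker; omit it for CPU-only jobs.
    if gpus is not None:
        argv.append(f"--gpus={gpus}")
    return argv


# Hypothetical job name and image, for illustration only.
argv = etjob_argv("elastic-demo", "example.com/training:latest", workers=2, gpus=1)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a machine where `arena` is installed and configured.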

Options

  -a, --annotation strings          the annotations, usage: "--annotation=key=value" or "--annotation key=value"
      --config-file strings         configuration files to provide when submitting jobs, usage: "--config-file <host_path_file>:<container_path_file>"
      --cpu string                  the cpu resource to use for the training, like 1 for 1 core.
  -d, --data strings                specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
      --data-dir strings            the data dir. If you specify /data, it means mounting hostpath /data into container path /data
      --device stringArray          specify the chip vendors and counts to use as resources, such as amd.com/gpu=1 gpu.intel.com/i915=1.
  -e, --env strings                 the environment variables
      --gang                        enable gang scheduling
      --gpus int                    the GPU count of each worker to run the training.
  -h, --help                        help for etjob
      --image string                the docker image name of training job
      --image-pull-secret strings   names of the imagePullSecrets to use when pulling from a private registry, usage: "--image-pull-secret <name1>"
      --logdir string               the training logs dir (default "/training_logs")
      --max-workers int             the max worker number to run the distributed training. (default 1000)
      --memory string               the memory resource to use for the training, like 1Gi.
      --min-workers int             the min worker number to run the distributed training. (default 1)
      --name string                 override name
  -p, --priority string             priority class name
      --rdma                        enable RDMA
      --retry int                   retry times.
      --selector strings            assign jobs to particular k8s nodes, usage: "--selector=key=value" or "--selector key=value"
      --sync-image string           the docker image of syncImage
      --sync-mode string            the sync mode; supported values: rsync, hdfs, git
      --sync-source string          sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
      --tensorboard                 enable tensorboard
      --tensorboard-image string    the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
      --toleration strings          tolerate k8s nodes with taints, usage: "--toleration taint-key" or "--toleration all"
      --workers int                 the worker number to run the distributed training. (default 1)
      --working-dir string          working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
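
Several of the options above have a documented value format: `key=value` for `--annotation` and `--selector`, `<name_of_datasource>:<mount_point_on_job>` for `--data`, and one of rsync, hdfs, or git for `--sync-mode`. The sketch below formats those flags and checks them before submission. The helper `etjob_option_flags` is hypothetical, and the check that `--workers` lies between `--min-workers` and `--max-workers` is an assumption about how the elastic range is meant to be used, not something this page states.

```python
SYNC_MODES = ("rsync", "hdfs", "git")  # values listed for --sync-mode


def etjob_option_flags(workers=1, min_workers=1, max_workers=1000,
                       annotations=None, selectors=None, data=None,
                       sync_mode=None, sync_source=None):
    """Format a subset of etjob options as command-line flags."""
    # Assumption: the worker count should fall inside the elastic range.
    if not min_workers <= workers <= max_workers:
        raise ValueError(
            f"workers={workers} is outside [{min_workers}, {max_workers}]"
        )
    flags = [
        f"--workers={workers}",
        f"--min-workers={min_workers}",
        f"--max-workers={max_workers}",
    ]
    # Repeatable key=value flags.
    for key, value in (annotations or {}).items():
        flags.append(f"--annotation={key}={value}")
    for key, value in (selectors or {}).items():
        flags.append(f"--selector={key}={value}")
    # Data sources use <name_of_datasource>:<mount_point_on_job>.
    for name, mount_point in (data or {}).items():
        flags.append(f"--data={name}:{mount_point}")
    if sync_mode is not None:
        if sync_mode not in SYNC_MODES:
            raise ValueError(f"unsupported sync mode: {sync_mode!r}")
        flags.append(f"--sync-mode={sync_mode}")
        if sync_source is not None:
            flags.append(f"--sync-source={sync_source}")
    return flags


# The annotation and datasource names are hypothetical.
flags = etjob_option_flags(
    workers=3, min_workers=1, max_workers=9,
    annotations={"team": "ml"},
    data={"training-data": "/data"},
    sync_mode="git",
    sync_source="https://github.com/kubeflow/tf-operator.git",
)
print(" ".join(flags))
```

Validating locally like this surfaces a mistyped sync mode or an inverted worker range before anything is sent to the cluster.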

Options inherited from parent commands

      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job
      --pprof                    enable cpu profile
      --trace                    enable trace
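
The inherited options can be added to the same argument list as the etjob flags. As a rough sketch, the hypothetical helper `inherited_flags` below formats `--namespace`, `--loglevel`, and `--config`, and rejects a log level outside the documented set debug|info|warn|error. The namespace and kubeconfig path in the example are made up.

```python
LOG_LEVELS = ("debug", "info", "warn", "error")  # values listed for --loglevel


def inherited_flags(namespace=None, loglevel="info", kubeconfig=None):
    """Format the options inherited from parent commands."""
    if loglevel not in LOG_LEVELS:
        raise ValueError(f"unsupported log level: {loglevel!r}")
    flags = [f"--loglevel={loglevel}"]
    if namespace is not None:
        flags.append(f"--namespace={namespace}")
    # --config is only required when running out-of-cluster.
    if kubeconfig is not None:
        flags.append(f"--config={kubeconfig}")
    return flags


# Hypothetical namespace and kubeconfig path.
global_flags = inherited_flags(
    namespace="training", loglevel="debug", kubeconfig="/home/user/.kube/config"
)
print(["arena", "submit", "etjob", *global_flags])
```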

SEE ALSO

Auto generated by spf13/cobra on 5-Mar-2021