add job supervisor #915

Open
wants to merge 2 commits into
base: master
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
-0.9.2
+0.9.3
9 changes: 7 additions & 2 deletions arena-artifacts/Chart.yaml
@@ -1,6 +1,6 @@
 apiVersion: v2
 name: arena-artifacts
-description: A Helm chart for installing arena dependencies
+description: A Helm chart for installing arena dependencies

 # A chart can be either an 'application' or a 'library' chart.
 #
@@ -41,7 +41,7 @@ dependencies:
 condition: cron.enabled,global.cron.enabled
 - name: et-operator
 alias: et
-version: 0.1.0
+version: 0.1.1
 repository: "@et-operator"
 condition: et.enabled,global.et.enabled
 - name: mpi-operator
@@ -59,3 +59,8 @@ dependencies:
 version: 0.1.0
 repository: "@gpu-exporter"
 condition: exporter.enabled,global.exporter.enabled
+- name: job-supervisor
+alias: job-supervisor
+version: 0.1.0
+repository: "@job-supervisor"
+condition: job-supervisor.enabled,global.job-supervisor.enabled
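The new `job-supervisor` dependency is gated by the two `condition` paths on its last line. A minimal sketch of a values override that would switch the subchart on, assuming the parent chart does not already set these keys (the file name `my-values.yaml` is hypothetical, and the parent `values.yaml` is not shown in this diff):

```yaml
# my-values.yaml (hypothetical override file, not part of this PR)
# Helm checks the condition paths in order and uses the first one that
# exists in the parent values and resolves to a boolean, so setting
# either key is enough.
job-supervisor:
  enabled: true
```

Because the dependency's `alias` is also `job-supervisor`, any other keys placed under `job-supervisor:` in the parent values would be passed down to the subchart as its `.Values`.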
Original file line number Diff line number Diff line change
@@ -155,11 +155,12 @@ spec:
 description: ReplicaStatuses is map of ReplicaType and ReplicaStatus,
 specifies the status of each replica.
 type: object
+restartCount:
+description: The number of times the Job has been restarted
+format: int32
+type: integer
 startTime:
-description: Represents time when the job was acknowledged by the
-job controller. It is not guaranteed to be set in happens-before
-order across separate operations. It is represented in RFC3339 form
-and is in UTC.
+description: Represents time when the job was acknowledged by the job controller. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC.
 format: date-time
 type: string
 toDeletePods:
@@ -170,6 +171,7 @@ spec:
 required:
 - conditions
 - replicaStatuses
+- restartCount
 type: object
 type: object
 served: true
Original file line number Diff line number Diff line change
@@ -156,16 +156,18 @@ spec:
 description: ReplicaStatuses is map of ReplicaType and ReplicaStatus,
 specifies the status of each replica.
 type: object
+restartCount:
+description: The number of times the Job has been restarted
+format: int32
+type: integer
 startTime:
-description: Represents time when the job was acknowledged by the
-job controller. It is not guaranteed to be set in happens-before
-order across separate operations. It is represented in RFC3339 form
-and is in UTC.
+description: Represents time when the job was acknowledged by the job controller. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC.
 format: date-time
 type: string
 required:
 - conditions
 - replicaStatuses
+- restartCount
 type: object
 type: object
 served: true
Original file line number Diff line number Diff line change
@@ -45,6 +45,12 @@ spec:
 spec:
 description: TrainingJobSpec defines the desired state of TrainingJob
 properties:
+backoffLimit:
+default: 6
+description: Optional number of retries to execute script.
+format: int32
+minimum: 0
+type: integer
 cleanPodPolicy:
 description: CleanPodPolicy defines the policy that whether to kill
 pods after the job completes. Defaults to None.
@@ -13170,6 +13176,14 @@ spec:
 description: Specifies the mode when launcher attach to workers. available
 option is ssh / kubexec Defaults is kubexec.
 type: string
+restartPolicy:
+default: Never
+description: Restart policy for training job One of OnFailure, Never.
+Default to Never.
+enum:
+- Never
+- OnFailure
+type: string
 slotsPerWorker:
 description: Specifies the number of slots per worker used in hostfile.
 Defaults to 1.
@@ -13258,6 +13272,10 @@ spec:
 description: ReplicaStatuses is map of ReplicaType and ReplicaStatus,
 specifies the status of each replica.
 type: object
+restartCount:
+description: The number of times the Job has been restarted
+format: int32
+type: integer
 startTime:
 description: Represents time when the job was acknowledged by the
 job controller. It is not guaranteed to be set in happens-before
@@ -13272,6 +13290,7 @@ spec:
 required:
 - conditions
 - replicaStatuses
+- restartCount
 type: object
 type: object
 served: true
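To illustrate how the new `TrainingJob` fields might be used together, here is a minimal sketch. The `apiVersion` (group taken from the RBAC rules in this PR, version assumed), the object name, and the placement of `restartPolicy` at the same level as `slotsPerWorker` are assumptions, not something this diff confirms:

```yaml
# Hypothetical TrainingJob manifest using the two new spec fields.
apiVersion: kai.alibabacloud.com/v1alpha1   # version is an assumption
kind: TrainingJob
metadata:
  name: example-job                         # hypothetical name
spec:
  restartPolicy: OnFailure   # enum from the schema: Never (default) or OnFailure
  backoffLimit: 3            # schema default is 6, minimum is 0
  # Replica specs and the other pre-existing fields are omitted here.
# The schema also adds status.restartCount (int32) and lists it under
# "required", so the controller is expected to report it in status.
```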
4 changes: 2 additions & 2 deletions arena-artifacts/charts/et-operator/Chart.yaml
@@ -15,10 +15,10 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.0
+version: 0.1.1

 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
 # It is recommended to use it with quotes.
-appVersion: "v0.1.0"
+appVersion: "v0.1.1"
23 changes: 23 additions & 0 deletions arena-artifacts/charts/job-supervisor/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions arena-artifacts/charts/job-supervisor/Chart.yaml
@@ -0,0 +1,24 @@
apiVersion: v2
name: job-supervisor
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "v1.0.0-aliyun"
49 changes: 49 additions & 0 deletions arena-artifacts/charts/job-supervisor/templates/deployment.yaml
@@ -0,0 +1,49 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: job-supervisor
    {{- include "arena.labels" . | nindent 4 }}
  name: job-supervisor
  namespace: {{ .Release.Namespace }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: job-supervisor
      {{- include "arena.labels" . | nindent 6 }}
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        {{- include "arena.labels" . | nindent 8 }}
        app: job-supervisor
    spec:
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      tolerations:
      {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      containers:
      - command:
        - /job-supervisor
        image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
        imagePullPolicy: {{ .Values.imagePullPolicy }}
        name: job-supervisor
        resources:
          limits:
            cpu: 300m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 300Mi
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: job-supervisor
      serviceAccountName: job-supervisor
      terminationGracePeriodSeconds: 30
112 changes: 112 additions & 0 deletions arena-artifacts/charts/job-supervisor/templates/rbac.yaml
@@ -0,0 +1,112 @@

apiVersion: v1
kind: ServiceAccount
metadata:
  name: job-supervisor
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: job-supervisor
  labels:
    {{- include "arena.labels" . | nindent 4 }}
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - events
  - namespaces
  - serviceaccounts
  - secrets
  - persistentvolumeclaims
  - pods
  - pods/log
  - pods/exec
  - services
  - nodes
  verbs:
  - '*'
- apiGroups:
  - ""
  - apps
  - extensions
  resources:
  - deployments
  - daemonsets
  - replicasets
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  - rolebindings
  - clusterroles
  - clusterrolebindings
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - '*'
- apiGroups:
  - apps.kubedl.io
  resources:
  - '*'
  verbs:
  - '*'
- apiGroups:
  - kubeflow.org
  resources:
  - '*'
  verbs:
  - '*'
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - '*'
  verbs:
  - '*'
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  verbs:
  - '*'
- apiGroups:
  - tensorflow.org
  resources:
  - '*'
  verbs:
  - '*'
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: job-supervisor
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: job-supervisor
subjects:
- kind: ServiceAccount
  name: job-supervisor
  namespace: {{ .Release.Namespace }}
3 changes: 3 additions & 0 deletions arena-artifacts/charts/job-supervisor/values.yaml
@@ -0,0 +1,3 @@
# Default values for job-supervisor
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
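The chart's `templates/deployment.yaml` reads `.Values.image`, `.Values.tag`, and `.Values.imagePullPolicy`, but this file defines no keys. Unless those values are supplied elsewhere (for example under `job-supervisor:` in the parent chart's values, which this diff does not show), the image reference would likely not render to something usable. A minimal sketch of what the defaults could look like; every concrete value below is a placeholder and not taken from the PR:

```yaml
# Hypothetical defaults for job-supervisor/values.yaml.
# Key names match what the deployment template reads; the values are
# illustrative assumptions only.
image: job-supervisor          # repository name appended to arena.imagePrefix
tag: v1.0.0-aliyun             # assumed to track appVersion in Chart.yaml
imagePullPolicy: IfNotPresent  # one of Always, IfNotPresent, Never
```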