Config files for setting up multitenant Kubeflow on AWS with Spot Instances.
This repo contains the supporting code for the article "How we reduced our ML training costs by 78%".
The setup provisions:
- An EKS cluster with Kubernetes 1.14 on AWS
  - Autoscaling with node group autodiscovery enabled
  - GPU nodes
    - Scale down to zero when there is no workload
    - Spot Instance purchasing enabled by default
- Kubeflow 1.0.1 running on the cluster, with only workloads that request GPUs scheduled on GPU nodes
# Set up the environment
export ENVIRONMENT=staging
export AWS_PROFILE=<your profile>
source envs/$ENVIRONMENT/variables.sh
# Create cluster
eksctl create cluster -f envs/$ENVIRONMENT/cluster-spec.yml
kubectl cluster-info # to check if the cluster is connected
# Make the scripts executable
chmod a+x *.sh
# Deploy Kubeflow
./deploy_kubeflow.sh
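Because the commands above read two files from `envs/$ENVIRONMENT/`, a small guard can catch a mistyped environment name before `eksctl` starts creating resources. The following is a minimal sketch, not part of the repo: `check_env_files` is a hypothetical helper, and it assumes the `envs/<environment>/` layout implied by the paths above.

```shell
# Sketch: verify that the per-environment files used above exist.
# check_env_files is a hypothetical helper, not something the repo provides.
check_env_files() {
  local env_name="$1"
  local base_dir="${2:-envs}"   # assumed layout: envs/<environment>/
  local status=0
  local file
  for file in variables.sh cluster-spec.yml; do
    if [ ! -f "$base_dir/$env_name/$file" ]; then
      echo "missing: $base_dir/$env_name/$file" >&2
      status=1
    fi
  done
  return "$status"
}
```

One way to use it is to run `check_env_files "$ENVIRONMENT" || exit 1` right after exporting `ENVIRONMENT`, so that a missing file stops the run before any AWS call is made.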
The following credentials and tools are required:
- AWS programmatic access keys for the CLI
  - Credentials used to create and manage resources on AWS
- eksctl
  - Creates the cluster
- aws-cli
  - eksctl dependency
- aws-iam-authenticator
  - eksctl dependency
- kubectl
  - Manages the Kubernetes cluster
- Helm 3
  - Deploys Helm charts
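A quick way to confirm that the tools listed above are installed is to look each one up on `PATH`. This is a minimal sketch: `missing_tools` is a hypothetical helper, and the binary names in the usage note (`aws` for aws-cli, `helm` for Helm 3) are assumptions about a typical installation.

```shell
# Sketch: print the name of every given command that is not found on PATH.
# missing_tools is a hypothetical helper, not something the repo provides.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Calling `missing_tools eksctl aws aws-iam-authenticator kubectl helm` prints nothing when everything is installed, and otherwise prints one missing tool per line.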
The cluster that gets spun up will have the following specs:
- ng-1
  - instance type: m5a.2xlarge
  - min nodes: 0
  - max nodes: 3
  - volume: 100 GB
- ng-2
  - instance type: m5a.2xlarge
  - min nodes: 0
  - max nodes: 10
  - volume: 20 GB
- 1-gpu-spot-p2-xlarge
  - instance type: p2.xlarge
  - min nodes: 0
  - max nodes: 10
  - max price: $1.20/hour
- 1-gpu-spot-p3-2xlarge
  - instance type: p3.2xlarge
  - min nodes: 0
  - max nodes: 10
  - max price: $1.20/hour
- 4-gpu-spot-p3-8xlarge
  - instance type: p3.8xlarge
  - min nodes: 0
  - max nodes: 4
  - max price: on-demand price
- 8-gpu-spot-p3dn-24xlarge (disabled by default)
  - instance type: p3dn.24xlarge
  - min nodes: 0
  - max nodes: 1
  - max price: $11/hour
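To illustrate how one of the Spot node groups above might be expressed in an eksctl cluster spec, the sketch below prints a single `nodeGroups` entry. `render_spot_nodegroup` is a hypothetical helper, and the field layout is an assumption based on eksctl's `ClusterConfig` schema (`minSize`, `maxSize`, `instancesDistribution`); the repo's actual `cluster-spec.yml` may be structured differently and will contain more settings (labels, taints, autoscaler tags).

```shell
# Sketch: print one eksctl nodeGroups entry for a Spot-only node group.
# render_spot_nodegroup is a hypothetical helper; the repo's real spec may differ.
render_spot_nodegroup() {
  local name="$1"
  local instance_type="$2"
  local max_nodes="$3"
  local max_price="$4"   # USD per instance-hour
  cat <<EOF
  - name: ${name}
    minSize: 0
    desiredCapacity: 0
    maxSize: ${max_nodes}
    instancesDistribution:
      instanceTypes: ["${instance_type}"]
      maxPrice: ${max_price}
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
EOF
}

# Example: the 1-gpu-spot-p2-xlarge group from the list above.
render_spot_nodegroup 1-gpu-spot-p2-xlarge p2.xlarge 10 1.2
```

Setting both on-demand values to 0 asks the Auto Scaling group to use Spot capacity only, and `minSize: 0` is what allows the group to scale down to zero when no GPU workload is pending.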