studio-go-runner AWS support

This document details the installation of the studio go runner within an AWS managed EKS Kubernetes cluster. After completing the Kubernetes installation using these instructions please return to the main README.md file to continue.

If you are interested in using CPU deployments with attached EBS volumes the README at examples/aws/cpu/README.md will be of interest.

Prerequisites

  • Install and configure the AWS Command Line Interface (AWS CLI):
  • Install eksctl.
  • Load the AWS SQS Credentials
  • Deploy the runner

Install eksctl (AWS only)

If you are using Azure or GCP then options such as acs-engine and skaffold are natively supported by the cloud vendors. These tools are readily customizable and well maintained, so they are recommended.

For AWS the eksctl tool is now considered the official CLI for EKS. A full set of instructions for installing eksctl can be found at https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html. In brief, eksctl can be installed using the following steps:

pip install awscli --upgrade --user
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo rm -f /usr/local/bin/eksctl
sudo mv /tmp/eksctl /usr/local/bin/eksctl
sudo apt-get install jq

One requirement of using eksctl is that you must first subscribe to the AMI that will be used with your GPU EC2 instances. The subscription can be found at https://aws.amazon.com/marketplace/pp/B07GRHFXGM.

Install the AWS authenticator for Kubernetes

EKS clusters can be accessed using AWS IAM when generating the kubeconfig file. This is documented at https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html in the "Create kubeconfig Manually" section. An existing EKS config can be used to create a skeleton for the IAM based file.

curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/aws-iam-authenticator
chmod +x ./aws-iam-authenticator
mkdir -p $HOME/.local/bin && cp ./aws-iam-authenticator $HOME/.local/bin/aws-iam-authenticator && export PATH=$PATH:$HOME/.local/bin
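
As a quick sanity check, and assuming the PATH change above has taken effect, the authenticator can report its version and mint a token for a named cluster; the cluster name used here is only an example and should match the name defined in your cluster configuration:

aws-iam-authenticator version
# Generate a short lived token for the cluster, the cluster name below is an example only
aws-iam-authenticator token -i test-eks | jq -r '.status.token' | cut -c1-40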

AWS Cloud support for Kubernetes 1.19.x and GPU

This section discusses the use of eksctl to provision a working Kubernetes cluster onto which the GPU runner can be deployed.

The use of AWS EC2 machines requires that the AWS account has an EC2 key pair imported from your administration machine, or created, so that machines provisioned using eksctl can be accessed. More information can be found at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html.

In order to make use of StudioML environment variable based templates you should export the AWS environment variables. While doing this you should also synchronize your system clock as this is a common source of authentication issues with AWS.

export AWS_ACCOUNT=`aws sts get-caller-identity --query Account --output text`
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com
export AWS_ACCESS_KEY_ID=xxx
export AWS_SECRET_ACCESS_KEY=xxx
export AWS_DEFAULT_REGION=xxx
sudo ntpdate ntp.ubuntu.com
export KUBECONFIG=~/.kube/config
export AWS_CLUSTER_NAME=test-$USER

The cluster creation options are set using a YAML file; this example uses examples/aws/cluster.yaml, which you should modify prior to use:


eksctl create cluster -f <(stencil -input examples/aws/cluster.yaml)
2021-04-01 19:11:08 [ℹ]  eksctl version 0.44.0
2021-04-01 19:11:08 [ℹ]  using region us-west-2
2021-04-01 19:11:08 [ℹ]  subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
2021-04-01 19:11:08 [ℹ]  subnets for us-west-2b - public:192.168.32.0/19 private:192.168.128.0/19
2021-04-01 19:11:08 [ℹ]  subnets for us-west-2d - public:192.168.64.0/19 private:192.168.160.0/19
2021-04-01 19:11:09 [ℹ]  nodegroup "overhead" will use "ami-07429ae6ce65be89a" [AmazonLinux2/1.19]
2021-04-01 19:11:09 [ℹ]  using SSH public key "/home/kmutch/.ssh/id_rsa.pub" as "eksctl-test-eks-nodegroup-overhead-be:07:a0:27:44:d8:27:04:c2:ba:28:fa:8c:47:7f:09"
2021-04-01 19:11:09 [ℹ]  nodegroup "1-gpu-spot-p2-xlarge" will use "ami-01f2fad57776fe43f" [AmazonLinux2/1.19]
2021-04-01 19:11:09 [ℹ]  using SSH public key "/home/kmutch/.ssh/id_rsa.pub" as "eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-be:07:a0:27:44:d8:27:04:c2:ba:28:fa:8c:47:7f:09"
2021-04-01 19:11:09 [ℹ]  using Kubernetes version 1.19
2021-04-01 19:11:09 [ℹ]  creating EKS cluster "test-eks" in "us-west-2" region with un-managed nodes
2021-04-01 19:11:09 [ℹ]  2 nodegroups (1-gpu-spot-p2-xlarge, overhead) were included (based on the include/exclude rules)
2021-04-01 19:11:09 [ℹ]  will create a CloudFormation stack for cluster itself and 2 nodegroup stack(s)
2021-04-01 19:11:09 [ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
2021-04-01 19:11:09 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=test-eks'
2021-04-01 19:11:09 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "test-eks" in "us-west-2"
2021-04-01 19:11:09 [ℹ]  2 sequential tasks: { create cluster control plane "test-eks", 3 sequential sub-tasks: { 3 sequential sub-tasks: { wait for control plane to become ready, tag cluster, update CloudWatch logging configuration }, create addons, 2 parallel sub-tasks: { create nodegroup "overhead", create nodegroup "1-gpu-spot-p2-xlarge" } } }
2021-04-01 19:11:09 [ℹ]  building cluster stack "eksctl-test-eks-cluster"
2021-04-01 19:11:10 [ℹ]  deploying stack "eksctl-test-eks-cluster"
2021-04-01 19:11:40 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-cluster"
2021-04-01 19:12:10 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-cluster"
...
2021-04-01 19:23:10 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-cluster"
2021-04-01 19:24:10 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-cluster"
2021-04-01 19:24:11 [✔]  tagged EKS cluster (environment=test-eks)
2021-04-01 19:24:12 [ℹ]  waiting for requested "LoggingUpdate" in cluster "test-eks" to succeed
2021-04-01 19:24:29 [ℹ]  waiting for requested "LoggingUpdate" in cluster "test-eks" to succeed
2021-04-01 19:24:46 [ℹ]  waiting for requested "LoggingUpdate" in cluster "test-eks" to succeed
2021-04-01 19:24:46 [✔]  configured CloudWatch logging for cluster "test-eks" in "us-west-2" (enabled types: audit, authenticator, controllerManager & disabled types: api, scheduler)
2021-04-01 19:24:46 [ℹ]  building nodegroup stack "eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge"
2021-04-01 19:24:46 [ℹ]  building nodegroup stack "eksctl-test-eks-nodegroup-overhead"
2021-04-01 19:24:47 [ℹ]  deploying stack "eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge"
2021-04-01 19:24:47 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge"
2021-04-01 19:24:47 [ℹ]  deploying stack "eksctl-test-eks-nodegroup-overhead"
2021-04-01 19:24:47 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-nodegroup-overhead"
2021-04-01 19:25:02 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge"
2021-04-01 19:25:07 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-nodegroup-overhead"
2021-04-01 19:25:21 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge"
2021-04-01 19:25:25 [ℹ]  waiting for CloudFormation stack "eksctl-test-eks-nodegroup-overhead"
...
2021-04-01 19:27:55 [ℹ]  waiting for the control plane availability...
2021-04-01 19:27:55 [✔]  saved kubeconfig as "/home/kmutch/.kube/config"
2021-04-01 19:27:55 [ℹ]  as you are using a GPU optimized instance type you will need to install NVIDIA Kubernetes device plugin.
2021-04-01 19:27:55 [ℹ]          see the following page for instructions: https://github.com/NVIDIA/k8s-device-plugin
2021-04-01 19:27:55 [ℹ]  no tasks
2021-04-01 19:27:55 [✔]  all EKS cluster resources for "test-eks" have been created
2021-04-01 19:27:55 [ℹ]  adding identity "arn:aws:iam::613076437200:role/eksctl-test-eks-nodegroup-overhea-NodeInstanceRole-1SJ5R46STPRJK" to auth ConfigMap
2021-04-01 19:27:55 [ℹ]  adding identity "arn:aws:iam::613076437200:role/eksctl-test-eks-nodegroup-1-gpu-s-NodeInstanceRole-12WIJDK3B3AZO" to auth ConfigMap
2021-04-01 19:27:58 [ℹ]  kubectl command should work with "/home/kmutch/.kube/config", try 'kubectl get nodes'
2021-04-01 19:27:58 [✔]  EKS cluster "test-eks" in "us-west-2" region is ready

eksctl is written in Go, uses CloudFormation internally, and supports the use of YAML resources to define deployments. More information can be found at https://eksctl.io/.
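
As a hedged illustration only, a minimal eksctl ClusterConfig with one CPU node group and one GPU spot node group might look like the following sketch; the instance types, sizes, and the sketch-cluster.yaml file name are assumptions, and the examples/aws/cluster.yaml file shipped in this repository remains the template you should actually use:

# Illustrative only; see examples/aws/cluster.yaml for the template used by this document
cat > sketch-cluster.yaml <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-eks
  region: us-west-2
nodeGroups:
  - name: overhead
    instanceType: m5.large
    desiredCapacity: 2
  - name: 1-gpu-spot-p2-xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 0
    instancesDistribution:
      instanceTypes: ["p2.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
EOF
eksctl create cluster -f sketch-cluster.yaml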

When creating a cluster the credentials will be loaded into your ~/.kube/config file automatically. When using the AWS service oriented method of deployment the normally visible master will not be displayed as a node.
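
If the kubeconfig entry needs to be regenerated later, for example on another administration machine, the AWS CLI can recreate it; the cluster name below should match the name used inside your cluster.yaml:

aws eks update-kubeconfig --name test-eks --region us-west-2
kubectl get nodes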

The next step is to install the Kubernetes cluster autoscaler, which is done using the following command:


kubectl apply -f <(stencil -input examples/aws/autoscaler.yaml)
serviceaccount/cluster-autoscaler created
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
deployment.apps/cluster-autoscaler created

GPU Setup

In order to activate GPU support within the non-A100 worker clusters a DaemonSet needs to be created that will mediate between Kubernetes and the GPU resources available to pods, as shown in the following command.


kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
daemonset.apps/nvidia-device-plugin-daemonset created
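
A quick check that the device plugin DaemonSet has been created, and that its pods are being scheduled onto nodes, can be made as follows:

kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset
kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin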

If you are making use of the A100 NVIDIA cards in a cluster then helm should be used instead as follows:


sudo snap install helm --classic
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update
helm install \
    --version=0.9.0 \
    --generate-name \
    --set compatWithCPUManager=true \
    --set resources.requests.cpu=100m \
    --set resources.limits.memory=512Mi \
    --set migStrategy=single \
    nvdp/nvidia-device-plugin
helm install \
    --version=0.4.1 \
    --generate-name \
    --set migStrategy=single \
    nvgfd/gpu-feature-discovery
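
Once installed, helm can be used to confirm that both the device plugin and the GPU feature discovery releases are deployed; because --generate-name was used the release names will carry a generated suffix, so the check below matches on the chart names:

helm list --all-namespaces | grep -E 'nvidia-device-plugin|gpu-feature-discovery'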

The A100 devices also have support for additional features that you might wish to read further about at https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html and https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html.

Information about MIG support for the A100 cards can be found in a Google doc at https://docs.google.com/document/d/1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g/edit.

When machines first start they will have an allocatable resource named nvidia.com/gpu. When this resource flips from 0 to 1 the machine has become available for GPU work. The plugin YAML applied above will cause a container to be bootstrapped onto new nodes to perform the installation of the drivers and related components.

Once the cluster smoke testing below has started you will be able to run the following command to identify the new node added by the autoscaler.


kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME                                         GPU
ip-192-168-5-16.us-west-2.compute.internal   1

Cluster Smoke Testing

A test pod for validating the GPU functionality can be created using the following commands:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: gpu
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh", "-c"]
    args: ["sleep 10000"]
    resources:
      limits:
        memory: 1024Mi
        # ^ Set memory in case default limits are set low
        nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  # This toleration will allow the gpu hook to run anywhere
  #   By default this is permissive in case you have tainted your GPU nodes.
  - operator: "Exists"
EOF

Once the pod has been added the autoscaler log will display output indicating that a new node is required to fulfill the work:

$ kubectl get pods --namespace kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
aws-node-9rh9k                         1/1     Running   0          3d1h
aws-node-rjdgm                         1/1     Running   0          3d1h
cluster-autoscaler-6446d7bf4f-brvw5    1/1     Running   0          59m
coredns-6548845887-9r4kz               1/1     Running   0          3d1h
coredns-6548845887-fdkd9               1/1     Running   0          3d1h
kube-proxy-ll6jp                       1/1     Running   0          3d1h
kube-proxy-x44pm                       1/1     Running   0          3d1h
nvidia-device-plugin-daemonset-kgskl   1/1     Running   0          58m
nvidia-device-plugin-daemonset-lcdr7   1/1     Running   0          58m
$ kubectl logs --namespace kube-system cluster-autoscaler-6446d7bf4f-brvw5
...
I0405 19:59:41.604787       1 static_autoscaler.go:229] Starting main loop
I0405 19:59:41.605694       1 filter_out_schedulable.go:65] Filtering out schedulables
I0405 19:59:41.605720       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0405 19:59:41.605806       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0405 19:59:41.605818       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0405 19:59:41.605834       1 filter_out_schedulable.go:82] No schedulable pods
I0405 19:59:41.605910       1 klogx.go:86] Pod default/tf-gpu is unschedulable
I0405 19:59:41.605960       1 scale_up.go:364] Upcoming 0 nodes
I0405 19:59:41.606130       1 scale_up.go:288] Pod tf-gpu can't be scheduled on eksctl-test-eks-nodegroup-overhead-NodeGroup-18WE8ZI39VZF7, predicate checking error: Insufficient nvidia.com/gpu; predicateName=NodeResourcesFit; reasons: Insufficient nvidia.com/gpu; debugInfo=
I0405 19:59:41.606157       1 scale_up.go:437] No pod can fit to eksctl-test-eks-nodegroup-overhead-NodeGroup-18WE8ZI39VZF7
I0405 19:59:41.606171       1 waste.go:57] Expanding Node Group eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2 would waste 100.00% CPU, 98.36% Memory, 99.18% Blended
I0405 19:59:41.606205       1 scale_up.go:456] Best option to resize: eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2
I0405 19:59:41.606220       1 scale_up.go:460] Estimated 1 nodes needed in eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2
I0405 19:59:41.606258       1 scale_up.go:574] Final scale-up plan: [{eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2 0->1 (max: 10)}]
I0405 19:59:41.606287       1 scale_up.go:663] Scale-up: setting group eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2 size to 1
I0405 19:59:41.606373       1 auto_scaling_groups.go:219] Setting asg eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2 size to 1
I0405 19:59:41.606673       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"e3eb5ed4-7962-4017-94d2-dc5d71963440", APIVersion:"v1", ResourceVersion:"736946", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2 size to 1
I0405 19:59:41.757570       1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop
I0405 19:59:41.758074       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"tf-gpu", UID:"ab39c253-cd0b-4670-8ee5-3122e8ad6db1", APIVersion:"v1", ResourceVersion:"736857", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2 0->1 (max: 10)}]
...

The new node is added, resulting in:

$ kubectl get nodes
NAME                                           STATUS   ROLES    AGE     VERSION
ip-192-168-27-155.us-west-2.compute.internal   Ready    <none>   2m16s   v1.19.6-eks-49a6c0
ip-192-168-3-184.us-west-2.compute.internal    Ready    <none>   3d1h    v1.19.6-eks-49a6c0
ip-192-168-4-192.us-west-2.compute.internal    Ready    <none>   3d1h    v1.19.6-eks-49a6c0
$ kubectl get pods
NAME     READY   STATUS              RESTARTS   AGE
tf-gpu   0/1     ContainerCreating   0          5m47s

If the new node does not appear, and the autoscaler log shows that the tf-gpu pod can be scheduled onto the CloudFormation template node, there can be a number of causes. The message that indicates CloudFormation has been invoked to add the node will appear as follows:

I0409 13:57:25.371521       1 filter_out_schedulable.go:157] Pod default.tf-gpu marked as unschedulable can be scheduled on node template-node for-eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-114XB1S03EMHG-8505906760983331750-0. Ignoring in scale up.

Scaling activities can be obtained using the following commands to assist in diagnosing what is occurring within the scaler:


aws autoscaling describe-auto-scaling-groups | jq -r '..|.AutoScalingGroupName?' |grep eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge
eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-HKH7E4GCQ3GP
aws autoscaling describe-scaling-activities --auto-scaling-group-name eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-HKH7E4GCQ3GP | jq '.Activities[0]'
{
  "ActivityId": "96f5e2df-604d-9ad0-1598-2672c388e498",
  "AutoScalingGroupName": "eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-HKH7E4GCQ3GP",
  "Description": "Launching a new EC2 instance.  Status Reason: Could not launch Spot Instances. UnfulfillableCapacity - There is no capacity available that matches your request. Launching EC2 instance failed.",
  "Cause": "At 2021-04-09T18:14:53Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
  "StartTime": "2021-04-09T18:14:54.566Z",
  "EndTime": "2021-04-09T18:14:54Z",
  "StatusCode": "Failed",
  "StatusMessage": "Could not launch Spot Instances. UnfulfillableCapacity - There is no capacity available that matches your request. Launching EC2 instance failed.",
  "Progress": 100,
  "Details": "{\"Subnet ID\":\"subnet-0853b684808f1ad07\",\"Availability Zone\":\"us-west-2a\"}",
  "AutoScalingGroupARN": "arn:aws:autoscaling:us-west-2:...:autoScalingGroup:74dd6499-6426-488f-98eb-35e5bea961cc:autoScalingGroupName/eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-HKH7E4GCQ3GP"
}

The jq command was used to select the first, or latest, scaling activity. The failed scaling attempt was due to the availability zones specified having no capacity. The fix would be to modify the node group definition inside the cluster.yaml file and redeploy the cluster, or node group, in zones that have availability of the instance types being used, as sketched below.
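
A hedged sketch of recreating just the GPU node group after editing the zones in cluster.yaml is shown here; it assumes the node group name used earlier in this document and the --include filter offered by eksctl:

# Remove the existing GPU node group, draining any remaining nodes
eksctl delete nodegroup -f <(stencil -input examples/aws/cluster.yaml) --include=1-gpu-spot-p2-xlarge --approve
# Recreate the node group from the edited cluster definition
eksctl create nodegroup -f <(stencil -input examples/aws/cluster.yaml) --include=1-gpu-spot-p2-xlarge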

Once the pod is in a running state you should be able to test access to the GPU cards using the following commands:


kubectl get pods
NAME     READY   STATUS    RESTARTS   AGE
tf-gpu   1/1     Running   0          2m31s
 kubectl exec -it tf-gpu -- \
  python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'
2021-04-05 20:09:20.487509: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-04-05 20:09:20.487672: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-04-05 20:09:20.494959: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2021-04-05 20:09:20.530896: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-04-05 20:09:22.160495: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2300010000 Hz
2021-04-05 20:09:22.160965: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557b6ed99900 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-05 20:09:22.161038: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-04-05 20:09:22.164066: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-04-05 20:09:22.310055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-05 20:09:22.311028: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557b6ee20470 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-04-05 20:09:22.311068: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2021-04-05 20:09:22.311348: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-05 20:09:22.312171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-04-05 20:09:22.312236: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2021-04-05 20:09:22.315893: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2021-04-05 20:09:22.317468: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-04-05 20:09:22.317876: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-04-05 20:09:22.321294: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-04-05 20:09:22.322155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2021-04-05 20:09:22.322412: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2021-04-05 20:09:22.322565: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-05 20:09:22.323432: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-05 20:09:22.324228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-04-05 20:09:22.324287: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2021-04-05 20:09:22.772479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-05 20:09:22.772539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2021-04-05 20:09:22.772563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2021-04-05 20:09:22.772849: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-05 20:09:22.773761: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-05 20:09:22.774565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 10623 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10414284085485766931
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 12659882986103904376
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 4671966972074686993
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11139760768
locality {
  bus_id: 1
  links {
  }
}
incarnation: 4261672894508981255
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7"
]
kubectl exec -it tf-gpu -- nvidia-smi
Mon Apr  5 20:08:27 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   31C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
kubectl delete pod tf-gpu
pod "tf-gpu" deleted

Once the pod is deleted the autoscaler will begin to scale down; node scale-down events happen after 10 minutes of inactivity on the nodes:

I0405 20:21:48.936908       1 static_autoscaler.go:229] Starting main loop
I0405 20:21:48.937342       1 taints.go:77] Removing autoscaler soft taint when creating template from node
I0405 20:21:48.937626       1 filter_out_schedulable.go:65] Filtering out schedulables
I0405 20:21:48.937649       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0405 20:21:48.937657       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0405 20:21:48.937664       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0405 20:21:48.937679       1 filter_out_schedulable.go:82] No schedulable pods
I0405 20:21:48.937710       1 static_autoscaler.go:402] No unschedulable pods
I0405 20:21:48.937731       1 static_autoscaler.go:449] Calculating unneeded nodes
I0405 20:21:48.937782       1 scale_down.go:421] Node ip-192-168-27-155.us-west-2.compute.internal - nvidia.com/gpu utilization 0.000000
I0405 20:21:48.937821       1 scale_down.go:487] Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s
I0405 20:21:48.937924       1 static_autoscaler.go:492] ip-192-168-27-155.us-west-2.compute.internal is unneeded since 2021-04-05 20:11:45.759037653 +0000 UTC m=+4195.405515037 duration 10m3.177802803s
I0405 20:21:48.937958       1 static_autoscaler.go:503] Scale down status: unneededOnly=false lastScaleUpTime=2021-04-05 19:59:41.604745396 +0000 UTC m=+3471.251222503 lastScaleDownDeleteTime=2021-04-05 19:02:12.392306441 +0000 UTC m=+22.038783488 lastScaleDownFailTime=2021-04-05 19:02:12.392308118 +0000 UTC m=+22.038785370 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0405 20:21:48.937980       1 static_autoscaler.go:516] Starting scale down
I0405 20:21:48.938035       1 scale_down.go:790] ip-192-168-27-155.us-west-2.compute.internal was unneeded for 10m3.177802803s
I0405 20:21:48.938072       1 scale_down.go:1053] Scale-down: removing empty node ip-192-168-27-155.us-west-2.compute.internal
I0405 20:21:48.938274       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"e3eb5ed4-7962-4017-94d2-dc5d71963440", APIVersion:"v1", ResourceVersion:"741638", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-192-168-27-155.us-west-2.compute.internal
I0405 20:21:48.951429       1 delete.go:103] Successfully added ToBeDeletedTaint on node ip-192-168-27-155.us-west-2.compute.internal
I0405 20:21:49.206708       1 auto_scaling_groups.go:277] Terminating EC2 instance: i-01da70a9349280c94
I0405 20:21:49.206738       1 aws_manager.go:297] Some ASG instances might have been deleted, forcing ASG list refresh
I0405 20:21:49.283533       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: [eksctl-test-eks-nodegroup-1-gpu-spot-p2-xlarge-NodeGroup-165ZZ5GD15VO2 eksctl-test-eks-nodegroup-overhead-NodeGroup-18WE8ZI39VZF7]
I0405 20:21:49.397556       1 auto_scaling.go:199] 2 launch configurations already in cache
I0405 20:21:49.397810       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2021-04-05 20:22:49.397803109 +0000 UTC m=+4859.044280298
I0405 20:21:49.397981       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-192-168-27-155.us-west-2.compute.internal", UID:"14287283-b1d8-4c7f-8b3f-2d0d66581467", APIVersion:"v1", ResourceVersion:"741465", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' node removed by cluster autoscaler

After this the node will be marked as NotReady and shortly after the node will disappear:

$ kubectl get nodes
NAME                                           STATUS     ROLES    AGE    VERSION
ip-192-168-27-155.us-west-2.compute.internal   NotReady   <none>   22m    v1.19.6-eks-49a6c0
ip-192-168-3-184.us-west-2.compute.internal    Ready      <none>   3d1h   v1.19.6-eks-49a6c0
ip-192-168-4-192.us-west-2.compute.internal    Ready      <none>   3d1h   v1.19.6-eks-49a6c0
$ kubectl get nodes
NAME                                          STATUS   ROLES    AGE    VERSION
ip-192-168-3-184.us-west-2.compute.internal   Ready    <none>   3d1h   v1.19.6-eks-49a6c0
ip-192-168-4-192.us-west-2.compute.internal   Ready    <none>   3d1h   v1.19.6-eks-49a6c0

It is also possible to use the stock NVIDIA docker images to perform tests, for example:

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:latest
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/nvidia-smi created
$ kubectl logs nvidia-smi
Thu Apr  2 20:03:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P8    27W / 149W |      0MiB / 11441MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ kubectl delete pod nvidia-smi
pod "nvidia-smi" deleted

Load the AWS SQS Credentials

When using AWS, deployments have the choice of making use of the AWS hosted RabbitMQ offering or of using AWS SQS queuing.

AWS RabbitMQ

RabbitMQ is available within AWS as a managed service. The message broker can be made publicly accessible or can remain within the confines of your VPC; if you decide to make it publicly accessible then you should modify the --no-publicly-accessible option. The broker can be configured within your AWS account using the following commands:

$ export RMQ_BROKER=test-rmq
$ export RMQ_ADMIN_PASSWORD=admin_password
$ export RMQ_ADMIN_USER=admin
$ aws mq create-broker  --host-instance-type mq.m5.large --broker-name $RMQ_BROKER --engine-version 3.8.6 \
--deployment-mode SINGLE_INSTANCE --engine-type RABBITMQ  --no-publicly-accessible \
--tags "Owner=Karl Mutch" --users ConsoleAccess=true,Groups=administrator,Password=$RMQ_ADMIN_PASSWORD,Username=$RMQ_ADMIN_USER
$ export AWS_RMQ_ID=`aws mq list-brokers | jq '.BrokerSummaries[] | select(.BrokerName=="test-rmq") | .BrokerId' -r`

At this point it will take 5 or more minutes for the RMQ broker to start, so be sure to either check the AWS management console or run the aws mq list-brokers command to see when the broker has moved into the running state. Once the broker is running use the following commands to get the host details for the broker.
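
If you prefer to wait from the command line, a polling loop such as the hedged sketch below will block until the broker reports the RUNNING state; the BrokerState field is part of the describe-broker output:

# Poll every 30 seconds until the broker has finished provisioning
while [ "$(aws mq describe-broker --broker-id $AWS_RMQ_ID | jq -r '.BrokerState')" != "RUNNING" ]; do
    echo "broker not running yet, waiting"
    sleep 30
done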

export RMQ_URL=`aws mq describe-broker --broker-id $AWS_RMQ_ID | jq -r ".BrokerInstances[0].Endpoints[0]"`
export RMQ_MANAGEMENT_URL=`aws mq describe-broker --broker-id $AWS_RMQ_ID | jq -r ".BrokerInstances[0].ConsoleURL"`

# extract the protocol
rmq_proto="$(echo $RMQ_URL | grep :// | sed -e's,^\(.*://\).*,\1,g')"

# remove the protocol -- updated
rmq_url=$(echo $RMQ_URL | sed -e s,$rmq_proto,,g)

# extract the user (if any)
ignore_user="$(echo $rmq_url | grep @ | cut -d@ -f1)"

# extract the host and port -- updated
rmq_hostport=$(echo $rmq_url | sed -e s,$ignore_user@,,g | cut -d/ -f1)

# by request host without port
rmq_host="$(echo $rmq_hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
rmq_port="$(echo $rmq_hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"

# extract the path (if any)
rmq_path="$(echo $rmq_url | grep / | cut -d/ -f2-)"
export RMQ_FULL_URL="$rmq_proto$RMQ_ADMIN_USER:$RMQ_ADMIN_PASSWORD@$rmq_hostport/$rmq_path"
export AMQP_URL=$RMQ_FULL_URL

# extract the protocol
rmq_proto="$(echo $RMQ_MANAGEMENT_URL | grep :// | sed -e's,^\(.*://\).*,\1,g')"

# remove the protocol -- updated
rmq_url=$(echo $RMQ_MANAGEMENT_URL | sed -e s,$rmq_proto,,g)

# extract the user (if any)
ignore_user="$(echo $rmq_url | grep @ | cut -d@ -f1)"

# extract the host and port -- updated
rmq_hostport=$(echo $rmq_url | sed -e s,$ignore_user@,,g | cut -d/ -f1)

# by request host without port
rmq_host="$(echo $rmq_hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
rmq_port="$(echo $rmq_hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"

# extract the path (if any)
rmq_path="$(echo $rmq_url | grep / | cut -d/ -f2-)"
export RMQ_ADMIN_URL="$rmq_proto$RMQ_ADMIN_USER:$RMQ_ADMIN_PASSWORD@$rmq_hostport/$rmq_path"
export RMQ_ADMIN_BARE_URL="$rmq_proto$rmq_hostport/$rmq_path"
export AMQP_ADMIN=$RMQ_ADMIN_URL

You can now use a command such as the following to get a list of queues as a test:

curl -s -i -u $RMQ_ADMIN_USER:$RMQ_ADMIN_PASSWORD ${RMQ_ADMIN_URL}api/queues

Information concerning the AWS RabbitMQ offering can be found at, https://aws.amazon.com/blogs/aws/amazon-mq-update-new-rabbitmq-message-broker-service/.

In order to use the AWS MQ references in the runners you should set the ConfigMap entries AMQP_URL and AMQP_MGT_URL to the values in RMQ_FULL_URL and RMQ_ADMIN_URL respectively, as sketched below.
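
A hedged sketch of setting these entries follows; the ConfigMap name studioml-env is an assumption and should be replaced with the ConfigMap name that your deployment YAML actually references:

# Create or update the ConfigMap carrying the queue URLs for the runners
kubectl create configmap studioml-env \
    --from-literal=AMQP_URL=$RMQ_FULL_URL \
    --from-literal=AMQP_MGT_URL=$RMQ_ADMIN_URL \
    --dry-run=client -o yaml | kubectl apply -f -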

AWS SQS

In order to deploy the runner, SQS credentials will need to be injected into the EKS cluster. A default section must exist within the AWS credentials file; this is the section that will be selected by the runner. Using the following we can inject all of our known AWS credentials into the SQS secrets. This will not always be the best practice and you will need to determine how you will manage these credentials.

aws_sqs_cred=`cat ~/.aws/credentials | base64 -w 0`
aws_sqs_config=`cat ~/.aws/config | base64 -w 0`
kubectl apply -f <(cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: studioml-runner-aws-sqs
type: Opaque
data:
  credentials: $aws_sqs_cred
  config: $aws_sqs_config
EOF
)
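
The secret can be checked, and its payload decoded, using the following commands; be aware that this prints the plaintext credentials, so only do it on a trusted administration host:

kubectl get secret studioml-runner-aws-sqs -o jsonpath='{.data.credentials}' | base64 -d
kubectl get secret studioml-runner-aws-sqs -o jsonpath='{.data.config}' | base64 -d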

When the deployment or job YAML is applied using kubectl, a set of mount points is included that will automatically map these secrets from the etcd based secrets store for your cluster into the runner containers.

Deployment of the runner

AWS based runners come either as time limited Kubernetes Jobs on spot instances, or as Kubernetes Deployments on On-Demand EC2 instances that run the runner as a long lived daemon.

The template for a Job can be found at examples/aws/sqs_job.yaml. The example file is set to allow the job to run for 10 minutes if there are no available tasks that can be pulled from SQS and then exit. The Job will continue to run as many tasks as it can until the idle time is met. You can also specify a job count that, upon being met, will result in the job stopping.

kubectl delete job studioml-go-runner
kubectl apply -f <(stencil -input examples/aws/sqs_job.yaml)

A template for deployment can be found at examples/aws/deployment.yaml. The template depends on the environment variables that have been described throughout this document.

kubectl apply -f <(stencil -input examples/aws/deployment.yaml)

Be aware that any person or entity having access to the Kubernetes secrets store can extract these secrets unless extra measures are taken to first encrypt the secrets before injecting them into the cluster. For more information as to how to use secrets hosted through the file system on a running k8s container please refer to https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod.

Task submission using StudioML python client

The StudioML python client can be used to submit tasks to the runners. When using AWS, each artifact defined within the ~/.studioml/config.yaml file has a credentials section, as shown below:

database:
    type: s3
    endpoint: http://s3-us-west-2.amazonaws.com
    bucket: "karl-mutch-metadata"
    credentials:
        aws:
            access_key: AKGY5QHFTQNVKLUY
            secret_access_key: "eF/ey3kdfoaQddfvwrwt3kdpxmpphnGDFJHxghwtqmrCpZB"

storage:
    type: s3
    endpoint: http://s3-us-west-2.amazonaws.com
    bucket: "karl-mutch-rmq"
    credentials:
        aws:
            access_key: AKGY5QHFTQNVKLUY
            secret_access_key: "eF/ey3kdfoaQddfvwrwt3kdpxmpphnGDFJHxghwtqmrCpZB"

Using the sqs_ prefix on a queue name will then allow the request to be sent using AWS SQS.

cd ~/studio/examples/keras
pip install keras tensorflow
studio run --lifetime=30m --max-duration=20m --gpus 1 --queue=sqs_StudioML_kmutch --force-git train_mnist_keras.py
...
2021-04-12 15:14:25 INFO   studio-runner - studio run: submitted experiment 1618265665_dd2d03ff-dfb6-4f26-9996-a0ed2e40468f
2021-04-12 15:14:25 INFO   studio-runner - Added 1 experiment(s) in 0 seconds to queue sqs_kmutch
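
The submission can also be confirmed from the AWS side; the exact queue name is chosen by the StudioML client, so the hedged check below simply lists SQS queues carrying the StudioML prefix:

aws sqs list-queues --queue-name-prefix StudioML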

Manually accessing cluster master APIs

In order to retrieve the Kubernetes API Bearer token you can use the following command:

kops get secrets --type secret admin -oplaintext

Access for the administrative API can be exposed using one of the two following commands:

kops get secrets kube -oplaintext
kubectl config view --minify

More information concerning the kubelet security can be found at, https://github.com/kubernetes/kops/blob/master/docs/security.md#kubelet-api.

If you wish to pass the ability to manage your cluster to another person, or wish to migrate to running the dashboard using a browser on another machine, you can use the kops export command to pass a kubectl configuration file around. Take care however, as this will greatly increase the risk of a security incident if not done correctly. The configuration for accessing your cluster will be stored in your $KUBECONFIG file, defaulting to $HOME/.kube/config if not defined in your environment table.
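
As an alternative sketch when kops is not in use, the current kubectl context can be flattened into a single self contained file for hand off; the file name below is arbitrary and the same security caveats apply:

kubectl config view --minify --flatten > cluster-access.kubeconfig
# The recipient points kubectl at the copied file
export KUBECONFIG=$PWD/cluster-access.kubeconfig
kubectl get nodes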

If you wish to delete the cluster you can use the following command:

eksctl delete cluster -f <(stencil -input examples/aws/cluster.yaml) -w

Copyright © 2019-2021 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.