Commit 60218e4

Add workshop files.

File tree: 4 files changed, +279 -0 lines changed

Dockerfile

Lines changed: 13 additions & 0 deletions

```dockerfile
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

RUN apt-get update && apt-get install -y wget git python && apt-get clean && rm -rf /var/cache/apt
RUN apt-get -y autoremove && apt-get -y autoclean
RUN rm -rf /var/cache/apt

# install darknet and enable gpu
RUN git clone https://github.com/pjreddie/darknet.git /darknet
WORKDIR /darknet
RUN sed -i s/GPU=0/GPU=1/g Makefile
RUN sed -i s/CUDNN=0/CUDNN=1/g Makefile
RUN sed -i s/OPENMP=0/OPENMP=1/g Makefile
RUN make
```
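The three `sed` calls flip build flags in darknet's Makefile before `make` runs. The substitutions can be checked locally against a stand-in file (a minimal sketch; `Makefile.test` is just a scratch file, not part of darknet):

```sh
# stand-in for darknet's Makefile, containing only the flag lines
printf 'GPU=0\nCUDNN=0\nOPENMP=0\n' > Makefile.test

# same substitutions the Dockerfile performs, combined into one sed call
sed -i 's/GPU=0/GPU=1/g; s/CUDNN=0/CUDNN=1/g; s/OPENMP=0/OPENMP=1/g' Makefile.test
cat Makefile.test
```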

README.md

Lines changed: 208 additions & 0 deletions
# BatchAI Workshop

Batch AI provides managed infrastructure that helps data scientists with cluster
management and with the scheduling, scaling, and monitoring of AI jobs.
Batch AI works on top of `virtual machine scale sets` and `docker`.

Batch AI can run training jobs in docker containers or directly on the compute
nodes.

## Batch AI

* Cluster
* Jobs
* Azure File Share - stdout, stderr, may contain python scripts
* Azure Blob Storage - python scripts, data

![image](https://user-images.githubusercontent.com/7232635/38520388-aed7b5ec-3c10-11e8-84e2-39a0d1a17f81.png)

## Parallelizing Batch AI jobs

* The python train and test scripts define the `parallel strategy` used, **not Batch AI**.

For example:

* `CNTK` uses a `synchronous data parallel` training strategy
* `Tensorflow` uses an `asynchronous model parallel` training strategy
## Note

* Make sure `.sh` scripts have `LF` line endings - use `dos2unix` to fix them
* To enable faster communication between the nodes it's necessary to use `Intel MPI` and to have `InfiniBand` on the VM
* The `NC24r` quota (the VM size that works with `Intel MPI` and `InfiniBand`) defaults to `1 core` in any subscription, so make quota increase requests early
* There is no way to reset the SSH key on the nodes
* Do **not** put `CMD` in the dockerfile used by Batch AI. Since the container runs in **detached mode**, it will exit on `CMD`
* Error messages from within the container are not very descriptive
* Clusters take a long time to provision and deallocate
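If `dos2unix` is not available, `sed` can strip the carriage returns just as well; a minimal sketch (`train.sh` is a hypothetical script name):

```sh
# remove the trailing carriage return from every line (CRLF -> LF)
sed -i 's/\r$//' train.sh
```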
## Resources

* [Install Azure CLI 2.0 for WSL](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-apt?view=azure-cli-latest)
* [Batch AI Recipes](https://github.com/Azure/BatchAI/tree/master/recipes)
* [Azure CLI Docs](https://github.com/Azure/BatchAI/blob/master/documentation/using-azure-cli-20.md)
* [Swagger Docs for Batch AI](https://editor.swagger.io//?_ga=2.103282566.745803966.1523299917-1903704715.1523299917#/)
* [Batch AI Environment Variables](https://github.com/Azure/BatchAI/blob/master/documentation/using-batchai-environment-variables.md)
* [Setting up KeyVault](https://github.com/Azure/BatchAI/blob/master/documentation/using-azure-cli-20.md#using-keyvault-for-storing-secrets)
## Configure Azure CLI to use Batch AI

* [Azure CLI Configuration](https://github.com/Azure/BatchAI/blob/master/documentation/using-azure-cli-20.md#configuration)
## Set default subscription

```sh
az account list -o table            # find the subscription id
az account set -s <subscription id>
```
## Create resource group

```sh
az group create -n <rg name> -l eastus
```
## Create a storage account

```sh
az storage account create \
  -n <storage account name> \
  --sku Standard_LRS \
  -l eastus \
  -g <rg name>
```
## Create a file share

```sh
az storage share create \
  -n <share name> \
  --account-name <storage account name> \
  --account-key <storage account key>
```
### Get the storage account key for the file share

```sh
az storage account keys list \
  -n <storage account name> \
  -g <rg name> \
  --query "[0].value"
```
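The `--query "[0].value"` JMESPath expression picks the first key's `value` field out of the JSON array the command returns. A hedged sketch of that extraction, run against a made-up sample payload (the key values here are fake; with the real command, adding `-o tsv` yields the bare value directly):

```sh
# assumed shape of `az storage account keys list` output (values are fake)
KEYS_JSON='[{"keyName":"key1","value":"abc123=="},{"keyName":"key2","value":"def456=="}]'

# equivalent extraction using python's stdlib json module
STORAGE_KEY=$(printf '%s' "$KEYS_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)[0]["value"])')
echo "$STORAGE_KEY"
```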
## Create a directory in your file share to hold python scripts

```sh
az storage directory create \
  -s <share name> \
  -n yolo \
  --account-name <storage account name> \
  --account-key <storage account key>
```
## Upload python scripts to file share

```sh
az storage file upload \
  -s <share name> \
  --source <python script> \
  -p yolo \
  --account-name <storage account name> \
  --account-key <storage account key>
```
## Create cluster

### Create a cluster.json

The config parameters are defined by `ClusterCreateParameters` in the [batch ai swagger docs](https://editor.swagger.io//?_ga=2.103282566.745803966.1523299917-1903704715.1523299917#/).

* [List of VM images](https://docs.microsoft.com/en-us/azure/batch/batch-linux-nodes#list-of-virtual-machine-images)

### Create cluster with cluster.json config

```sh
az batchai cluster create \
  -n <cluster name> \
  -l eastus \
  -g <rg name> \
  -c cluster.json
```
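Since a malformed `cluster.json` only fails at submission time, it can be worth validating the JSON locally first; a minimal sketch using Python's stdlib:

```sh
# parse cluster.json; fails with a message if the JSON is malformed
python3 -m json.tool cluster.json > /dev/null && echo "cluster.json is valid JSON"
```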
### Create cluster without cluster.json config

```sh
az batchai cluster create \
  -n <cluster name> \
  -g <rg name> \
  -l eastus \
  --storage-account-name <storage account name> \
  --storage-account-key <storage account key> \
  -i UbuntuDSVM \
  -s Standard_NC6 \
  --min 2 \
  --max 2 \
  --afs-name <share name> \
  --afs-mount-path external \
  -u $USER \
  -k ~/.ssh/id_rsa.pub \
  -p <password>
```
### View Cluster Status

```sh
az batchai cluster show \
  -n <cluster name> \
  -g <rg name> \
  -o table
```
## Create a job

### Create job.json

* See `JobBaseProperties` in the [batch ai swagger docs](https://editor.swagger.io//?_ga=2.103282566.745803966.1523299917-1903704715.1523299917#/) for the possible parameters to use in `job.json`.

```sh
az batchai job create \
  -g <rg name> \
  -l eastus \
  -n <job name> \
  -r <cluster name> \
  -c job.json
```
### Monitor the job

```sh
az batchai job show \
  -n <job name> \
  -g <rg name> \
  -o table
```
### Stream job file output

```sh
az batchai job stream-file \
  -j <job name> \
  -n stdout.txt \
  -d stdouterr \
  -g <rg name>
```
## List the IP and port of each node in the cluster

```sh
az batchai cluster list-nodes \
  -n <cluster name> \
  -g <rg name>
```
## SSH into the VM

```sh
ssh <ip> -p <port>
```
`$AZ_BATCHAI_MOUNT_ROOT` is an environment variable that Batch AI sets for each job; its value depends on the image used to create the nodes. On Ubuntu-based images, for example, it is equal to `/mnt/batch/tasks/shared/LS_root/mounts`. You can `cd` to this directory to view the python scripts and logs.

cluster.json

Lines changed: 34 additions & 0 deletions

```json
{
    "properties": {
        "vmSize": "Standard_NC6",
        "vmPriority": "dedicated",
        "scaleSettings": {
            "autoScale": {
                "minimumNodeCount": "2",
                "maximumNodeCount": "2"
            }
        },
        "virtualMachineConfiguration": {
            "publisher": "Canonical",
            "offer": "UbuntuServer",
            "sku": "16.04.0-LTS"
        },
        "nodeSetup": {
            "mountVolumes": {
                "azureFileShares": [{
                    "accountName": "<file share account name>",
                    "azureFileUrl": "https://<file share account name>.file.core.windows.net/<file share name>",
                    "credentials": {
                        "accountKey": "<storage account key>"
                    },
                    "relativeMountPath": "external"
                }]
            }
        },
        "userAccountSettings": {
            "adminUserName": "<admin username>",
            "adminUserSshPublicKey": "<base64 encoded RSA key>",
            "adminUserPassword": "<password>"
        }
    }
}
```
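The `adminUserSshPublicKey` field above expects the public key base64-encoded. A hedged sketch of producing that value (assumes GNU coreutils `base64`, where `-w 0` disables line wrapping; the inline `'hello'` is only stand-in text so the output is predictable):

```sh
# with a real key you would encode the file instead:
#   base64 -w 0 < ~/.ssh/id_rsa.pub
printf '%s' 'hello' | base64 -w 0
echo
```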

job.json

Lines changed: 24 additions & 0 deletions

```json
{
    "$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2017-09-01-preview/job.json",
    "properties": {
        "nodeCount": 2,
        "customToolkitSettings": {
            "commandLine": "cd /darknet && ./darknet detect $AZ_BATCHAI_INPUT_SCRIPT/cfg/yolov3.cfg $AZ_BATCHAI_INPUT_SCRIPT/yolov3.weights $AZ_BATCHAI_INPUT_SCRIPT/data/dog.jpg"
        },
        "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
        "inputDirectories": [{
            "id": "SCRIPT",
            "path": "$AZ_BATCHAI_MOUNT_ROOT/external/yolo"
        }],
        "outputDirectories": [{
            "id": "MODEL",
            "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
            "pathSuffix": "Models"
        }],
        "containerSettings": {
            "imageSourceRegistry": {
                "image": "smarker/yolo-darknet"
            }
        }
    }
}
```
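Each `inputDirectories` entry with id `X` is exposed to the job as an `$AZ_BATCHAI_INPUT_X` variable, which is how the `commandLine` above finds the scripts. A hedged sketch that cross-checks the two fields locally before submission (assumes `job.json` is in the current directory; this is not part of the Batch AI tooling):

```sh
# verify every $AZ_BATCHAI_INPUT_<ID> used in commandLine has a matching inputDirectories id
python3 - <<'PY'
import json, re

props = json.load(open("job.json"))["properties"]
defined = {d["id"] for d in props.get("inputDirectories", [])}
used = set(re.findall(r"\$AZ_BATCHAI_INPUT_([A-Z0-9]+)", props["customToolkitSettings"]["commandLine"]))
missing = used - defined
assert not missing, f"undefined input directories: {missing}"
print("job.json input directories OK")
PY
```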
