- data: Contains some sample data
- dev_notebooks: Contains all Jupyter notebook files which are used for rapid prototyping
- doc: Contains diverse documentation files
- infra: Contains all files which are used to create the environment for training deep learning models on the LRZ AI System
- src: Contains all Python source files of the project
- tests: Contains all test files used to perform unit testing
The overall goal of this project is to develop a deep learning algorithm that is capable of predicting whether a 3D model is printable or not. In this context, printable refers to additive manufacturing.
- Goal Definition
- Goal Measurement
- How is the algorithm going to be consumed in the end?
- Technical requirements (Expected Results, Constraints)?
As a first step, publicly available datasets (i.e. open source data) are used. To be specific, the following two resources are considered as the main sources:
- Describe the ETL (Extract, Transform, Load)
- Describe the results of the Data Exploration
- What kind of data?
- Which range?
- Sample Data
- How is the data cleaned?
- Is all data used in the end?
- Does the Data match the requirements (from problem description)?
- Thingi10K
- ABC Dataset
- LRZ AI System
- Docker
- Enroot
For more detailed information about the infrastructure and how to use it, please consult the section about the LRZ AI System.
- PyTorch
- Numpy
- Matplotlib
- Request only the number of GPUs you are actually using (i.e. do not allocate more than one GPU if you are not performing parallel training using Horovod); see the example after this list
- If your training does not have mixed precision enabled, please select only those resources with P100 cards
- The time limit is restricted to a maximum of 24 hours, so if the job runs longer than the available time slot, save checkpoints frequently
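For instance, a minimal allocation that respects the single-GPU guideline could look as follows (a sketch only; the partition name dgx-1-p100 is taken from the salloc example further below and may need to be adapted to your target machine):
$ salloc -p dgx-1-p100 --ntasks=1 --gres=gpu:1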
First, make sure that you have established a VPN tunnel to the LRZ. This can be done using any VPN client (e.g. Cisco AnyConnect) by providing the domain https://asa-cluster.lrz.de/ once you are asked for it. Afterwards, use your TUM credentials to log in.
Once the VPN connection is established successfully, use SSH to log in to the AI System.
$ ssh [email protected]
An overview of the available GPU instances within the LRZ AI System is given below:
One of the most important pieces of information provided by the table above is the "SLURM Partition", as it is needed in order to submit a job to the desired machine.
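For example, the partition name is what you later pass to the SLURM commands via the -p flag; to check the state of one specific partition (shown here for the dgx-1-p100 partition that also appears in the examples below, an assumption about your target machine), you might run:
$ sinfo -p dgx-1-p100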
This section provides some useful commands for using SLURM on the AI System.
Show all available queues
$ sinfo
Show reservation of queues (if there is any)
$ squeue
Create allocation of resources
$ salloc
Example:
$ salloc -p dgx-1-p100 --ntasks=8 --gres=gpu:8
Cancel allocation of resources
$ scancel <JobID>
Run job
$ srun
Example:
$ srun --pty enroot start --mount ./data:/mnt/data ubuntu.sqsh bash
Submit batch job into SLURM pipeline
$ sbatch
Example:
$ sbatch script.sbatch
The LRZ AI System is a container-based solution. This means that, in order to make use of any GPU instance, we first have to deploy a container on the AI System from which a SLURM job can be submitted.
To this end, the enroot container runtime is used to deploy a container on the LRZ AI System, as users are not granted root access to the file system. In contrast to Docker, using enroot means we are able to start a rootless container, within which we have root permissions, but not outside of it.
The following steps show how to deploy a container on the LRZ AI System:
- Import a container from a repository (e.g. Docker Hub)
$ enroot import docker://ubuntu
The result of this step is an image which is called "ubuntu.sqsh".
- Start an enroot container using the recently created image
$ enroot start ubuntu.sqsh
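If the container needs access to data stored on the host, a directory can be mounted when starting it. The following is a sketch reusing the ./data mount from the srun example above; adapt the paths to your setup:
$ enroot start --mount ./data:/mnt/data ubuntu.sqsh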
The current data storage within the LRZ AI System is restricted to the following properties:
- Disk quota: 150 GB
- File quota: 200,000 files
If this needs to be enlarged, a service request has to be created at the LRZ service desk.
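To get a rough impression of how much of these quotas is currently in use, generic shell commands such as the following can be used (an assumption, not an LRZ-specific tool):
$ du -sh ~                  # approximate disk usage of the home directory
$ find ~ -type f | wc -l    # approximate number of files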
Generally, there are two different ways of submitting jobs to the SLURM pipeline:
- Interactive Jobs (i.e. interact with the terminal)
- Batch Jobs (i.e. no interaction with the terminal)
Interactive Jobs Using this option enables getting a console output of the job which is submitted to the SLURM pipeline. For example, if we submit a job which trains a neural network, we would see the training progress in the CLI. This is basically done by first allocating a resource and then submitting a containerized job (for help, see the SLURM commands section above; a sketch is shown below).
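A minimal sketch of this workflow, combining the salloc and srun examples from the SLURM commands section above (partition, image name and mount path are taken from those examples and may need to be adapted):
$ salloc -p dgx-1-p100 --ntasks=1 --gres=gpu:1
$ srun --pty enroot start --mount ./data:/mnt/data ubuntu.sqsh bash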
Batch Jobs Using this option does not provide a console output. However, within the .sbatch script it can be declared where to write the output and errors. A sample script is provided below:
#!/bin/bash
#SBATCH -N 1
#SBATCH -p dgx
#SBATCH --gres=gpu:8
#SBATCH --ntasks=8
#SBATCH -o enroot_test.out
#SBATCH -e enroot_test.err
srun --container-mounts=./data-test:/mnt/data-test --container-image='horovod/horovod+0.16.4-tf1.12.0-torch1.1.0-mxnet1.4.1-py3.5' \
python script.py --epochs 55 --batch-size 512
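Once the script above has been submitted (e.g. via sbatch script.sbatch as shown in the SLURM commands section), its console output and errors can be followed in the files declared via -o and -e, for example:
$ tail -f enroot_test.out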
Once the deep learning container has been started on the LRZ AI System, you will get an output similar to the one presented below (please be patient, this might take some seconds):
Cloning into 'TUM-DI-LAB'...
remote: Enumerating objects: 3151, done.
remote: Counting objects: 100% (233/233), done.
remote: Compressing objects: 100% (172/172), done.
remote: Total 3151 (delta 122), reused 135 (delta 61), pack-reused 2918
Receiving objects: 100% (3151/3151), 108.13 MiB | 6.50 MiB/s, done.
Resolving deltas: 100% (2100/2100), done.
Updating files: 100% (145/145), done.
Branch 'dev' set up to track remote branch 'dev' from 'origin'.
Switched to a new branch 'dev'
entrypoint.sh: line 9: cd: /workspace/mount_dir/: No such file or directory
root@9fdaa7f9b957:/workspace# [2021-06-23 09:06:13 +0000] [339] [INFO] Starting gunicorn 20.1.0
[2021-06-23 09:06:13 +0000] [339] [INFO] Listening at: http://0.0.0.0:5000 (339)
[2021-06-23 09:06:13 +0000] [339] [INFO] Using worker: sync
[2021-06-23 09:06:13 +0000] [341] [INFO] Booting worker with pid: 341
The things happening here are (a rough sketch of the underlying entrypoint logic follows this list):
- The latest version of the TUM-DI-LAB github repository gets cloned into the container
- The working branch is changed to dev
- The MLflow UI gets started as a background process (please do not kill this job, otherwise the UI is not accessible anymore)
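The following is a minimal sketch of what the container's entrypoint.sh roughly does, based on the steps listed above; the repository URL and the exact MLflow invocation are assumptions and may differ from the actual script:
#!/bin/bash
# Hypothetical sketch of entrypoint.sh (the actual script may differ)
git clone <TUM-DI-LAB repository URL>     # clone the latest version of the repository
cd TUM-DI-LAB
git checkout dev                          # switch the working branch to dev
mlflow ui --host 0.0.0.0 --port 5000 &    # serve the MLflow UI in the background on port 5000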
Once you see an output similar to the one shown above, please hit ENTER in order to get to the well-known bash CLI.
From there on, please follow these steps in order to open the MLflow UI in your local web browser:
- Get to know the IP address of the machine you are working on. To do so, execute the following command (a small helper snippet for filtering the output is shown after this list):
$ ifconfig
Please look for the IP address which starts with 10.XXX.XXX.XXX
- Start a web browser on your local machine and open the socket comprised of the IP address and port 5000. A respective example can be found below:
http://10.195.15.242:5000/
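If the relevant interface is hard to spot in the full ifconfig output, a quick (assumed, not LRZ-specific) way to filter for the 10.x address is:
$ ifconfig | grep "inet 10\."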
The most important terms and expressions are defined below:
| Term | Definition |
| --- | --- |
| NN | Neural Network |
| CLI | Command Line Interface |
| Name | E-Mail |
| --- | --- |
| Bouziane, Nouhayla | [email protected] |
| Ebid, Ahmed | [email protected] |
| Srinivas, Aditya Sai | [email protected] |
| Bok, Felix | [email protected] |
| Kiechle, Johannes | [email protected] |
| Lux, Kerstin (Supervisor TUM) | [email protected] |
| Cesur, Guelce (Supervisor VW) | [email protected] |
| Kleshchonok, Andrii (Supervisor VW) | [email protected] |
| Danielz, Marcus (Supervisor VW) | [email protected] |