Commit aeadfaf: Merge pull request #2 from 4paradigm/legacy-enabled (Legacy enabled)

archlitchi authored Oct 11, 2021. 2 parents 0c55cba + 6392243, commit aeadfaf.
Showing 6 changed files with 304 additions and 264 deletions (README.md: 111 additions and 92 deletions).
English version | [中文版](README_cn.md)

- [About](#about)
- [When to use](#when-to-use)
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Preparing your GPU Nodes](#preparing-your-gpu-nodes)
- [Enabling vGPU Support in Kubernetes](#enabling-vGPU-support-in-kubernetes)
- [Running GPU Jobs](#running-gpu-jobs)
- [Uninstall](#Uninstall)
- [Tests](#Tests)
- [Issues and Contributing](#issues-and-contributing)
The **k8s vGPU scheduler** is based on 4pd-k8s-device-plugin ([4paradigm/k8s-device-plugin](https://github.com/4paradigm/k8s-device-plugin)).
4. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided for multiple students to use, and the cloud platform provides small GPU instances.
5. In the case of insufficient physical device memory, virtual device memory can be turned on, such as training of large batches and large models.

## Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below.

Then, you need to label the GPU nodes that can be scheduled by the 4pd-k8s-scheduler:

```
kubectl label nodes {nodeid} gpu=on
```

### Download

Once you have configured the options above on all the GPU nodes in your cluster, remove the existing NVIDIA device plugin for Kubernetes if it is already installed. Then clone our project and enter the deployments folder:

```
$ git clone https://github.com/4paradigm/k8s-vgpu-scheduler.git
$ cd k8s-vgpu-scheduler/deployments
```

### Set scheduler image version

Check your Kubernetes version with the following command:

```
kubectl version
```

Then set the Kubernetes scheduler image version to match your cluster's server version via the `scheduler.kubeScheduler.image` key in the `deployments/values.yaml` file. For example, if your cluster server version is 1.16.8, change the image tag to v1.16.8:

```
scheduler:
  kubeScheduler:
    image: "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.16.8"
```

### Enabling vGPU Support in Kubernetes

In the deployments folder, you can customize vGPU support by modifying the following keys under `devicePlugin.extraArgs` in the `values.yaml` file:

* `device-memory-scaling:`
  Float type, by default: 1. The ratio for NVIDIA device memory scaling; it can be greater than 1 (enabling virtual device memory, an experimental feature). For an NVIDIA GPU with *M* memory, if we set the `device-memory-scaling` argument to *S*, the vGPUs split from this GPU get `S * M` memory in total in Kubernetes with our device plugin.
* `device-split-count:`
  Integer type, by default: 10. The maximum number of tasks assigned to a single GPU device.
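For illustration, a `values.yaml` fragment passing these options might look like the sketch below. The exact `extraArgs` flag syntax is an assumption and may differ in this chart; check the shipped `values.yaml` for the authoritative format.

```yaml
devicePlugin:
  extraArgs:
    # split each physical GPU into at most 3 vGPU tasks
    - --device-split-count=3
    # advertise 2x the physical device memory (virtual device memory, experimental)
    - --device-memory-scaling=2
```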

Besides, you can customize the following keys under `scheduler.extender.extraArgs` in the `values.yaml` file:

* `default-mem:`
  Integer type, by default: 5000. The default device memory of the current task, in MB.
After configuring these optional arguments, you can enable vGPU support with the following command:
After configure those optional arguments, you can enable the vGPU support by following command:

```
$ helm install vgpu vgpu -n kube-system
```

You can verify your installation with the following command:

```
$ kubectl get pods -n kube-system
```

If the two pods `vgpu-device-plugin` and `vgpu-scheduler` are in the *Running* state, your installation was successful.

### Running GPU Jobs

## Uninstall

```
helm uninstall vgpu -n kube-system
```

## Scheduling

The current scheduling strategy is to select the GPU with the fewest assigned tasks, thus balancing the load across multiple GPUs.
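The least-loaded selection described above can be sketched as follows. This is a minimal illustration of the idea, not the scheduler's actual code; the `gpu` type and `pickLeastLoaded` function are hypothetical names.

```go
package main

import "fmt"

// gpu describes per-device load as seen by the scheduler.
type gpu struct {
	id    string
	tasks int // number of vGPU tasks currently assigned
}

// pickLeastLoaded returns the index of the GPU with the fewest
// assigned tasks. It assumes a non-empty slice.
func pickLeastLoaded(gpus []gpu) int {
	best := 0
	for i, g := range gpus {
		if g.tasks < gpus[best].tasks {
			best = i
		}
	}
	return best
}

func main() {
	gpus := []gpu{{"GPU-0", 3}, {"GPU-1", 1}, {"GPU-2", 2}}
	// GPU-1 carries the fewest tasks, so it is selected.
	fmt.Println(gpus[pickLeastLoaded(gpus)].id)
}
```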

## Benchmarks

Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance, as follows:

| Test Environment | description |
| ---------------- | :------------------------------------------------------: |
| Kubernetes version | v1.12.9 |
| Docker version | 18.09.1 |
| GPU Type | Tesla V100 |
| GPU Num | 2 |

| Test instance | description |
| ------------- | :---------------------------------------------------------: |
| nvidia-device-plugin | k8s + nvidia k8s-device-plugin |
| vGPU-device-plugin | k8s + vGPU k8s-device-plugin, without virtual device memory |
| vGPU-device-plugin(virtual device memory) | k8s + vGPU k8s-device-plugin, with virtual device memory |

Test Cases:

| test id | case | type | params |
| ------- | :-----------: | :-------: | :---------------------: |
| 1.1 | Resnet-V2-50 | inference | batch=50,size=346*346 |
| 1.2 | Resnet-V2-50 | training | batch=20,size=346*346 |
| 2.1 | Resnet-V2-152 | inference | batch=10,size=256*256 |
| 2.2 | Resnet-V2-152 | training | batch=10,size=256*256 |
| 3.1 | VGG-16 | inference | batch=20,size=224*224 |
| 3.2 | VGG-16 | training | batch=2,size=224*224 |
| 4.1 | DeepLab | inference | batch=2,size=512*512 |
| 4.2 | DeepLab | training | batch=1,size=384*384 |
| 5.1 | LSTM | inference | batch=100,size=1024*300 |
| 5.2 | LSTM | training | batch=10,size=1024*300 |

Test Result: ![img](./imgs/benchmark_inf.png)

![img](./imgs/benchmark_train.png)

To reproduce:

1. Install the vGPU-nvidia-device-plugin and configure it properly.
2. Run the benchmark job:

```
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
```

3. View the result by using `kubectl logs`:

```
$ kubectl logs [pod id]
```

## Features

- Specify the number of vGPUs created from each physical GPU.
- Limit each vGPU's device memory.
- Allow vGPU allocation by specifying device memory.
- Limit each vGPU's streaming multiprocessor (SM) usage.
- Allow vGPU allocation by specifying device core usage.
- Zero changes to existing programs.
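As an illustration of allocating vGPUs by device memory and core usage, a pod spec might look like the sketch below. The resource names `nvidia.com/gpumem` and `nvidia.com/gpucores` are assumptions based on common vGPU device-plugin conventions and may differ in this project; check the project's examples for the authoritative names.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.0-base
      resources:
        limits:
          nvidia.com/gpu: 2        # request 2 vGPUs
          nvidia.com/gpumem: 3000  # assumed: limit each vGPU to 3000 MB device memory
          nvidia.com/gpucores: 30  # assumed: limit each vGPU to 30% of SM cores
```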

## Experimental Features

- Virtual Device Memory

  The device memory of a vGPU can exceed the physical device memory of the GPU. In that case, the excess part is placed in host RAM, which has a certain impact on performance.

## Known Issues

- A100 MIG is not currently supported.
- Only computing tasks are currently supported; video codec processing is not supported.

## TODO

- Support video codec processing
- Support Multi-Instance GPUs (MIG)

## Tests

- TensorFlow 1.14.0/2.4.1