Commit aeadfaf: Merge pull request #2 from 4paradigm/legacy-enabled (Legacy enabled)

archlitchi authored Oct 11, 2021. 2 parents 0c55cba + 6392243, commit aeadfaf.
Showing 6 changed files with 304 additions and 264 deletions (README.md: 111 additions and 92 deletions).
English version | [中文版](README_cn.md)

- [About](#about)
- [When to use](#when-to-use)
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Preparing your GPU Nodes](#preparing-your-gpu-nodes)
- [Enabling vGPU Support in Kubernetes](#enabling-vGPU-support-in-kubernetes)
- [Running GPU Jobs](#running-gpu-jobs)
- [Uninstall](#Uninstall)
- [Tests](#Tests)
- [Issues and Contributing](#issues-and-contributing)
The **k8s vGPU scheduler** is based on 4pd-k8s-device-plugin ([4paradigm/k8s-device-plugin](https://github.com/4paradigm/k8s-device-plugin)).
4. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided for multiple students to use, and the cloud platform provides small GPU instances.
5. In the case of insufficient physical device memory, virtual device memory can be turned on, such as training of large batches and large models.

## Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below.

Then, you need to label the GPU nodes that can be scheduled by the 4pd-k8s-scheduler:

```
kubectl label nodes {nodeid} gpu=on
```

### Download

Once you have configured the options above on all the GPU nodes in your cluster, remove the existing NVIDIA device plugin for Kubernetes if it is already installed. Then clone our project and enter the deployments folder:

```
$ git clone https://github.com/4paradigm/k8s-vgpu-scheduler.git
$ cd k8s-vgpu-scheduler/deployments
```

### Set scheduler image version

Check your Kubernetes version with the following command:

```
kubectl version
```

Then set the Kubernetes scheduler image version to match your cluster's server version via the `scheduler.kubeScheduler.image` key in the `deployments/values.yaml` file. For example, if your cluster server version is 1.16.8, change the image tag to v1.16.8:

```
scheduler:
  kubeScheduler:
    image: "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.16.8"
```

### Enabling vGPU Support in Kubernetes

In the deployments folder, you can customize vGPU support by modifying the following keys under `devicePlugin.extraArgs` in the `values.yaml` file:

* `device-memory-scaling:`
  Float type, by default: 1. The ratio for NVIDIA device memory scaling; it can be greater than 1 (enabling virtual device memory, an experimental feature). For an NVIDIA GPU with *M* memory, if we set the `device-memory-scaling` argument to *S*, the vGPUs split from this GPU get `S * M` memory in total in Kubernetes with our device plugin.
* `device-split-count:`
  Integer type, by default: 10. The maximum number of tasks assigned to a single GPU device.
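For illustration, a `values.yaml` fragment passing these options might look like the sketch below. The exact `extraArgs` flag syntax is an assumption and may differ in this chart; check the shipped `values.yaml` for the authoritative format.

```yaml
devicePlugin:
  extraArgs:
    # split each physical GPU into at most 3 vGPU tasks
    - --device-split-count=3
    # advertise 2x the physical device memory (virtual device memory, experimental)
    - --device-memory-scaling=2
```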

Besides, you can customize the following keys under `scheduler.extender.extraArgs` in the `values.yaml` file:

* `default-mem:`
  Integer type, by default: 5000. The default device memory of the current task, in MB.
After configuring these optional arguments, you can enable vGPU support with the following command:
After configure those optional arguments, you can enable the vGPU support by following command:

```
$ helm install vgpu vgpu -n kube-system
```

You can verify your installation with the following command:

```
$ kubectl get pods -n kube-system
```

If the two pods `vgpu-device-plugin` and `vgpu-scheduler` are in the *Running* state, your installation was successful.

### Running GPU Jobs

## Uninstall

```
helm uninstall vgpu -n kube-system
```

## Scheduling

The current scheduling strategy is to select the GPU with the fewest assigned tasks, thus balancing the load across multiple GPUs.
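The least-loaded selection described above can be sketched as follows. This is a minimal illustration of the idea, not the scheduler's actual code; the `gpu` type and `pickLeastLoaded` function are hypothetical names.

```go
package main

import "fmt"

// gpu describes per-device load as seen by the scheduler.
type gpu struct {
	id    string
	tasks int // number of vGPU tasks currently assigned
}

// pickLeastLoaded returns the index of the GPU with the fewest
// assigned tasks. It assumes a non-empty slice.
func pickLeastLoaded(gpus []gpu) int {
	best := 0
	for i, g := range gpus {
		if g.tasks < gpus[best].tasks {
			best = i
		}
	}
	return best
}

func main() {
	gpus := []gpu{{"GPU-0", 3}, {"GPU-1", 1}, {"GPU-2", 2}}
	// GPU-1 carries the fewest tasks, so it is selected.
	fmt.Println(gpus[pickLeastLoaded(gpus)].id)
}
```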

## Benchmarks

Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance, as follows:

| Test Environment | description |
| ---------------- | :------------------------------------------------------: |
| Kubernetes version | v1.12.9 |
| Docker version | 18.09.1 |
| GPU Type | Tesla V100 |
| GPU Num | 2 |

| Test instance | description |
| ------------- | :---------------------------------------------------------: |
| nvidia-device-plugin | k8s + nvidia k8s-device-plugin |
| vGPU-device-plugin | k8s + vGPU k8s-device-plugin, without virtual device memory |
| vGPU-device-plugin(virtual device memory) | k8s + vGPU k8s-device-plugin, with virtual device memory |

Test Cases:

| test id | case | type | params |
| ------- | :-----------: | :-------: | :---------------------: |
| 1.1 | Resnet-V2-50 | inference | batch=50,size=346*346 |
| 1.2 | Resnet-V2-50 | training | batch=20,size=346*346 |
| 2.1 | Resnet-V2-152 | inference | batch=10,size=256*256 |
| 2.2 | Resnet-V2-152 | training | batch=10,size=256*256 |
| 3.1 | VGG-16 | inference | batch=20,size=224*224 |
| 3.2 | VGG-16 | training | batch=2,size=224*224 |
| 4.1 | DeepLab | inference | batch=2,size=512*512 |
| 4.2 | DeepLab | training | batch=1,size=384*384 |
| 5.1 | LSTM | inference | batch=100,size=1024*300 |
| 5.2 | LSTM | training | batch=10,size=1024*300 |

Test Result: ![img](./imgs/benchmark_inf.png)

![img](./imgs/benchmark_train.png)

To reproduce:

1. Install the vGPU-nvidia-device-plugin and configure it properly.
2. Run the benchmark job:

```
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
```

3. View the result by using `kubectl logs`:

```
$ kubectl logs [pod id]
```

## Features

- Specify the number of vGPUs created from each physical GPU.
- Limit each vGPU's device memory.
- Allow vGPU allocation by specifying device memory.
- Limit each vGPU's streaming multiprocessor (SM) usage.
- Allow vGPU allocation by specifying device core usage.
- Zero changes to existing programs.
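As an illustration of allocating vGPUs by device memory and core usage, a pod spec might look like the sketch below. The resource names `nvidia.com/gpumem` and `nvidia.com/gpucores` are assumptions based on common vGPU device-plugin conventions and may differ in this project; check the project's examples for the authoritative names.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.0-base
      resources:
        limits:
          nvidia.com/gpu: 2        # request 2 vGPUs
          nvidia.com/gpumem: 3000  # assumed: limit each vGPU to 3000 MB device memory
          nvidia.com/gpucores: 30  # assumed: limit each vGPU to 30% of SM cores
```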

## Experimental Features

- Virtual Device Memory

  The device memory of a vGPU can exceed the physical device memory of the GPU. In that case, the excess part is placed in host RAM, which has a certain impact on performance.

## Known Issues

- A100 MIG is not currently supported.
- Only computing tasks are currently supported; video codec processing is not supported.

## TODO

- Support video codec processing
- Support Multi-Instance GPUs (MIG)

## Tests

- TensorFlow 1.14.0/2.4.1