
Run a GPU task on the server

wangd12rpi edited this page Oct 15, 2024 · 2 revisions

On the tf02 server, only use GPUs 0-3; GPUs 4-7 belong to other groups.

When running a task that needs multiple GPUs in parallel, such as training or fine-tuning, use deepspeed to launch the Python script. For example, to run train_lora.py on GPUs 2 and 3 (nohup and the trailing & let the job keep running after the terminal is closed):

nohup deepspeed --include localhost:2,3 train_lora.py &
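For a plain Python script that is not launched through deepspeed, a similar restriction can be applied by setting CUDA_VISIBLE_DEVICES before any GPU framework initializes. This is a general CUDA mechanism, not specific to this server; a minimal sketch:

```python
import os

# Make only GPUs 2 and 3 visible to CUDA for this process.
# Must be set before torch/tensorflow/etc. touches the GPU;
# inside the process the two GPUs then appear as device 0 and 1.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
```

The same effect can be had from the shell with `CUDA_VISIBLE_DEVICES=2,3 python train_lora.py`.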

To see which processes are using the GPUs and check GPU utilization, run nvidia-smi. It displays something like this:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:01:00.0 Off |                  Off |
| 30%   22C    P8              19W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               On  | 00000000:25:00.0 Off |                  Off |
| 30%   21C    P8              23W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               On  | 00000000:41:00.0 Off |                  Off |
| 30%   42C    P2             271W / 300W |  27536MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   41C    P2             296W / 300W |  28350MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000               On  | 00000000:81:00.0 Off |                  Off |
| 30%   22C    P8              15W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000               On  | 00000000:A1:00.0 Off |                  Off |
| 30%   22C    P8              21W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A6000               On  | 00000000:C1:00.0 Off |                  Off |
| 30%   22C    P8              20W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A6000               On  | 00000000:E1:00.0 Off |                  Off |
| 30%   21C    P8              21W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    2   N/A  N/A    949715      C   ...2/miniconda3/envs/finenv/bin/python    27530MiB |
|    3   N/A  N/A    949716      C   ...2/miniconda3/envs/finenv/bin/python    28344MiB |
+---------------------------------------------------------------------------------------+
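If you only need the process list, nvidia-smi can also emit machine-readable CSV via its `--query-compute-apps` flag, which is easier to process in a script than the table above. A sketch of parsing that output (the sample text below mirrors the processes shown in the dump; the parsing code is an illustration, not an official API):

```python
import csv
import io

# Sample output as produced by:
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv
sample = """pid, used_memory [MiB]
949715, 27530 MiB
949716, 28344 MiB
"""

# Parse into (pid, memory_in_MiB) pairs, skipping the header row.
rows = list(csv.reader(io.StringIO(sample)))
procs = [(int(pid), int(mem.strip().split()[0])) for pid, mem in rows[1:]]
print(procs)  # [(949715, 27530), (949716, 28344)]
```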

To end a task, use kill with the associated PID:

kill 949715
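By default, kill sends SIGTERM, which asks the process to shut down. The same thing can be done from Python, sketched here with a throwaway `sleep` process standing in for a real training job:

```python
import signal
import subprocess

# Spawn a placeholder process, then terminate it the way
# `kill <PID>` would: by sending SIGTERM.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGTERM)  # same default signal as the kill command
proc.wait(timeout=5)
print(proc.returncode)  # negative value means the process died from a signal
```

If a process ignores SIGTERM, `kill -9 <PID>` (SIGKILL) forces it to stop, at the cost of skipping any cleanup.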