Run GPU task on the server
wangd12rpi edited this page Oct 15, 2024
On the tf02 server, we should only use GPUs 0-3; GPUs 4-7 belong to other groups.
When running a task that needs multiple GPUs in parallel, such as training or fine-tuning, use deepspeed to launch the Python script. Example: running train_lora.py on GPUs 2 and 3. The nohup prefix and the trailing & let the job keep running after the terminal is closed.
nohup deepspeed --include localhost:2,3 train_lora.py &
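For a plain single-process run that does not go through deepspeed's --include flag, the same "stay on GPUs 2 and 3" restriction can be applied from inside the script by setting CUDA_VISIBLE_DEVICES before any CUDA library loads. A minimal sketch (the variable name and values are standard CUDA behavior; the framework remapping note assumes e.g. PyTorch):

```python
import os

# Restrict this process to physical GPUs 2 and 3. Must be set before any
# CUDA-using library (e.g. PyTorch) is imported; that library will then
# see the two devices as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(visible)  # the only physical GPUs this process may touch
```

This keeps an accidental `python train_lora.py` from landing on GPUs 4-7 even without the deepspeed launcher.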
To see which tasks are using the GPUs and check GPU usage, run nvidia-smi. It will display something like this:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:01:00.0 Off | Off |
| 30% 22C P8 19W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:25:00.0 Off | Off |
| 30% 21C P8 23W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A6000 On | 00000000:41:00.0 Off | Off |
| 30% 42C P2 271W / 300W | 27536MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A6000 On | 00000000:61:00.0 Off | Off |
| 30% 41C P2 296W / 300W | 28350MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A6000 On | 00000000:81:00.0 Off | Off |
| 30% 22C P8 15W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A6000 On | 00000000:A1:00.0 Off | Off |
| 30% 22C P8 21W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A6000 On | 00000000:C1:00.0 Off | Off |
| 30% 22C P8 20W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A6000 On | 00000000:E1:00.0 Off | Off |
| 30% 21C P8 21W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 2 N/A N/A 949715 C ...2/miniconda3/envs/finenv/bin/python 27530MiB |
| 3 N/A N/A 949716 C ...2/miniconda3/envs/finenv/bin/python 28344MiB |
+---------------------------------------------------------------------------------------+
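For scripting (e.g. picking a free GPU automatically), nvidia-smi also offers machine-readable CSV output via --query-gpu. A sketch of parsing that format; the sample string below is hard-coded from the table above so the snippet runs on machines without a GPU:

```python
import csv
import io

# In practice this string would come from:
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits
# Hard-coded sample (values taken from the table above):
sample = """\
0, 3, 0
2, 27536, 100
3, 28350, 100
"""

busy = []
for row in csv.reader(io.StringIO(sample)):
    index, mem_used_mib, util_pct = (field.strip() for field in row)
    if int(util_pct) > 0:
        busy.append(int(index))

print(busy)  # GPU indices currently under load -> [2, 3]
```

The same loop can be inverted to find idle GPUs among 0-3 before submitting a job.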
To end a task, use kill with the associated PID:
kill 949715
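The same thing can be done from Python with os.kill. A small self-contained sketch, demonstrated on a throwaway sleep child process (a hypothetical stand-in for a training run) rather than a real job:

```python
import os
import signal
import subprocess

# Launch a disposable child process to act as the "task" to terminate.
child = subprocess.Popen(["sleep", "60"])

# Equivalent of `kill <PID>`: send SIGTERM, the default signal of kill.
os.kill(child.pid, signal.SIGTERM)
child.wait(timeout=5)

# On POSIX, a negative return code means "terminated by that signal".
print(child.returncode)  # -15 (SIGTERM)
```

Before killing anything, double-check the PID against the Processes table from nvidia-smi so you only stop your own group's jobs.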