# GPU Troubleshooting Guide

This guide describes how to identify and resolve GPU issues on Amazon EC2 instances equipped with NVIDIA GPUs.

While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons, which are captured by Xid messages.
These messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.

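For example, you can search the system log for Xid events; this is a minimal sketch, adjust the log path to your operating system as described above:

```bash
# Amazon Linux
sudo grep -i xid /var/log/messages

# Ubuntu
sudo grep -i xid /var/log/syslog

# Xid events also appear in the kernel ring buffer
sudo dmesg -T | grep -i xid
```
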
| Xid | Failure               | Resolution          | Orchestrator                                            |
| --- | --------------------- | ------------------- | ------------------------------------------------------- |
| 48  | Double Bit ECC        | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
| 94  | Contained ECC error   | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
| 95  | Uncontained ECC error | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |

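An Xid message usually includes the PCI bus ID of the failing device (the exact message format can vary by driver version). The sketch below is one way to map that bus ID to a GPU index with `nvidia-smi`:

```bash
# Map the PCI bus ID reported in the Xid message to a GPU index and model
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv
```
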
# AWS ParallelCluster

## Terminate and replace instances

1. Create a reservation to prevent jobs from being scheduled on the node.

   ```bash
   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
   ```

1. Identify the jobs running on the node to terminate.

   ```bash
   squeue -w [NODE_TO_TERMINATE] -o %A -h
   ```

1. Cancel each job returned by the previous command (a one-liner covering both steps is sketched after this procedure).

   ```bash
   scancel [JOB_ID]
   ```

1. Place the node in **DRAIN**.

   ```bash
   sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
   ```

   The node will show a **DRAIN** status; the instance will then be terminated and replaced.

1. Delete the reservation.

   ```bash
   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
   ```

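If several jobs are running on the node, steps 2 and 3 can be combined. The snippet below is a minimal sketch (it assumes a GNU userland for `xargs -r`) and also shows how to check the node state and find the reservation name:

```bash
# Cancel every job currently running on the node
squeue -w [NODE_TO_TERMINATE] -o %A -h | xargs -r scancel

# Confirm the node has reached the DRAIN/DRAINED state
sinfo -n [NODE_TO_TERMINATE] -o "%N %T"

# List reservations to find the name to delete (e.g. root_1)
sudo /opt/slurm/bin/scontrol show res
```
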
## Reset GPUs

1. Create a reservation to prevent jobs from being scheduled on the node.

   ```bash
   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
   ```

1. Identify the jobs running on the node.

   ```bash
   squeue -w [NODE_TO_TERMINATE] -o %A -h
   ```

1. Cancel each job returned by the previous command.

   ```bash
   scancel [JOB_ID]
   ```

1. Reset the GPUs (a verification sketch follows this procedure).

   ```bash
   sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi -r
   ```

1. Delete the reservation.

   ```bash
   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
   ```

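The reset typically only succeeds when no process is using the GPUs, which is why the jobs are cancelled first. Afterwards you can check that the GPUs respond again; this is a minimal sketch using the same `srun` pattern as above:

```bash
# All GPUs should be listed again after the reset
sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi

# Review the ECC error counters
sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi -q -d ECC
```
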
# Amazon SageMaker HyperPod

TBD

# Amazon EKS

TBD
