Add draft gpu troubles
mhuguesaws committed Apr 30, 2024
1 parent 4d24625 commit 30e6592
Showing 1 changed file with 70 additions and 0 deletions.
70 changes: 70 additions & 0 deletions troubleshooting/GPU-Troubleshooting.md
@@ -0,0 +1,70 @@
# GPU Troubleshooting Guide

This guide explains how to identify and resolve GPU issues on Amazon EC2 instances with NVIDIA GPUs.

GPUs can fail for various reasons while running High-Performance Computing or Machine Learning workloads. The NVIDIA driver reports these failures as Xid messages.
On Amazon Linux, the messages are written to `/var/log/messages`; on Ubuntu, to `/var/log/syslog` and `/var/log/kern.log`.
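
As a minimal sketch of how to pull Xid codes out of one of these logs, the helper below lists each distinct code it finds. The function name `xid_codes` is hypothetical, and the pattern assumes the usual driver line shape `NVRM: Xid (PCI:...): <code>, ...`; adjust it if your driver version logs differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each distinct Xid code found in a log file.
# Assumes driver lines shaped like "NVRM: Xid (PCI:0000:10:1c): 94, ...".
xid_codes() {
    local log_file="$1"
    grep -oE 'NVRM: Xid \([^)]*\): [0-9]+' "$log_file" \
        | awk '{print $NF}' \
        | sort -nu
}
```

For example, `sudo bash -c "$(declare -f xid_codes); xid_codes /var/log/messages"` on Amazon Linux, or the same with `/var/log/kern.log` on Ubuntu, since these logs are usually readable only by root.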

| Xid | Failure | Resolution | Orchestrator |
| --- | --------------------- | ------------------- | ------------------------------------------------------- |
| 48  | Double-bit ECC error  | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
| 94 | Contained ECC error | Reset GPUs | [AWS ParallelCluster](#reset-gpus) |
| 95 | Uncontained ECC error | Reset GPUs | [AWS ParallelCluster](#reset-gpus) |
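
The table can also be expressed as a small lookup, which may be handy in a monitoring script. This is only a sketch: the function name `xid_resolution` and the short action labels are hypothetical, and it covers only the three Xid codes listed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an Xid code to the resolution from the table above.
xid_resolution() {
    case "$1" in
        48)    echo "terminate" ;;  # Double-bit ECC error: terminate the instance
        94|95) echo "reset" ;;      # Contained / uncontained ECC error: reset GPUs
        *)     echo "unknown" ;;    # Not covered by this guide
    esac
}
```

Calling `xid_resolution 94` prints `reset`; any code outside the table prints `unknown`.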

# AWS ParallelCluster

## Terminate and replace instances

1. Create a reservation to isolate the node from being used by any jobs.
```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```

1. Cancel any job running on the node.
```bash
scancel [JOB_ID]
```

1. Place the node in the **DRAIN** state.
```bash
sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
```

The node's state changes to **DRAIN**. The instance is then terminated and replaced.


1. Delete the reservation
```bash
sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
```
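
The cancel and delete steps above need a job ID and a reservation name. As a rough sketch, the helpers below look both up for a given node. The function names `jobs_on_node` and `reservation_for_node` are hypothetical, the Slurm binaries are assumed to be on `PATH` (otherwise use `/opt/slurm/bin/`), and the reservation lookup assumes the reservation covers exactly that one node, as created in the first step.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the IDs of jobs currently allocated to a node.
jobs_on_node() {
    local node="$1"
    # -h: no header, -w: filter by node, -o %A: print the job ID only
    squeue -h -w "$node" -o %A | sort -un
}

# Hypothetical helper: print the name of the reservation whose Nodes= field
# is exactly the given node (assumes a single-node reservation).
reservation_for_node() {
    local node="$1"
    # -o prints each reservation on one line as key=value pairs
    scontrol -o show res | awk -v node="$node" '{
        name = ""; nodes = ""
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "ReservationName") name = kv[2]
            if (kv[1] == "Nodes") nodes = kv[2]
        }
        if (nodes == node) print name
    }'
}
```

The output of `jobs_on_node` can be passed to `scancel`, and the output of `reservation_for_node` to `scontrol delete res`.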

## Reset GPUs

1. Create a reservation to isolate the node from being used by any jobs.
```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_RESET]
```

1. Cancel any job running on the node.
```bash
scancel [JOB_ID]
```

1. Reset the GPUs.
```bash
sudo /opt/slurm/bin/srun -w [NODE_TO_RESET] nvidia-smi -r
```

1. Delete the reservation
```bash
sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
```
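
A GPU reset generally fails while applications are still using the device, which is why the jobs are cancelled first. As a hedged sketch, the helpers below check on the affected node whether any compute process is still attached to a GPU before you run the reset. The function names `gpu_compute_pids` and `gpus_idle` are hypothetical, and the check relies on `nvidia-smi --query-compute-apps`, so it only sees compute processes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list PIDs of compute processes attached to any GPU
# on this host. Non-numeric lines (for example, status messages) are ignored.
gpu_compute_pids() {
    nvidia-smi --query-compute-apps=pid --format=csv,noheader \
        | grep -E '^[0-9]+$' \
        | sort -un
}

# Hypothetical helper: succeed only when no compute process holds a GPU.
gpus_idle() {
    [ -z "$(gpu_compute_pids)" ]
}
```

Run on the node itself, `gpus_idle && echo "safe to reset"` prints the message only when no compute process is left.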


# Amazon SageMaker HyperPod

TBD

# Amazon EKS

TBD
