# GPU Troubleshooting Guide

This guide describes how to identify and resolve GPU issues on Amazon EC2 instances equipped with NVIDIA GPUs.

While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons, which are captured by Xid messages.
These messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.

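For example, you can search the system log for Xid events; this is a minimal sketch, adjust the log path to your operating system as described above:

```bash
# Amazon Linux
sudo grep -i xid /var/log/messages

# Ubuntu
sudo grep -i xid /var/log/syslog

# Xid events also appear in the kernel ring buffer
sudo dmesg -T | grep -i xid
```
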
| Xid | Failure               | Resolution          | Orchestrator                                            |
| --- | --------------------- | ------------------- | ------------------------------------------------------- |
| 48  | Double Bit ECC        | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
| 94  | Contained ECC error   | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
| 95  | Uncontained ECC error | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |

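An Xid message usually includes the PCI bus ID of the failing device (the exact message format can vary by driver version). The sketch below is one way to map that bus ID to a GPU index with `nvidia-smi`:

```bash
# Map the PCI bus ID reported in the Xid message to a GPU index and model
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv
```
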
# AWS ParallelCluster

## Terminate and replace instances

1. Create a reservation to prevent jobs from being scheduled on the node.

   ```bash
   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
   ```

1. Identify the jobs running on the node to terminate.

   ```bash
   squeue -w [NODE_TO_TERMINATE] -o %A -h
   ```

1. Cancel each job returned by the previous command (a one-liner covering both steps is sketched after this procedure).

   ```bash
   scancel [JOB_ID]
   ```

1. Place the node in **DRAIN**.

   ```bash
   sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
   ```

   The node will show a **DRAIN** status; the instance will then be terminated and replaced.

1. Delete the reservation.

   ```bash
   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
   ```

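If several jobs are running on the node, steps 2 and 3 can be combined. The snippet below is a minimal sketch (it assumes a GNU userland for `xargs -r`) and also shows how to check the node state and find the reservation name:

```bash
# Cancel every job currently running on the node
squeue -w [NODE_TO_TERMINATE] -o %A -h | xargs -r scancel

# Confirm the node has reached the DRAIN/DRAINED state
sinfo -n [NODE_TO_TERMINATE] -o "%N %T"

# List reservations to find the name to delete (e.g. root_1)
sudo /opt/slurm/bin/scontrol show res
```
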
## Reset GPUs

1. Create a reservation to prevent jobs from being scheduled on the node.

   ```bash
   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
   ```

1. Identify the jobs running on the node.

   ```bash
   squeue -w [NODE_TO_TERMINATE] -o %A -h
   ```

1. Cancel each job returned by the previous command.

   ```bash
   scancel [JOB_ID]
   ```

1. Reset the GPUs (a verification sketch follows this procedure).

   ```bash
   sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi -r
   ```

1. Delete the reservation.

   ```bash
   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
   ```

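The reset typically only succeeds when no process is using the GPUs, which is why the jobs are cancelled first. Afterwards you can check that the GPUs respond again; this is a minimal sketch using the same `srun` pattern as above:

```bash
# All GPUs should be listed again after the reset
sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi

# Review the ECC error counters
sudo /opt/slurm/bin/srun -w [NODE_TO_TERMINATE] nvidia-smi -q -d ECC
```
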
# Amazon SageMaker HyperPod

TBD

# Amazon EKS

TBD
