
Incompatibility issues in AWS H100 #761

Open
perezpaznoemi opened this issue Oct 29, 2024 · 1 comment

Comments

@perezpaznoemi

Hi,
We are facing an incompatibility issue and have been trying different Ubuntu versions. Running hello-world in Docker works, and CUDA, the toolkit, and the drivers all seem OK. I checked the libraries and those were fine (libnvidia-ml.so.1), but the OCI runtime fails. Any idea?

ubuntu@ip-172-31-17-183:$ docker run --gpus all -d -p 80:80 -e HF_TOKEN=ZXXXX767398115161.dkr.ecr.us-east-1.amazonaws.com/predictionaws3:latest
7ac5d43c43301058d56b098d19ab6f36683d1bd617361e677a4b4acc77be3cf3
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
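The error says nvidia-container-cli cannot find libnvidia-ml.so.1 (the driver's NVML library) when setting up the container. A quick way to check whether the library is actually registered on the host, sketched here assuming a standard Ubuntu driver install (these commands are diagnostic suggestions, not from the original thread):

```shell
# Is the driver's NVML library registered in the dynamic linker cache?
ldconfig -p | grep libnvidia-ml

# Is the NVIDIA Container Toolkit installed, and which version?
nvidia-ctk --version

# Rebuild the linker cache in case the library was installed but not registered
sudo ldconfig
```

If `ldconfig -p` shows no `libnvidia-ml.so.1` entry even though `nvidia-smi` works, the container hook is looking in a cache that was never refreshed after the driver install.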
ubuntu@ip-172-31-17-183:~$ nvidia-smi
Tue Oct 29 05:49:19 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 20C P8 10W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
ubuntu@ip-172-31-17-183:~$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub.

Dockerfile:

# libraries: libnvidia-ml.so, libnvidia-ml.so.1, libnvidia-ml.so.535.183.01, libnvidia-ml.so.550.127.05
FROM docker.io/nvidia/cuda:12.4.0-runtime-ubuntu20.04

# Install Python and pip
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /data

# Copy input files and scripts
COPY md/1_medical.docx /data/input/
COPY md/1_genetic.csv /data/input/
COPY scripts/aws_md.py /data/scripts/
COPY requirements.txt /data/

# Install required Python packages
RUN pip3 install --no-cache-dir -r requirements.txt

# Set environment variables for input files (if needed)

@elezar
Member

elezar commented Nov 16, 2024

@perezpaznoemi what is the output of:

nvidia-ctk --version
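If `nvidia-ctk` is missing or outdated, the usual fix is to (re)install the NVIDIA Container Toolkit and point Docker at its runtime. A minimal sketch, assuming Ubuntu with the toolkit's apt repository already configured (not part of elezar's reply):

```shell
# Install (or update) the toolkit, wire it into Docker, and restart the daemon
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```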
