
Incompatibility issues in AWS H100 #761

Open
perezpaznoemi opened this issue Oct 29, 2024 · 1 comment

Comments

@perezpaznoemi

Hi,
We are facing an incompatibility issue and have been trying different Ubuntu versions. Running hello-world in Docker works, and CUDA, the toolkit, and the drivers all seem OK. I checked the libraries and those were fine (libnvidia-ml.so.1), but the OCI runtime fails. Any idea?

ubuntu@ip-172-31-17-183:$ docker run --gpus all -d -p 80:80 -e HF_TOKEN=ZXXXX767398115161.dkr.ecr.us-east-1.amazonaws.com/predictionaws3:latest
7ac5d43c43301058d56b098d19ab6f36683d1bd617361e677a4b4acc77be3cf3
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
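The error says nvidia-container-cli cannot find libnvidia-ml.so.1 (the driver's NVML library) when setting up the container. A quick way to check whether the library is actually registered on the host, sketched here assuming a standard Ubuntu driver install (these commands are diagnostic suggestions, not from the original thread):

```shell
# Is the driver's NVML library registered in the dynamic linker cache?
ldconfig -p | grep libnvidia-ml

# Is the NVIDIA Container Toolkit installed, and which version?
nvidia-ctk --version

# Rebuild the linker cache in case the library was installed but not registered
sudo ldconfig
```

If `ldconfig -p` shows no `libnvidia-ml.so.1` entry even though `nvidia-smi` works, the container hook is looking in a cache that was never refreshed after the driver install.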
ubuntu@ip-172-31-17-183:~$ nvidia-smi
Tue Oct 29 05:49:19 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 20C P8 10W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
ubuntu@ip-172-31-17-183:~$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub.

Dockerfile:

# libraries: libnvidia-ml.so, libnvidia-ml.so.1, libnvidia-ml.so.535.183.01, libnvidia-ml.so.550.127.05
FROM docker.io/nvidia/cuda:12.4.0-runtime-ubuntu20.04

# Install Python and pip
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /data

# Copy input files and scripts
COPY md/1_medical.docx /data/input/
COPY md/1_genetic.csv /data/input/
COPY scripts/aws_md.py /data/scripts/
COPY requirements.txt /data/

# Install required Python packages
RUN pip3 install --no-cache-dir -r requirements.txt

# Set environment variables for input files (if needed)

@elezar
Member

elezar commented Nov 16, 2024

@perezpaznoemi what is the output of:

nvidia-ctk --version
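If `nvidia-ctk` is missing or outdated, the usual fix is to (re)install the NVIDIA Container Toolkit and point Docker at its runtime. A minimal sketch, assuming Ubuntu with the toolkit's apt repository already configured (not part of elezar's reply):

```shell
# Install (or update) the toolkit, wire it into Docker, and restart the daemon
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```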
