DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945
Comments
Hi @asdfry - The errors appear to come from gcc. Perhaps the gcc versions differ and that is causing the issue?
Also, some of the warnings clouding the output come from not having py-cpuinfo installed. Could you add that and share the log again?
Hi @asdfry - following up on this, could you share the full Dockerfile that you're using so we can repro?
Hello, thank you for continuing to follow up on this.
FROM nvcr.io/nvidia/pytorch:23.01-py3
SHELL ["/bin/bash", "-c"]
USER root
WORKDIR /root
ENV DEBIAN_FRONTEND=noninteractive
# Set env for torch (compute capability)
ENV TORCH_CUDA_ARCH_LIST=9.0
# Install packages
RUN apt update && \
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
apt install -y git-lfs pdsh openssh-server net-tools tmux tree libaio-dev iputils-ping iproute2 libnvidia-compute-535
# Set for installation
ENV mlnx_image=MLNX_OFED_LINUX-23.10-3.2.2.0-ubuntu20.04-x86_64
ENV hpcx_image=hpcx-v2.18.1-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64
# Install mlnx ofed
RUN wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-23.10-3.2.2.0/$mlnx_image.tgz && \
tar -xvf $mlnx_image.tgz && \
rm $mlnx_image.tgz && \
./$mlnx_image/mlnxofedinstall --user-space-only --without-fw-update -q
# Install hpc-x
RUN wget http://www.mellanox.com/downloads/hpc/hpc-x/v2.18.1/$hpcx_image.tbz && \
tar -xvf $hpcx_image.tbz && \
rm $hpcx_image.tbz
ENV HPCX_HOME=/root/$hpcx_image
# Install python & pip and Install libraries
ENV DS_BUILD_CPU_ADAM=1
COPY requirements.txt requirements.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python3 get-pip.py && \
pip install --no-cache-dir -r requirements.txt
# Copy files required for training
COPY source .
COPY configs configs
Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.
Issue
When building the Docker image, the DeepSpeed installation fails with a warning that NVML initialization is not possible.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.
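A quick way to compare the two situations is to run the same NVML check once in a RUN step of the build and once inside a running container. The sketch below is only a diagnostic aid, not part of DeepSpeed: the helper name nvml_status is hypothetical, and it assumes the NVML shared library is named libnvidia-ml.so.1, as it usually is on Linux. It loads the library with ctypes and calls nvmlInit_v2, which returns 0 on success.

```python
import ctypes


def nvml_status(lib_name="libnvidia-ml.so.1"):
    """Return (ok, message) describing whether NVML can be initialized here.

    lib_name is an assumption: the usual NVML library name on Linux.
    """
    try:
        nvml = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The library is not installed or not on the loader path.
        return False, f"could not load {lib_name}: {exc}"

    try:
        init = nvml.nvmlInit_v2
    except AttributeError:
        return False, f"{lib_name} has no nvmlInit_v2 symbol"

    rc = init()  # NVML_SUCCESS is 0; any other value is an error code
    if rc != 0:
        return False, f"nvmlInit_v2 returned error code {rc}"

    nvml.nvmlShutdown()
    return True, "NVML initialized"


if __name__ == "__main__":
    ok, msg = nvml_status()
    print(("OK: " if ok else "FAIL: ") + msg)
```

If the check fails during the build but succeeds in the container, that would suggest the build step has no access to the GPU driver. That would explain the NVML warning, though not necessarily the gcc errors in the build log.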
Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2
Build Log
docker_build.log
Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.
Thank you.