DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945

asdfry · 2025-01-13T11:39:39Z

Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.

Issue
When building the Docker image, the DeepSpeed installation fails, and the build log shows a warning that NVML cannot be initialized.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.
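
One way to compare the two situations is to record what each environment can see of the GPU. The following is a minimal sketch, not a confirmed diagnosis: the function name `gpu_visibility` and the three signals it checks are assumptions chosen for illustration. Running it once in a `RUN` step during the build and once inside a started container would show whether GPU visibility differs between them. It would not by itself establish the cause of the failure.

```python
import os
import shutil


def gpu_visibility():
    """Report a few signals of whether this environment can see an NVIDIA GPU.

    Hypothetical helper for illustration; the chosen signals are assumptions.
    """
    report = {
        # nvidia-smi is typically on PATH when a GPU runtime is active.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # The driver control device is usually present only when a GPU is exposed.
        "dev_nvidiactl_present": os.path.exists("/dev/nvidiactl"),
    }
    try:
        import torch  # expected to be present in the PyTorch base image

        report["torch_cuda_available"] = torch.cuda.is_available()
    except ImportError:
        # torch is not installed in this environment.
        report["torch_cuda_available"] = None
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility().items():
        print(f"{key}: {value}")
```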

Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2

Build Log
docker_build.log

Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.

Thank you.


loadams · 2025-01-13T18:37:41Z

Hi @asdfry - The errors appear to come from gcc; perhaps the gcc versions differ and that is causing the failure?

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Also, some of the warnings cluttering the output come from py-cpuinfo not being installed. Could you add it and share the log again?
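
The two details asked about here, the gcc version and whether py-cpuinfo is installed, can be collected in one step. The following is a minimal sketch under stated assumptions: the function name `collect_build_env` is hypothetical, and it assumes py-cpuinfo is importable as `cpuinfo` and that the compiler is reachable as `x86_64-linux-gnu-gcc` (the name in the error above) or as plain `gcc`.

```python
import importlib.util
import shutil
import subprocess


def collect_build_env():
    """Gather the compiler and py-cpuinfo details relevant to the build log.

    Hypothetical helper for illustration only.
    """
    info = {
        # py-cpuinfo installs a module named "cpuinfo".
        "py_cpuinfo_installed": importlib.util.find_spec("cpuinfo") is not None,
    }
    # Prefer the compiler name from the error message, then fall back to gcc.
    gcc = shutil.which("x86_64-linux-gnu-gcc") or shutil.which("gcc")
    info["gcc_path"] = gcc
    info["gcc_version"] = None
    if gcc:
        result = subprocess.run(
            [gcc, "--version"], capture_output=True, text=True, check=False
        )
        lines = result.stdout.splitlines()
        if lines:
            # The first line of `gcc --version` carries the version string.
            info["gcc_version"] = lines[0]
    return info


if __name__ == "__main__":
    for key, value in collect_build_env().items():
        print(f"{key}: {value}")
```

Running this in both base images (23.01 and 24.05) would make any gcc difference visible side by side.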

loadams · 2025-01-21T18:18:56Z

Hi @asdfry - following up on this, could you share the full Dockerfile you're using so we can reproduce the issue?

asdfry · 2025-01-21T23:30:55Z

Hello, thank you for continuing to follow up on this.
I apologize for forgetting about this issue as I’ve been occupied with other tasks.
I’m sharing the Dockerfile and the requirements.txt that can reproduce the error below.

Dockerfile

FROM nvcr.io/nvidia/pytorch:23.01-py3

SHELL ["/bin/bash", "-c"]

USER root

WORKDIR /root

ENV DEBIAN_FRONTEND=noninteractive

# Set env for torch (compute capability)
ENV TORCH_CUDA_ARCH_LIST=9.0

# Install packages
RUN apt update && \
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
    apt install -y git-lfs pdsh openssh-server net-tools tmux tree libaio-dev iputils-ping iproute2 libnvidia-compute-535

# Set for installation
ENV mlnx_image=MLNX_OFED_LINUX-23.10-3.2.2.0-ubuntu20.04-x86_64
ENV hpcx_image=hpcx-v2.18.1-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64

# Install mlnx ofed
RUN wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-23.10-3.2.2.0/$mlnx_image.tgz && \
    tar -xvf $mlnx_image.tgz && \
    rm $mlnx_image.tgz && \
    ./$mlnx_image/mlnxofedinstall --user-space-only --without-fw-update -q

# Install hpc-x
RUN wget http://www.mellanox.com/downloads/hpc/hpc-x/v2.18.1/$hpcx_image.tbz && \
    tar -xvf $hpcx_image.tbz && \
    rm $hpcx_image.tbz
ENV HPCX_HOME=/root/$hpcx_image

# Install python & pip and Install libraries
ENV DS_BUILD_CPU_ADAM=1
COPY requirements.txt requirements.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3 get-pip.py && \
    pip install --no-cache-dir -r requirements.txt

# Copy files required for training
COPY source .
COPY configs configs

requirements.txt

transformers==4.46.3
datasets==3.1.0
accelerate==1.0.1
nvitop==1.3.2
loguru==0.7.2
google-cloud-firestore==2.15.0
google-cloud-storage==2.14.0
jsonlines==4.0.0
peft==0.13.2
deepspeed==0.16.2

loadams self-assigned this Jan 13, 2025
