DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945

Open
asdfry opened this issue Jan 13, 2025 · 3 comments

asdfry commented Jan 13, 2025

Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.

Issue
When building the Docker image, the DeepSpeed installation fails with a warning that NVML initialization is not possible.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.
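
A possible explanation, which nobody in this thread has confirmed, is that `docker build` steps often run without the GPU devices and driver libraries that a container started with the NVIDIA runtime receives. The sketch below is one way to compare the two environments: run it once from a `RUN` step and once inside a running container, then compare the output. The function name `gpu_visibility_report` is hypothetical, and the checks (the `nvidia-smi` binary, the `/dev/nvidiactl` device node, the `NVIDIA_VISIBLE_DEVICES` variable) are only hints, not a definitive NVML test.

```python
import os
import shutil


def gpu_visibility_report(dev_dir="/dev"):
    """Collect hints about whether the NVIDIA driver is reachable from this process.

    Hypothetical helper for illustration; it does not call NVML itself.
    """
    return {
        # Is the nvidia-smi CLI available on PATH?
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Does the NVIDIA control device node exist?
        "nvidiactl_device": os.path.exists(os.path.join(dev_dir, "nvidiactl")),
        # Set by the NVIDIA container runtime when GPUs are exposed.
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        # Set explicitly in the Dockerfile from this issue.
        "TORCH_CUDA_ARCH_LIST": os.environ.get("TORCH_CUDA_ARCH_LIST"),
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If the device node is missing during the build but present in the container, the two environments differ in GPU access, which would be consistent with the NVML warning appearing only at build time.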

Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2

Build Log
docker_build.log

Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.

Thank you.

loadams self-assigned this Jan 13, 2025
Contributor

loadams commented Jan 13, 2025

Hi @asdfry - the errors appear to come from gcc. Perhaps the gcc versions are different and that is causing the issue?

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Also, some of the warnings cluttering the output come from py-cpuinfo not being installed. Could you add it and share the log again?
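
Both checks suggested above can be gathered in one place. The following is a minimal sketch, assuming it is run with the same `python3` that pip installs into, once during the build and once in the container. The helper names `compiler_version` and `module_installed` are hypothetical; `cpuinfo` is the import name of the py-cpuinfo package.

```python
import importlib.util
import shutil
import subprocess


def compiler_version(compiler="gcc"):
    """Return the first line of `<compiler> --version`, or None if it is not on PATH."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def module_installed(name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # x86_64-linux-gnu-gcc is the compiler named in the quoted error message.
    for compiler in ("gcc", "x86_64-linux-gnu-gcc"):
        print(f"{compiler}: {compiler_version(compiler)}")
    print(f"py-cpuinfo installed: {module_installed('cpuinfo')}")
```

Comparing this output between the 23.01 and 24.05 base images would show whether the compiler version actually differs.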

Contributor

loadams commented Jan 21, 2025

Hi @asdfry - following up on this, could you share the full Dockerfile that you're using so we can reproduce the problem?

Author

asdfry commented Jan 21, 2025

Hello, thank you for continuing to follow up on this.
I apologize for forgetting about this issue as I’ve been occupied with other tasks.
I’m sharing the Dockerfile and the requirements.txt that can reproduce the error below.

Dockerfile

FROM nvcr.io/nvidia/pytorch:23.01-py3

SHELL ["/bin/bash", "-c"]

USER root

WORKDIR /root

ENV DEBIAN_FRONTEND=noninteractive

# Set env for torch (compute capability)
ENV TORCH_CUDA_ARCH_LIST=9.0

# Install packages
RUN apt update && \
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
    apt install -y git-lfs pdsh openssh-server net-tools tmux tree libaio-dev iputils-ping iproute2 libnvidia-compute-535

# Set for installation
ENV mlnx_image=MLNX_OFED_LINUX-23.10-3.2.2.0-ubuntu20.04-x86_64
ENV hpcx_image=hpcx-v2.18.1-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64

# Install mlnx ofed
RUN wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-23.10-3.2.2.0/$mlnx_image.tgz && \
    tar -xvf $mlnx_image.tgz && \
    rm $mlnx_image.tgz && \
    ./$mlnx_image/mlnxofedinstall --user-space-only --without-fw-update -q

# Install hpc-x
RUN wget http://www.mellanox.com/downloads/hpc/hpc-x/v2.18.1/$hpcx_image.tbz && \
    tar -xvf $hpcx_image.tbz && \
    rm $hpcx_image.tbz
ENV HPCX_HOME=/root/$hpcx_image

# Install python & pip and Install libraries
ENV DS_BUILD_CPU_ADAM=1
COPY requirements.txt requirements.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3 get-pip.py && \
    pip install --no-cache-dir -r requirements.txt

# Copy files that are required for training
COPY source .
COPY configs configs

requirements.txt

transformers==4.46.3
datasets==3.1.0
accelerate==1.0.1
nvitop==1.3.2
loguru==0.7.2
google-cloud-firestore==2.15.0
google-cloud-storage==2.14.0
jsonlines==4.0.0
peft==0.13.2
deepspeed==0.16.2
