Linker Error with Torch 2.4.0+cu124 (ImportError: libnccl.so.2: cannot open shared object file: No such file or directory) #2125

Closed
@axbycc-mark

🐞 bug report

Affected Rule

"@rules_python//python/extensions:pip.bzl" and any py_binary rule that depends on "@pip//torch:pkg" and ends up running import torch.

Is this a regression?

Yes, the previous version in which this bug was not present was: ....

Description

Torch ships with its own NVIDIA libraries, which end up at paths such as site-packages/nvidia/nccl/lib/libnccl.so.2. Torch itself is usually installed into site-packages/torch, where the extension module site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so relies on an RPATH of $ORIGIN-relative directories to locate libnccl and the other NVIDIA libraries. See the output of readelf below.

readelf -d _C.cpython-311-x86_64-linux-gnu.so | grep 'rpath\|runpath'
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../../nvidia/cublas/lib:$ORIGIN/../../nvidia/cuda_cupti/lib:$ORIGIN/../../nvidia/cuda_nvrtc/lib:$ORIGIN/../../nvidia/cuda_runtime/lib:$ORIGIN/../../nvidia/cudnn/lib:$ORIGIN/../../nvidia/cufft/lib:$ORIGIN/../../nvidia/curand/lib:$ORIGIN/../../nvidia/cusolver/lib:$ORIGIN/../../nvidia/cusparse/lib:$ORIGIN/../../nvidia/nccl/lib:$ORIGIN/../../nvidia/nvtx/lib:$ORIGIN:$ORIGIN/lib]

However, within a Bazel project, the pip extension puts each package in its own directory, so the dynamic linker cannot find the NVIDIA libraries through these relative paths. As a result, the dynamic linker either picks up the system libraries at /usr/local/... or fails, in which case the Python process raises an ImportError.

To resolve the issue, I think Bazel would have to symlink Torch's dependencies into Torch's site-packages directory.
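As a rough illustration of that idea, the sketch below symlinks every nvidia/<component> directory found in other site-packages directories into the site-packages directory that holds torch. This is not how rules_python works today. The function name link_nvidia_packages and the directory arguments are assumptions made for the example, and a real fix would have to happen inside the repository rule rather than at runtime.

```python
import os


def link_nvidia_packages(torch_site_packages: str, other_site_packages: list[str]) -> list[str]:
    """Symlink nvidia/<component> directories next to torch.

    Returns the list of symlinks that were created. Existing entries are
    left untouched, so the function is safe to call more than once.
    """
    target_root = os.path.join(torch_site_packages, "nvidia")
    os.makedirs(target_root, exist_ok=True)
    created = []
    for site_dir in other_site_packages:
        source_root = os.path.join(site_dir, "nvidia")
        if not os.path.isdir(source_root):
            continue
        for component in sorted(os.listdir(source_root)):
            source = os.path.join(source_root, component)
            link = os.path.join(target_root, component)
            # Skip plain files and anything that is already linked or present.
            if not os.path.isdir(source) or os.path.lexists(link):
                continue
            os.symlink(os.path.abspath(source), link, target_is_directory=True)
            created.append(link)
    return created
```

After such a call, a path like site-packages/nvidia/nccl/lib/libnccl.so.2 exists relative to torch again, which is the layout the $ORIGIN-relative RPATH entries expect.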

🔬 Minimal Reproduction

In MODULE.bazel

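pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pip",
    # python_version is defined elsewhere in MODULE.bazel (3.11 here, per the cpython-311 paths in this report).
    python_version = python_version,
    requirements_lock = "//:requirements.txt",
    requirements_windows = "//:requirements_windows.txt",
)
use_repo(pip, "pip")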

Then in the requirements file

torch==2.4.0+cu124

Then try to import torch from any py_binary with a dependency on "@pip//torch:pkg".

🔥 Exception or Error

.../rules_python~~pip~pip_311_torch/site-packages/torch/__init__.py", line 294, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

🌍 Your Environment

Operating System:

Ubuntu 22.04.4 LTS

Output of bazel version:

7.3.0

Rules_python version:

0.31.0

Anything else relevant?
