Description
🐞 bug report
Affected Rule
"@rules_python//python/extensions:pip.bzl", and any py_binary rule that depends on "@pip//torch:pkg" and runs import torch.
Is this a regression?
Yes, the previous version in which this bug was not present was: ....
Description
Torch ships with its own NVIDIA shared libraries, which end up at, for example, site-packages/nvidia/nccl/lib/libnccl.so.2. Torch itself is usually installed into site-packages/torch, and the library site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so there locates libnccl through the rpath mechanism. See the output of readelf below.
readelf -d _C.cpython-311-x86_64-linux-gnu.so | grep 'rpath\|runpath'
0x000000000000000f (RPATH) Library rpath: [$ORIGIN/../../nvidia/cublas/lib:$ORIGIN/../../nvidia/cuda_cupti/lib:$ORIGIN/../../nvidia/cuda_nvrtc/lib:$ORIGIN/../../nvidia/cuda_runtime/lib:$ORIGIN/../../nvidia/cudnn/lib:$ORIGIN/../../nvidia/cufft/lib:$ORIGIN/../../nvidia/curand/lib:$ORIGIN/../../nvidia/cusolver/lib:$ORIGIN/../../nvidia/cusparse/lib:$ORIGIN/../../nvidia/nccl/lib:$ORIGIN/../../nvidia/nvtx/lib:$ORIGIN:$ORIGIN/lib]
However, within a Bazel project, the pip extension puts each package in its own directory, so the dynamic loader cannot find the NVIDIA libraries through these relative paths. As a result, the loader either picks up the system libraries under /usr/local/... or fails, in which case the Python process raises an ImportError.
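The path arithmetic behind this failure can be sketched in a few lines of Python. This is only an illustration, not how the loader is implemented: it assumes the object that needs libnccl sits in site-packages/torch/lib (the depth at which `$ORIGIN/../../nvidia/nccl/lib` lands on site-packages/nvidia/nccl/lib), and the file name `libtorch_cuda.so` and the directory names `pip_torch` and `pip_nccl` are hypothetical stand-ins for the real repository layout.

```python
import os
import tempfile
from pathlib import Path

# One nvidia entry from the RPATH shown above, plus $ORIGIN itself.
NCCL_RPATH = "$ORIGIN/../../nvidia/nccl/lib:$ORIGIN"


def expand_rpath(lib_path, rpath):
    """Expand $ORIGIN in a colon-separated rpath, relative to the library's directory."""
    origin = str(Path(lib_path).parent)
    dirs = []
    for entry in rpath.split(":"):
        if entry:
            expanded = entry.replace("$ORIGIN", origin)
            # normpath collapses ".." lexically; a simplification for readability.
            dirs.append(Path(os.path.normpath(expanded)))
    return dirs


def find_library(lib_path, rpath, soname):
    """Return the first rpath directory entry that contains soname, or None."""
    for directory in expand_rpath(lib_path, rpath):
        candidate = directory / soname
        if candidate.exists():
            return candidate
    return None


def make_file(path):
    """Create an empty placeholder file, including parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()
    return path


def demo():
    """Compare a flat site-packages with one directory per package."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Flat layout: torch and nvidia share one site-packages.
        flat_lib = make_file(root / "flat/site-packages/torch/lib/libtorch_cuda.so")
        make_file(root / "flat/site-packages/nvidia/nccl/lib/libnccl.so.2")
        # Split layout: each package has its own site-packages (hypothetical names).
        split_lib = make_file(root / "pip_torch/site-packages/torch/lib/libtorch_cuda.so")
        make_file(root / "pip_nccl/site-packages/nvidia/nccl/lib/libnccl.so.2")
        flat = find_library(flat_lib, NCCL_RPATH, "libnccl.so.2")
        split = find_library(split_lib, NCCL_RPATH, "libnccl.so.2")
        return flat is not None, split is not None


if __name__ == "__main__":
    print(demo())
```

In the flat layout the relative entry resolves to the directory holding libnccl.so.2; in the split layout it resolves to a path inside the torch package's own directory, where nothing exists.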
To resolve the issue, I think Bazel would have to symlink Torch's dependencies into Torch's site-packages directory.
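A rough sketch of that symlinking idea, in plain Python rather than as a repository rule, might look like the following. It is an assumption about one possible shape of the fix, not what rules_python does: `link_dependency` is a hypothetical helper, the `pip_*` directory names and `libcublas.so.12` are illustrative, and it creates real directories while symlinking only files so that several packages can share the top-level nvidia/ directory.

```python
import os
import tempfile


def link_dependency(dep_site_packages, torch_site_packages):
    """Mirror dep_site_packages into torch_site_packages using file symlinks.

    Directories are created for real so that several packages can contribute
    to a shared directory such as nvidia/. Existing entries are left alone.
    """
    src_root = os.path.abspath(dep_site_packages)
    created = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(torch_site_packages, rel))
        os.makedirs(out_dir, exist_ok=True)
        for name in filenames:
            link = os.path.join(out_dir, name)
            if not os.path.lexists(link):
                os.symlink(os.path.join(dirpath, name), link)
                created.append(link)
    return created


def demo():
    """Link two hypothetical nvidia packages into a torch site-packages."""
    with tempfile.TemporaryDirectory() as root:
        torch_sp = os.path.join(root, "pip_torch", "site-packages")
        os.makedirs(os.path.join(torch_sp, "torch", "lib"))
        deps = {
            "pip_nccl": "nvidia/nccl/lib/libnccl.so.2",
            "pip_cublas": "nvidia/cublas/lib/libcublas.so.12",
        }
        for repo, rel in deps.items():
            dep_sp = os.path.join(root, repo, "site-packages")
            path = os.path.join(dep_sp, rel)
            os.makedirs(os.path.dirname(path))
            open(path, "w").close()  # empty placeholder for the shared library
            link_dependency(dep_sp, torch_sp)
        # Both libraries should now be reachable under torch's site-packages.
        return [os.path.islink(os.path.join(torch_sp, rel)) for rel in deps.values()]


if __name__ == "__main__":
    print(demo())
```

Writing into the torch directory at run time is unlikely to be appropriate for Bazel-managed files, so the equivalent linking would presumably have to happen when the pip repositories are created.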
🔬 Minimal Reproduction
In MODULE.bazel
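pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
# python_version is a variable defined earlier in MODULE.bazel (not shown here).
pip.parse(
    hub_name = "pip",
    python_version = python_version,
    requirements_lock = "//:requirements.txt",
    requirements_windows = "//:requirements_windows.txt",
)
use_repo(pip, "pip")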
Then in the requirements file
torch==2.4.0+cu124
Then try to import torch from any py_binary with a dependency on "@pip//torch:pkg".
🔥 Exception or Error
.../rules_python~~pip~pip_311_torch/site-packages/torch/__init__.py", line 294, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
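To check whether libnccl.so.2 is present somewhere in the build but outside the rpath, a small helper can scan the import paths for it. This is a diagnostic sketch only: `locate_shared_library` is a hypothetical name, and it assumes that each pip package's site-packages directory appears on `sys.path` inside the py_binary.

```python
import sys
from pathlib import Path


def locate_shared_library(soname, search_paths=None):
    """Return every <path>/nvidia/*/lib/<soname> found under the given import paths."""
    paths = sys.path if search_paths is None else search_paths
    hits = []
    for entry in paths:
        base = Path(entry)
        if base.is_dir():
            hits.extend(sorted(base.glob(f"nvidia/*/lib/{soname}")))
    return hits


if __name__ == "__main__":
    # Run inside the failing py_binary, before importing torch.
    for hit in locate_shared_library("libnccl.so.2"):
        print(hit)
```

If this prints a path under a different package's directory than torch's, the library was installed but is not where torch's relative rpath entries point.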
🌍 Your Environment
Operating System:
Ubuntu 22.04.4 LTS
Output of bazel version:
7.3.0
Rules_python version:
0.31.0
Anything else relevant?