
ERROR #235

Open
sdc-sdd opened this issue Mar 18, 2024 · 2 comments


sdc-sdd commented Mar 18, 2024

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 36621) of binary: /home/xxx/anaconda3/envs/bev/bin/python


lix19937 commented May 9, 2024

Run again with NCCL_DEBUG=INFO and provide the log. That will tell us what went wrong and what the reason for the crash could be.
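For example, the variable can be set inline when launching (the launcher, script name, and GPU count below are placeholders, not taken from this thread):

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 train.py

The resulting log will include NCCL's initialization and transport details, which usually point to where the internal error occurs.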


lix19937 commented May 9, 2024

If you use Docker containers, note that they default to limited shared and pinned memory resources.
When using NCCL inside a container, it is recommended that you increase these resources by adding:

--shm-size=32g --ulimit memlock=-1

to the nvidia-docker run command line.
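A sketch of a complete launch command with these flags, assuming a hypothetical image name my_image:latest:

nvidia-docker run --shm-size=32g --ulimit memlock=-1 -it my_image:latest

On newer setups using the NVIDIA Container Toolkit, docker run --gpus all with the same --shm-size and --ulimit flags serves the same purpose.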
