
ERROR #235

Open
sdc-sdd opened this issue Mar 18, 2024 · 2 comments


sdc-sdd commented Mar 18, 2024

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 36621) of binary: /home/xxx/anaconda3/envs/bev/bin/python


lix19937 commented May 9, 2024

Run again with NCCL_DEBUG=INFO and provide the log. That will tell us what went wrong and what the reason for the crash could be.
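For example, the variable can be set inline when launching (the launcher, script name, and GPU count below are placeholders, not taken from this thread):

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 train.py

The resulting log will include NCCL's initialization and transport details, which usually point to where the internal error occurs.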


lix19937 commented May 9, 2024

If you use Docker containers, note that they default to limited shared and pinned memory resources.
When using NCCL inside a container, it is recommended that you increase these resources by adding:

--shm-size=32g --ulimit memlock=-1

to the nvidia-docker run command line.
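A sketch of a complete launch command with these flags, assuming a hypothetical image name my_image:latest:

nvidia-docker run --shm-size=32g --ulimit memlock=-1 -it my_image:latest

On newer setups using the NVIDIA Container Toolkit, docker run --gpus all with the same --shm-size and --ulimit flags serves the same purpose.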
