NCCL Error 1: unhandled cuda error #58

Open
pengjunxing opened this issue Jul 19, 2024 · 0 comments

My host runs Ubuntu 22.04 with two 4090 GPUs, and I am working inside a Docker container running Ubuntu 18.04. After configuring the environment and modifying the code, I tried to train on a dataset with 3 frames per group, but training aborts with the error message below.
Does anyone know how to solve it? Thank you very much!

(flavr_env) root@22727250d64b:/dataset/FLAVR# python main.py --batch_size 32 --test_batch_size 32 --dataset vimeo90K_septuplet --loss 1*L1 --max_epoch 200 --lr 0.0002 --data_root /dataset/vimeo_triplet --n_outputs 1
CUDA version: 10.1
CuDNN version: 7603
Is CUDA available: True
Namespace(batch_size=32, beta1=0.9, beta2=0.99, checkpoint_dir='.', cuda=True, data_root='/dataset/vimeo_triplet', dataset='vimeo90K_septuplet', exp_name='exp', joinType='concat', load_from=None, log_iter=60, loss='1*L1', lr=0.0002, max_epoch=200, model='unet_18', n_outputs=1, nbr_frame=4, nbr_width=1, num_gpu=1, num_workers=16, pretrained=None, random_seed=12345, resume=False, resume_exp=None, start_epoch=0, test_batch_size=32, upmode='transpose', use_tensorboard=False, val_freq=1)
Building model: unet_18
Preparing loss function:
1.000 * L1
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda error
Aborted (core dumped)
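
One thing that may be worth checking, offered as an assumption and not a confirmed diagnosis: the log reports CUDA 10.1, while the 4090 is an Ada Lovelace card with compute capability 8.9, which NVIDIA's toolkit only targets from CUDA 11.8 onward. The sketch below compares the two. The table and the helper names `parse_version`, `cuda_supports`, and `report` are illustrative and not part of FLAVR; only `report` needs PyTorch.

```python
# Minimum CUDA toolkit release able to target each compute capability.
# Partial table for illustration; extend it for other GPUs as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series)
    (8, 9): (11, 8),  # Ada Lovelace (e.g. RTX 4090)
}


def parse_version(text):
    """Turn '10.1' or '10.1.243' into the tuple (10, 1)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def cuda_supports(cuda_version, capability):
    """Return True if this CUDA release can target the given capability.

    Raises KeyError for capabilities missing from the partial table.
    """
    required = MIN_CUDA_FOR_CAPABILITY[tuple(capability)]
    return parse_version(cuda_version) >= required


def report():
    """Print a compatibility line for every GPU visible to PyTorch."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    if torch.version.cuda is None or not torch.cuda.is_available():
        print("This PyTorch build cannot see a CUDA device")
        return
    for index in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(index)
        capability = torch.cuda.get_device_capability(index)
        try:
            ok = cuda_supports(torch.version.cuda, capability)
        except KeyError:
            print(f"{name}: capability {capability} not in the table")
            continue
        verdict = "supported" if ok else "NOT supported"
        print(f"{name}: sm_{capability[0]}{capability[1]} is {verdict} "
              f"by CUDA {torch.version.cuda}")


if __name__ == "__main__":
    report()
```

If the script reports a mismatch, a PyTorch build compiled against CUDA 11.8 or newer would be the thing to try, though the FLAVR code may then need adjusting. Rerunning the training command with the environment variable `NCCL_DEBUG=INFO` set should also make NCCL print more detail about the underlying CUDA error.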
