
Error while running nnUNet training: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message #2641

elena-mulero opened this issue Dec 10, 2024 · 2 comments

@elena-mulero

Hello everyone and thank you for providing the nnUNet code,

I am having problems when running the training. I am using an external server with Rocky Linux 8.9, Python 3.12.3 and CUDA 12.1.0. I set up the environment as described in the documentation, first installing a compatible PyTorch package and then configuring the nnU-Net paths as environment variables. The preprocessing command ran without problems on both custom datasets I am using.
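For reference, the setup was roughly the following (the paths and the dataset ID are placeholders, not the exact values I used):

```bash
# Create and activate the virtual environment (Python 3.12.3)
python -m venv nnunetv2-venv
source nnunetv2-venv/bin/activate

# Install a PyTorch build matching CUDA 12.1 first, then nnU-Net from the cloned repository
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -e ./nnUNet

# nnU-Net paths (placeholder locations)
export nnUNet_raw=/path/nnUNet_raw
export nnUNet_preprocessed=/path/nnUNet_preprocessed
export nnUNet_results=/path/nnUNet_results

# Preprocessing completed without errors for both datasets (DATASET_ID is a placeholder)
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
```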
I tried different GPUs (A100, A40, V100 and T4) for the training, but I always get the same error:

```
/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Traceback (most recent call last):
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 285, in
self.run()
File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
run_training_entry()
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1377, in run_training
val_outputs.append(self.validation_step(next(self.dataloader_val)))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next
item = self.__get_next_item()
^^^^^^^^^^^^^^^^^^^^^^
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise e
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```
Sometimes I get the error immediately, and at other times training runs for several epochs before crashing. I tried all configurations and different folds, and got the error in every case.
I tried reducing the number of data-augmentation workers with export nnUNet_n_proc_DA=X and set export nnUNet_compile=f as recommended in other issues, but the problem persists; the exact commands are shown below.
With a T4 and a single worker, the 2d configuration trained up to epoch 914 before crashing again with the same error.
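Concretely, the workarounds looked like this (the worker count and the dataset/configuration/fold values are placeholders for the combinations I tried):

```bash
# Limit the number of background data-augmentation workers (the value was varied, down to 1)
export nnUNet_n_proc_DA=1

# Disable torch.compile, as suggested in other issues
export nnUNet_compile=f

# Example training call; DATASET_ID, CONFIGURATION (2d, 3d_fullres, ...) and FOLD were varied
nnUNetv2_train DATASET_ID CONFIGURATION FOLD
```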

Could you please help me understand what the problem is or where it comes from? I would appreciate any recommendations.

Thank you!

@sunyan1024

+1

@tjhendrickson

+1
