Error while running nnUNet training: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
#2641
Open
elena-mulero opened this issue on Dec 10, 2024 · 2 comments
Hello everyone, and thank you for providing the nnUNet code.
I am having problems running the training. I am using an external server running Rocky Linux 8.9 with Python 3.12.3 and CUDA 12.1.0. I installed the environment as described, installing the compatible PyTorch package first and setting up the environment paths. The preprocessing command ran without problems on both custom datasets I am using.
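For reference, this is roughly how I set the paths and ran preprocessing (a minimal sketch; the directories and the dataset id are placeholders, assuming the standard nnU-Net v2 environment variables and the nnUNetv2_plan_and_preprocess entry point):

```python
# Sketch of my setup (placeholder paths and dataset id), assuming the standard
# nnU-Net v2 environment variables and CLI entry points.
import os
import subprocess

os.environ["nnUNet_raw"] = "/path/nnUNet_raw"                    # raw datasets
os.environ["nnUNet_preprocessed"] = "/path/nnUNet_preprocessed"  # preprocessed output
os.environ["nnUNet_results"] = "/path/nnUNet_results"            # trained models

# Preprocessing ran without problems; "1" stands in for my dataset id.
subprocess.run(
    ["nnUNetv2_plan_and_preprocess", "-d", "1", "--verify_dataset_integrity"],
    check=True,
)
```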
I tried training on different GPUs (A100, A40, V100 and T4), but I always get the same error:
"
/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Traceback (most recent call last):
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 285, in
self.run()
File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
run_training_entry()
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1377, in run_training
val_outputs.append(self.validation_step(next(self.dataloader_val)))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next
item = self.__get_next_item()
^^^^^^^^^^^^^^^^^^^^^^
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise e
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
"
Sometimes I get the error immediately, and sometimes training runs for a few epochs before it appears. I tried all configurations and several folds, but I got the error in every case.
I tried reducing the number of data augmentation workers with export nnUNet_n_proc_DA=X and set export nnUNet_compile=f as recommended in other issues, but the problem persists. With a T4 and a single worker I could run the 2d configuration until epoch 914, but then it crashed again with the same error. A sketch of what I ran is below.
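Roughly, this is what I ran (a minimal sketch; the dataset id and fold are placeholders, assuming the usual nnUNetv2_train entry point):

```python
# Sketch of the workarounds I tried before launching training (placeholder
# dataset id and fold); both variables are read by nnU-Net at startup.
import os
import subprocess

os.environ["nnUNet_n_proc_DA"] = "1"  # fewer data augmentation worker processes
os.environ["nnUNet_compile"] = "f"    # disable torch.compile, as suggested in other issues

# Train the 2d configuration; "1" and "0" stand in for my dataset id and fold.
subprocess.run(["nnUNetv2_train", "1", "2d", "0"], check=True)
```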
Could you please help me understand what the problem might be? I would appreciate any recommendations.
Thank you!