
Error while running nnUNet training: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message #2641

elena-mulero opened this issue Dec 10, 2024 · 2 comments

@elena-mulero

Hello everyone and thank you for providing the nnUNet code,

I am having problems when running the training. I am using an external server with Rocky Linux 8.9, Python 3.12.3 and CUDA 12.1.0. I set up the environment as described in the documentation, first installing a compatible PyTorch package and then configuring the nnU-Net paths as environment variables. The preprocessing command ran without problems on both custom datasets I am using.
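For reference, the setup was roughly the following (the paths and the dataset ID are placeholders, not the exact values I used):

```bash
# Create and activate the virtual environment (Python 3.12.3)
python -m venv nnunetv2-venv
source nnunetv2-venv/bin/activate

# Install a PyTorch build matching CUDA 12.1 first, then nnU-Net from the cloned repository
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -e ./nnUNet

# nnU-Net paths (placeholder locations)
export nnUNet_raw=/path/nnUNet_raw
export nnUNet_preprocessed=/path/nnUNet_preprocessed
export nnUNet_results=/path/nnUNet_results

# Preprocessing completed without errors for both datasets (DATASET_ID is a placeholder)
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
```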
I tried different GPUs (A100, A40, V100 and T4) for the training, but I always get the same error:

```
/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Traceback (most recent call last):
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 285, in
self.run()
File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
run_training_entry()
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1377, in run_training
val_outputs.append(self.validation_step(next(self.dataloader_val)))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next
item = self.__get_next_item()
^^^^^^^^^^^^^^^^^^^^^^
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise e
File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```
Sometimes I get the error immediately, and at other times training runs for several epochs before crashing. I tried all configurations and different folds, and got the error in every case.
I tried reducing the number of data-augmentation workers with export nnUNet_n_proc_DA=X and set export nnUNet_compile=f as recommended in other issues, but the problem persists; the exact commands are shown below.
With a T4 and a single worker, the 2d configuration trained up to epoch 914 before crashing again with the same error.
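Concretely, the workarounds looked like this (the worker count and the dataset/configuration/fold values are placeholders for the combinations I tried):

```bash
# Limit the number of background data-augmentation workers (the value was varied, down to 1)
export nnUNet_n_proc_DA=1

# Disable torch.compile, as suggested in other issues
export nnUNet_compile=f

# Example training call; DATASET_ID, CONFIGURATION (2d, 3d_fullres, ...) and FOLD were varied
nnUNetv2_train DATASET_ID CONFIGURATION FOLD
```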

Could you please help me understand what the problem is or where it comes from? I would appreciate any recommendations.

Thank you!

@sunyan1024

+1

@tjhendrickson

+1
