Replies: 1 comment 1 reply
-
You are most likely running out of memory, GPU memory, or disk space. There are a number of threads on this...
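As a rough illustration of that suggestion (not part of the original reply), free GPU memory and disk space can be checked from Python with the standard torch and shutil calls; the path below is only a placeholder taken from the poster's prompt:

```python
# Hypothetical resource check, assuming torch and a CUDA GPU are available.
# Point the path at whichever drive holds nnUNet_raw / nnUNet_preprocessed / nnUNet_results.
import shutil
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB total")
    print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB, "
          f"reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")

# Free disk space on the drive that holds the nnU-Net data (placeholder path)
total, used, free = shutil.disk_usage(r"D:\work\misc\nnunet_files")
print(f"disk: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
```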
-
Hello, I keep getting this error after 1-3 epochs of training. Has anybody encountered this? I've tried reinstalling nnU-Net and using a venv, to no avail. The data was preprocessed with "nnUNetv2_plan_and_preprocess -d 11 --verify_dataset_integrity --clean".
Thanks.
PS D:\work\misc\nnunet_files> nnUNetv2_train 11 3d_fullres 0
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [96, 192, 192], 'median_image_size_in_voxels': [144.0, 291.0, 291.0], 'spacing': [1.5, 0.71875, 0.71875], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'num_pool_per_axis': [4, 5, 5], 'pool_op_kernel_sizes': [[1, 1, 1], [1, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[1, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': False}
These are the global plan.json settings:
{'dataset_name': 'Dataset011_NewBMets', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.5, 0.71875, 0.71875], 'original_median_shape_after_transp': [144, 319, 319], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 1632.0, 'mean': 279.49896240234375, 'median': 270.0, 'min': 0.0, 'percentile_00_5': 19.0, 'percentile_99_5': 717.0, 'std': 135.7633819580078}}}
2023-10-05 16:06:03.456188: unpacking dataset...
2023-10-05 16:06:11.586733: unpacking done...
2023-10-05 16:06:11.591733: do_dummy_2d_data_aug: False
2023-10-05 16:06:11.597733: Creating new 5-fold cross-validation split...
2023-10-05 16:06:11.602733: Desired fold for training: 0
2023-10-05 16:06:11.606732: This split has 96 training and 24 validation cases.
2023-10-05 16:06:20.419466: Unable to plot network architecture:
2023-10-05 16:06:20.428466: module 'torch.onnx' has no attribute 'optimize_trace'
2023-10-05 16:06:20.495287:
2023-10-05 16:06:20.500287: Epoch 0
2023-10-05 16:06:20.504287: Current learning rate: 0.01
using pin_memory on device 0
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\augmentations\utils.py:153: RuntimeWarning: overflow encountered in cast
return map_coordinates(img.astype(float), coords, order=order, mode=mode, cval=cval).astype(img.dtype)
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\numpy\core\_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\numpy\core\_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\augmentations\color_augmentations.py:140: RuntimeWarning: invalid value encountered in subtract
data_sample[c] = np.power(((data_sample[c] - minm) / float(rnge + epsilon)), gamma) * float(rnge + epsilon) + minm
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\augmentations\color_augmentations.py:140: RuntimeWarning: invalid value encountered in divide
data_sample[c] = np.power(((data_sample[c] - minm) / float(rnge + epsilon)), gamma) * float(rnge + epsilon) + minm
using pin_memory on device 0
2023-10-05 16:08:26.985589: train_loss nan
2023-10-05 16:08:26.992080: val_loss 0.0043
2023-10-05 16:08:26.996080: Pseudo dice [0.0]
2023-10-05 16:08:27.003603: Epoch time: 126.49 s
2023-10-05 16:08:27.011149: Yayy! New best EMA pseudo Dice: 0.0
2023-10-05 16:08:27.818105:
2023-10-05 16:08:27.823692: Epoch 1
2023-10-05 16:08:27.828691: Current learning rate: 0.00999
2023-10-05 16:10:09.868677: train_loss 0.0028
2023-10-05 16:10:09.878208: val_loss -0.0119
2023-10-05 16:10:09.885219: Pseudo dice [0.0]
2023-10-05 16:10:09.893760: Epoch time: 102.05 s
2023-10-05 16:10:10.523453:
2023-10-05 16:10:10.528598: Epoch 2
2023-10-05 16:10:10.532636: Current learning rate: 0.00998
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:367: block: [4827,0,0], thread: [56,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
File "", line 198, in run_module_as_main
File "", line 88, in run_code
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Scripts\nnUNetv2_train.exe_main.py", line 7, in
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\run\run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\run\run_training.py", line 204, in run_training
nnunet_trainer.run_training()
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1240, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 888, in train_step
torch.nn.utils.clip_grad_norm_(self.network.parameters(), 12)
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\utils\clip_grad.py", line 76, in clip_grad_norm_
torch._foreach_mul_(grads, clip_coef_clamped.to(device))  # type: ignore[call-overload]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception in thread Thread-4 (results_loop):
Exception in thread Thread-5 (results_loop):
Traceback (most recent call last):
Traceback (most recent call last):
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
self.run()
self.run()
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 975, in run
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
self._target(*self._args, **self._kwargs)
raise e
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise e
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
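The traceback itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1 (for example `$env:CUDA_LAUNCH_BLOCKING = "1"` in PowerShell before launching nnUNetv2_train) so the failing kernel is reported at the right call site. Since the assert fires in ScatterGatherKernel with "index out of bounds", one thing worth ruling out is a label value in the ground-truth segmentations that falls outside the range declared in dataset.json; a rough check could look like the sketch below. The dataset folder name comes from the plans printed above, while the nnUNet_raw environment variable and the .nii.gz suffix are assumptions about the local setup.

```python
# Rough sketch, not from the thread: list the unique label values per training
# segmentation so out-of-range labels (one common trigger of this scatter/gather
# assert) can be spotted. Assumes nnUNet_raw is set and labels are .nii.gz files.
import os
from pathlib import Path

import numpy as np
import SimpleITK as sitk

labels_dir = Path(os.environ["nnUNet_raw"]) / "Dataset011_NewBMets" / "labelsTr"
for seg_file in sorted(labels_dir.glob("*.nii.gz")):
    seg = sitk.GetArrayFromImage(sitk.ReadImage(str(seg_file)))
    print(seg_file.name, "->", np.unique(seg))
```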