Replies: 1 comment 1 reply
-
You are most likely running out of memory, GPU memory, or disk space. There are a number of threads on this...
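As a rough illustration of that suggestion (not part of the original reply), free GPU memory and disk space can be checked from Python with the standard torch and shutil calls; the path below is only a placeholder taken from the poster's prompt:

```python
# Hypothetical resource check, assuming torch and a CUDA GPU are available.
# Point the path at whichever drive holds nnUNet_raw / nnUNet_preprocessed / nnUNet_results.
import shutil
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB total")
    print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB, "
          f"reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")

# Free disk space on the drive that holds the nnU-Net data (placeholder path)
total, used, free = shutil.disk_usage(r"D:\work\misc\nnunet_files")
print(f"disk: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
```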
-
Hello, I keep getting this error after 1-3 epochs of training. Has anybody encountered this? I've tried reinstalling nnU-Net and using a venv, to no avail. The data was preprocessed with "nnUNetv2_plan_and_preprocess -d 11 --verify_dataset_integrity --clean".
Thanks.
PS D:\work\misc\nnunet_files> nnUNetv2_train 11 3d_fullres 0
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [96, 192, 192], 'median_image_size_in_voxels': [144.0, 291.0, 291.0], 'spacing': [1.5, 0.71875, 0.71875], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'num_pool_per_axis': [4, 5, 5], 'pool_op_kernel_sizes': [[1, 1, 1], [1, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[1, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': False}
These are the global plan.json settings:
{'dataset_name': 'Dataset011_NewBMets', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.5, 0.71875, 0.71875], 'original_median_shape_after_transp': [144, 319, 319], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 1632.0, 'mean': 279.49896240234375, 'median': 270.0, 'min': 0.0, 'percentile_00_5': 19.0, 'percentile_99_5': 717.0, 'std': 135.7633819580078}}}
2023-10-05 16:06:03.456188: unpacking dataset...
2023-10-05 16:06:11.586733: unpacking done...
2023-10-05 16:06:11.591733: do_dummy_2d_data_aug: False
2023-10-05 16:06:11.597733: Creating new 5-fold cross-validation split...
2023-10-05 16:06:11.602733: Desired fold for training: 0
2023-10-05 16:06:11.606732: This split has 96 training and 24 validation cases.
2023-10-05 16:06:20.419466: Unable to plot network architecture:
2023-10-05 16:06:20.428466: module 'torch.onnx' has no attribute 'optimize_trace'
2023-10-05 16:06:20.495287:
2023-10-05 16:06:20.500287: Epoch 0
2023-10-05 16:06:20.504287: Current learning rate: 0.01
using pin_memory on device 0
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\augmentations\utils.py:153: RuntimeWarning: overflow encountered in cast
return map_coordinates(img.astype(float), coords, order=order, mode=mode, cval=cval).astype(img.dtype)
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\numpy\core\_methods.py:118: RuntimeWarning: invalid value encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\numpy\core\_methods.py:152: RuntimeWarning: invalid value encountered in reduce
arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\augmentations\color_augmentations.py:140: RuntimeWarning: invalid value encountered in subtract
data_sample[c] = np.power(((data_sample[c] - minm) / float(rnge + epsilon)), gamma) * float(rnge + epsilon) + minm
C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\augmentations\color_augmentations.py:140: RuntimeWarning: invalid value encountered in divide
data_sample[c] = np.power(((data_sample[c] - minm) / float(rnge + epsilon)), gamma) * float(rnge + epsilon) + minm
using pin_memory on device 0
2023-10-05 16:08:26.985589: train_loss nan
2023-10-05 16:08:26.992080: val_loss 0.0043
2023-10-05 16:08:26.996080: Pseudo dice [0.0]
2023-10-05 16:08:27.003603: Epoch time: 126.49 s
2023-10-05 16:08:27.011149: Yayy! New best EMA pseudo Dice: 0.0
2023-10-05 16:08:27.818105:
2023-10-05 16:08:27.823692: Epoch 1
2023-10-05 16:08:27.828691: Current learning rate: 0.00999
2023-10-05 16:10:09.868677: train_loss 0.0028
2023-10-05 16:10:09.878208: val_loss -0.0119
2023-10-05 16:10:09.885219: Pseudo dice [0.0]
2023-10-05 16:10:09.893760: Epoch time: 102.05 s
2023-10-05 16:10:10.523453:
2023-10-05 16:10:10.528598: Epoch 2
2023-10-05 16:10:10.532636: Current learning rate: 0.00998
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:367: block: [4827,0,0], thread: [56,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
File "", line 198, in run_module_as_main
File "", line 88, in run_code
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Scripts\nnUNetv2_train.exe_main.py", line 7, in
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\run\run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\run\run_training.py", line 204, in run_training
nnunet_trainer.run_training()
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1240, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 888, in train_step
torch.nn.utils.clip_grad_norm_(self.network.parameters(), 12)
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\utils\clip_grad.py", line 76, in clip_grad_norm_
torch._foreach_mul_(grads, clip_coef_clamped.to(device))  # type: ignore[call-overload]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception in thread Thread-4 (results_loop):
Exception in thread Thread-5 (results_loop):
Traceback (most recent call last):
Traceback (most recent call last):
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
self.run()
self.run()
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 975, in run
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
self._target(*self._args, **self._kwargs)
raise e
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise e
File "C:\Users\physics\AppData\Local\Programs\Python\Python311\Lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
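The traceback itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1 (for example `$env:CUDA_LAUNCH_BLOCKING = "1"` in PowerShell before launching nnUNetv2_train) so the failing kernel is reported at the right call site. Since the assert fires in ScatterGatherKernel with "index out of bounds", one thing worth ruling out is a label value in the ground-truth segmentations that falls outside the range declared in dataset.json; a rough check could look like the sketch below. The dataset folder name comes from the plans printed above, while the nnUNet_raw environment variable and the .nii.gz suffix are assumptions about the local setup.

```python
# Rough sketch, not from the thread: list the unique label values per training
# segmentation so out-of-range labels (one common trigger of this scatter/gather
# assert) can be spotted. Assumes nnUNet_raw is set and labels are .nii.gz files.
import os
from pathlib import Path

import numpy as np
import SimpleITK as sitk

labels_dir = Path(os.environ["nnUNet_raw"]) / "Dataset011_NewBMets" / "labelsTr"
for seg_file in sorted(labels_dir.glob("*.nii.gz")):
    seg = sitk.GetArrayFromImage(sitk.ReadImage(str(seg_file)))
    print(seg_file.name, "->", np.unique(seg))
```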