
[Bug] CUDA error: an illegal memory access was encountered when training MVX-NET on KITTI #2209

Open

maxiuw opened this issue Jan 9, 2023 · 11 comments

maxiuw commented Jan 9, 2023

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.3
OpenCV: 4.6.0
MMCV: 1.6.2
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.26.0
MMSegmentation: 0.29.1
MMDetection3D: 1.0.0rc5+47285b3
spconv2.0: False

Reproduces the problem - code sample

python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py

Reproduces the problem - command or script

python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py

Reproduces the problem - error message

2023-01-09 17:50:54,289 - mmdet - INFO - Epoch [1][50/7424] lr: 4.323e-04, eta: 12:27:33, time: 0.151, data_time: 0.044, memory: 2903, loss_cls: 1.2267, loss_bbox: 3.5244, loss_dir: 0.1647, loss: 4.9158, grad_norm: 434.8607
2023-01-09 17:50:59,192 - mmdet - INFO - Epoch [1][100/7424] lr: 5.673e-04, eta: 10:16:19, time: 0.098, data_time: 0.002, memory: 3028, loss_cls: 1.1155, loss_bbox: 2.7561, loss_dir: 0.1583, loss: 4.0298, grad_norm: 66.9439
2023-01-09 17:51:03,828 - mmdet - INFO - Epoch [1][150/7424] lr: 7.023e-04, eta: 9:23:41, time: 0.093, data_time: 0.002, memory: 3351, loss_cls: 1.0779, loss_bbox: 1.9355, loss_dir: 0.1621, loss: 3.1756, grad_norm: 26.9783
2023-01-09 17:51:08,802 - mmdet - INFO - Epoch [1][200/7424] lr: 8.373e-04, eta: 9:05:41, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: 1.0563, loss_bbox: 3.1697, loss_dir: 0.1599, loss: 4.3858, grad_norm: 60.4415
2023-01-09 17:51:13,848 - mmdet - INFO - Epoch [1][250/7424] lr: 9.723e-04, eta: 8:56:17, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: 0.9396, loss_bbox: 1.4337, loss_dir: 0.1429, loss: 2.5162, grad_norm: 12.9785
2023-01-09 17:51:18,777 - mmdet - INFO - Epoch [1][300/7424] lr: 1.107e-03, eta: 8:48:04, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: 0.8953, loss_bbox: 1.5164, loss_dir: 0.1394, loss: 2.5511, grad_norm: 9.2371
2023-01-09 17:51:23,919 - mmdet - INFO - Epoch [1][350/7424] lr: 1.242e-03, eta: 8:45:10, time: 0.103, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:29,004 - mmdet - INFO - Epoch [1][400/7424] lr: 1.377e-03, eta: 8:42:16, time: 0.102, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:34,068 - mmdet - INFO - Epoch [1][450/7424] lr: 1.512e-03, eta: 8:39:46, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:39,022 - mmdet - INFO - Epoch [1][500/7424] lr: 1.647e-03, eta: 8:36:40, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:43,826 - mmdet - INFO - Epoch [1][550/7424] lr: 1.782e-03, eta: 8:32:46, time: 0.096, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:48,940 - mmdet - INFO - Epoch [1][600/7424] lr: 1.917e-03, eta: 8:32:03, time: 0.102, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:53,680 - mmdet - INFO - Epoch [1][650/7424] lr: 2.052e-03, eta: 8:28:36, time: 0.095, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:58,653 - mmdet - INFO - Epoch [1][700/7424] lr: 2.187e-03, eta: 8:27:16, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:03,299 - mmdet - INFO - Epoch [1][750/7424] lr: 2.322e-03, eta: 8:23:57, time: 0.093, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:08,028 - mmdet - INFO - Epoch [1][800/7424] lr: 2.457e-03, eta: 8:21:33, time: 0.095, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:12,853 - mmdet - INFO - Epoch [1][850/7424] lr: 2.592e-03, eta: 8:19:59, time: 0.097, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:17,482 - mmdet - INFO - Epoch [1][900/7424] lr: 2.727e-03, eta: 8:17:30, time: 0.093, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:22,313 - mmdet - INFO - Epoch [1][950/7424] lr: 2.862e-03, eta: 8:16:19, time: 0.097, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:27,538 - mmdet - INFO - Exp name: dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
2023-01-09 17:52:27,538 - mmdet - INFO - Epoch [1][1000/7424] lr: 2.997e-03, eta: 8:17:12, time: 0.105, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:32,338 - mmdet - INFO - Epoch [1][1050/7424] lr: 3.000e-03, eta: 8:15:59, time: 0.096, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:37,369 - mmdet - INFO - Epoch [1][1100/7424] lr: 3.000e-03, eta: 8:15:55, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:42,261 - mmdet - INFO - Epoch [1][1150/7424] lr: 3.000e-03, eta: 8:15:15, time: 0.098, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
Traceback (most recent call last):
  File "tools/train.py", line 265, in <module>
    main()
  File "tools/train.py", line 254, in main
    train_model(
  File "/home/raghav/mmdetection3d/mmdet3d/apis/train.py", line 347, in train_model
    train_detector(
  File "/home/raghav/mmdetection3d/mmdet3d/apis/train.py", line 322, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_iter')
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 67, in after_train_iter
    runner.log_buffer.update({'grad_norm': float(grad_norm)},
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
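As the message says, rerunning with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack trace points at the kernel that actually faults. A minimal sketch of the rerun, using the same config as above:

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py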

Additional information

I am re-training the MVX-Net model on the KITTI dataset with a reduced number of lidar layers; however, the same problem occurs when training on the 'classic' KITTI dataset. After a few hundred iterations the losses become nan, and shortly afterwards the training run crashes with:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I would be really grateful for any help!

@JingweiZhang12 (Contributor)

See #2068. Lowering the learning rate may be helpful.
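For example, the learning rate can be overridden from the command line instead of editing the config. A sketch, assuming the mmcv-style --cfg-options flag of tools/train.py and this config's optimizer.lr field (3e-3 by default, per the lr values in the log above); the value 3e-4 is only illustrative:

python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py --cfg-options optimizer.lr=3e-4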

maxiuw (Author) commented Jan 19, 2023

Not really helping.

zwl8979 commented Feb 2, 2023

I also encountered the same problem. Did you manage to solve it? Thanks!

@maxiuw @JingweiZhang12

maxiuw (Author) commented Feb 2, 2023

Yes, though you cannot train it from the checkpoint; you have to start from scratch.

Comment out the last line in the config file, the one responsible for downloading the checkpoint (see the sketch below).
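A minimal sketch of that edit (the exact checkpoint URL is elided here):

# at the end of configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
# load_from = 'https://download.openmmlab.com/...'  # commented out: train from scratch instead of a pretrained checkpoint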

So it isn't really a solution :)

zwl8979 commented Feb 2, 2023

I have a question that I hope you can clarify. If you start training from scratch without using a pre-trained model, is the accuracy of the trained model much different from the baseline?
@maxiuw

zwl8979 commented Feb 2, 2023

Do you know what caused this? (Training succeeds without the pre-trained model.) Thanks!
@JingweiZhang12 @maxiuw

Tai-Wang assigned Tai-Wang and JingweiZhang12 and unassigned Tai-Wang on Feb 13, 2023
@JingweiZhang12 (Contributor)

Hi @maxiuw @zwl8979, we'll profile the CUDA error of MVX-Net ASAP. Please keep an eye on this issue.

zwl8979 commented Feb 15, 2023

@JingweiZhang12 OK, thank you! Looking forward to your reply as soon as possible.

@JingweiZhang12 (Contributor)

Hi @zwl8979 @maxiuw, we have fixed this bug in PR #2282.
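To pick up the fix with a from-source (editable) install, updating the checkout should be enough; a sketch, assuming the fix has been merged into the branch you track:

cd mmdetection3d
git pull             # pull in the commit from PR #2282
pip install -v -e .  # refresh the editable install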

@mendoza-G

Same problem, but it works on a single GPU and fails with distributed training.

Error message

Traceback (most recent call last):
  File "./tools/train.py", line 263, in <module>
    main()
  File "./tools/train.py", line 252, in main
    train_model(
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/apis/train.py", line 319, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/base.py", line 60, in forward
    return self.forward_train(**kwargs)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 273, in forward_train
    img_feats, pts_feats = self.extract_feat(
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 208, in extract_feat
    pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/mvx_faster_rcnn.py", line 57, in extract_pts_feat
    x = self.pts_middle_encoder(voxel_features, feature_coors, batch_size)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/middle_encoders/sparse_encoder.py", line 123, in forward
    x = self.conv_input(input_sp_tensor)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_modules.py", line 134, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_conv.py", line 183, in forward
    out_features = Fsp.indice_subm_conv(features, self.weight,
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_functional.py", line 110, in forward
    return ops.indice_conv(features, filters, indice_pairs,
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_ops.py", line 124, in indice_conv
    return ext_module.indice_conv_forward(features, filters, indice_pairs,
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa2a5bc9612 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x22c1e (0x7fa2a5e38c1e in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x22d (0x7fa2a5e3bc4d in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x339668 (0x7fa2ef451668 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fa2a5bae295 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x214cfd (0x7fa2ef32ccfd in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x541188 (0x7fa2ef659188 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b2 (0x7fa2ef659482 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4b7fa0]
frame #9: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4c2a0f]
frame #10: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4c2863]
frame #11: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4d7a5b]
frame #12: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4d7a5b]
frame #13: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4d1b68]
frame #14: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e5058]
frame #15: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #16: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #17: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #18: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #19: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #20: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #21: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #22: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4b5827]
frame #23: PyDict_SetItemString + 0x99 (0x4bd0a9 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #24: PyImport_Cleanup + 0x93 (0x5911a3 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #25: Py_FinalizeEx + 0x71 (0x58cd81 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #26: Py_RunMain + 0x1b6 (0x584126 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #27: Py_BytesMain + 0x39 (0x561ab9 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #28: __libc_start_main + 0xe7 (0x7fa2fd03dc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x56196e]

Env

Python: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.8
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1+cu102
PyTorch compiling details: PyTorch built with:

  • GCC 7.3

  • C++ Version: 201402

  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX2

  • CUDA Runtime 10.2

  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70

  • CuDNN 7.6.5

  • Magma 2.5.2

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1+cu102
OpenCV: 4.2.0
MMCV: 1.7.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.28.2
MMSegmentation: 0.30.0
MMDetection3D: 1.0.0rc6+47285b3
spconv2.0: False
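For reference, the distributed run here was presumably launched with the standard multi-GPU launcher shipped in tools/; a sketch using the MVX-NET config from this issue and the 4 GPUs listed in the env above:

bash ./tools/dist_train.sh configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py 4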

@mendoza-G

(Quoting the error message and environment from the previous comment.)
Have tried adding BN and lowering the lr.
