
[Bug] CUDA error: an illegal memory access was encountered when training MVX-NET on KITTI #2209

Open

maxiuw opened this issue Jan 9, 2023 · 11 comments

maxiuw commented Jan 9, 2023

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.3
OpenCV: 4.6.0
MMCV: 1.6.2
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.26.0
MMSegmentation: 0.29.1
MMDetection3D: 1.0.0rc5+47285b3
spconv2.0: False

Reproduces the problem - code sample

python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py

Reproduces the problem - command or script

python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py

Reproduces the problem - error message

2023-01-09 17:50:54,289 - mmdet - INFO - Epoch [1][50/7424] lr: 4.323e-04, eta: 12:27:33, time: 0.151, data_time: 0.044, memory: 2903, loss_cls: 1.2267, loss_bbox: 3.5244, loss_dir: 0.1647, loss: 4.9158, grad_norm: 434.8607
2023-01-09 17:50:59,192 - mmdet - INFO - Epoch [1][100/7424] lr: 5.673e-04, eta: 10:16:19, time: 0.098, data_time: 0.002, memory: 3028, loss_cls: 1.1155, loss_bbox: 2.7561, loss_dir: 0.1583, loss: 4.0298, grad_norm: 66.9439
2023-01-09 17:51:03,828 - mmdet - INFO - Epoch [1][150/7424] lr: 7.023e-04, eta: 9:23:41, time: 0.093, data_time: 0.002, memory: 3351, loss_cls: 1.0779, loss_bbox: 1.9355, loss_dir: 0.1621, loss: 3.1756, grad_norm: 26.9783
2023-01-09 17:51:08,802 - mmdet - INFO - Epoch [1][200/7424] lr: 8.373e-04, eta: 9:05:41, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: 1.0563, loss_bbox: 3.1697, loss_dir: 0.1599, loss: 4.3858, grad_norm: 60.4415
2023-01-09 17:51:13,848 - mmdet - INFO - Epoch [1][250/7424] lr: 9.723e-04, eta: 8:56:17, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: 0.9396, loss_bbox: 1.4337, loss_dir: 0.1429, loss: 2.5162, grad_norm: 12.9785
2023-01-09 17:51:18,777 - mmdet - INFO - Epoch [1][300/7424] lr: 1.107e-03, eta: 8:48:04, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: 0.8953, loss_bbox: 1.5164, loss_dir: 0.1394, loss: 2.5511, grad_norm: 9.2371
2023-01-09 17:51:23,919 - mmdet - INFO - Epoch [1][350/7424] lr: 1.242e-03, eta: 8:45:10, time: 0.103, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:29,004 - mmdet - INFO - Epoch [1][400/7424] lr: 1.377e-03, eta: 8:42:16, time: 0.102, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:34,068 - mmdet - INFO - Epoch [1][450/7424] lr: 1.512e-03, eta: 8:39:46, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:39,022 - mmdet - INFO - Epoch [1][500/7424] lr: 1.647e-03, eta: 8:36:40, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:43,826 - mmdet - INFO - Epoch [1][550/7424] lr: 1.782e-03, eta: 8:32:46, time: 0.096, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:48,940 - mmdet - INFO - Epoch [1][600/7424] lr: 1.917e-03, eta: 8:32:03, time: 0.102, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:53,680 - mmdet - INFO - Epoch [1][650/7424] lr: 2.052e-03, eta: 8:28:36, time: 0.095, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:58,653 - mmdet - INFO - Epoch [1][700/7424] lr: 2.187e-03, eta: 8:27:16, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:03,299 - mmdet - INFO - Epoch [1][750/7424] lr: 2.322e-03, eta: 8:23:57, time: 0.093, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:08,028 - mmdet - INFO - Epoch [1][800/7424] lr: 2.457e-03, eta: 8:21:33, time: 0.095, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:12,853 - mmdet - INFO - Epoch [1][850/7424] lr: 2.592e-03, eta: 8:19:59, time: 0.097, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:17,482 - mmdet - INFO - Epoch [1][900/7424] lr: 2.727e-03, eta: 8:17:30, time: 0.093, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:22,313 - mmdet - INFO - Epoch [1][950/7424] lr: 2.862e-03, eta: 8:16:19, time: 0.097, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:27,538 - mmdet - INFO - Exp name: dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
2023-01-09 17:52:27,538 - mmdet - INFO - Epoch [1][1000/7424] lr: 2.997e-03, eta: 8:17:12, time: 0.105, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:32,338 - mmdet - INFO - Epoch [1][1050/7424] lr: 3.000e-03, eta: 8:15:59, time: 0.096, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:37,369 - mmdet - INFO - Epoch [1][1100/7424] lr: 3.000e-03, eta: 8:15:55, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:42,261 - mmdet - INFO - Epoch [1][1150/7424] lr: 3.000e-03, eta: 8:15:15, time: 0.098, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
Traceback (most recent call last):
  File "tools/train.py", line 265, in <module>
    main()
  File "tools/train.py", line 254, in main
    train_model(
  File "/home/raghav/mmdetection3d/mmdet3d/apis/train.py", line 347, in train_model
    train_detector(
  File "/home/raghav/mmdetection3d/mmdet3d/apis/train.py", line 322, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_iter')
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 67, in after_train_iter
    runner.log_buffer.update({'grad_norm': float(grad_norm)},
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
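As the message says, rerunning with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack trace points at the kernel that actually faults. A minimal sketch of the rerun, using the same config as above:

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py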

Additional information

I am re-training the MVX-Net model on the KITTI dataset with a reduced number of lidar layers; however, the same problem occurs when training on the 'classic' KITTI dataset. After a few hundred iterations the losses become nan, and shortly afterwards the training run crashes with:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I would be really grateful for any help!

@JingweiZhang12 (Contributor)

See #2068. Lowering the learning rate may be helpful.
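For example, the learning rate can be overridden from the command line instead of editing the config. A sketch, assuming the mmcv-style --cfg-options flag of tools/train.py and this config's optimizer.lr field (3e-3 by default, per the lr values in the log above); the value 3e-4 is only illustrative:

python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py --cfg-options optimizer.lr=3e-4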

maxiuw (Author) commented Jan 19, 2023

Not really helping.

zwl8979 commented Feb 2, 2023

I also encountered the same problem. Did you manage to solve it? Thanks!

@maxiuw @JingweiZhang12

maxiuw (Author) commented Feb 2, 2023

Yes, though you cannot train it from the checkpoint; you have to start from scratch.

Comment out the last line in the config file, the one responsible for downloading the checkpoint (see the sketch below).
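A minimal sketch of that edit (the exact checkpoint URL is elided here):

# at the end of configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
# load_from = 'https://download.openmmlab.com/...'  # commented out: train from scratch instead of a pretrained checkpoint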

So it isn't really a solution :)

zwl8979 commented Feb 2, 2023

I have a question that I hope you can clarify. If you start training from scratch without using a pre-trained model, is the accuracy of the trained model much different from the baseline?
@maxiuw

zwl8979 commented Feb 2, 2023

Do you know what caused this? (Training succeeds without the pre-trained model.) Thanks!
@JingweiZhang12 @maxiuw

Tai-Wang assigned Tai-Wang and JingweiZhang12 and unassigned Tai-Wang on Feb 13, 2023
@JingweiZhang12 (Contributor)

Hi @maxiuw @zwl8979, we'll profile the CUDA error of MVX-Net ASAP. Please keep an eye on this issue.

zwl8979 commented Feb 15, 2023

@JingweiZhang12 OK, thank you! Looking forward to your reply as soon as possible.

@JingweiZhang12 (Contributor)

Hi @zwl8979 @maxiuw, we have fixed this bug in PR #2282.
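To pick up the fix with a from-source (editable) install, updating the checkout should be enough; a sketch, assuming the fix has been merged into the branch you track:

cd mmdetection3d
git pull             # pull in the commit from PR #2282
pip install -v -e .  # refresh the editable install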

@mendoza-G

Same problem, but it works on a single GPU and fails with distributed training.

Error message

Traceback (most recent call last):
  File "./tools/train.py", line 263, in <module>
    main()
  File "./tools/train.py", line 252, in main
    train_model(
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/apis/train.py", line 319, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/base.py", line 60, in forward
    return self.forward_train(**kwargs)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 273, in forward_train
    img_feats, pts_feats = self.extract_feat(
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 208, in extract_feat
    pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/detectors/mvx_faster_rcnn.py", line 57, in extract_pts_feat
    x = self.pts_middle_encoder(voxel_features, feature_coors, batch_size)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/media/xxx/Mendoza_SSD/exp/mmdetection3d/mmdet3d/models/middle_encoders/sparse_encoder.py", line 123, in forward
    x = self.conv_input(input_sp_tensor)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_modules.py", line 134, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_conv.py", line 183, in forward
    out_features = Fsp.indice_subm_conv(features, self.weight,
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_functional.py", line 110, in forward
    return ops.indice_conv(features, filters, indice_pairs,
  File "/home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/mmcv/ops/sparse_ops.py", line 124, in indice_conv
    return ext_module.indice_conv_forward(features, filters, indice_pairs,
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa2a5bc9612 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x22c1e (0x7fa2a5e38c1e in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x22d (0x7fa2a5e3bc4d in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x339668 (0x7fa2ef451668 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fa2a5bae295 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x214cfd (0x7fa2ef32ccfd in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x541188 (0x7fa2ef659188 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b2 (0x7fa2ef659482 in /home/xxx/anaconda3/envs/ppytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4b7fa0]
frame #9: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4c2a0f]
frame #10: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4c2863]
frame #11: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4d7a5b]
frame #12: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4d7a5b]
frame #13: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4d1b68]
frame #14: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e5058]
frame #15: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #16: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #17: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #18: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #19: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #20: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #21: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4e506b]
frame #22: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x4b5827]
frame #23: PyDict_SetItemString + 0x99 (0x4bd0a9 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #24: PyImport_Cleanup + 0x93 (0x5911a3 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #25: Py_FinalizeEx + 0x71 (0x58cd81 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #26: Py_RunMain + 0x1b6 (0x584126 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #27: Py_BytesMain + 0x39 (0x561ab9 in /home/xxx/anaconda3/envs/ppytorch/bin/python)
frame #28: __libc_start_main + 0xe7 (0x7fa2fd03dc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: /home/xxx/anaconda3/envs/ppytorch/bin/python() [0x56196e]

Env

Python: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.8
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1+cu102
PyTorch compiling details: PyTorch built with:

  • GCC 7.3

  • C++ Version: 201402

  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX2

  • CUDA Runtime 10.2

  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70

  • CuDNN 7.6.5

  • Magma 2.5.2

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1+cu102
OpenCV: 4.2.0
MMCV: 1.7.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.28.2
MMSegmentation: 0.30.0
MMDetection3D: 1.0.0rc6+47285b3
spconv2.0: False
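For reference, the distributed run here was presumably launched with the standard multi-GPU launcher shipped in tools/; a sketch using the MVX-NET config from this issue and the 4 GPUs listed in the env above:

bash ./tools/dist_train.sh configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py 4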

@mendoza-G

(Quoting the error message and environment from the previous comment.)
Have tried adding BN and lowering the lr.
