[Bug] CUDA error: an illegal memory access was encountered when training MVX-NET kitti #2209
Comments
#2068. Lowering the learning rate may be helpful.
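A minimal sketch of what "lowering the learning rate" means here, assuming the stock AdamW optimizer of this config (the log below peaks at lr=3.000e-03; 3e-4 is only an illustrative reduction, and any other optimizer fields should stay as in your config):

# In dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py, or a config
# inheriting from it, override the optimizer with a smaller learning rate.
# lr=0.0003 is an illustrative tenth of the stock peak value, not a verified fix.
optimizer = dict(type='AdamW', lr=0.0003, weight_decay=0.01)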
Not really helping.
I also encountered the same problem. Did you solve it? Thanks!
Yes, though you cannot train it from the checkpoint; you have to start from scratch. Comment out the last line in the config file, which is responsible for downloading the checkpoint. So it isn't really a solution :)
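For concreteness, a sketch of that change at the bottom of the config file; the URL shown is a placeholder for whatever checkpoint line your config actually contains:

# Commenting out load_from (or setting it to None) makes training start from
# random initialization instead of the downloaded checkpoint.
# load_from = 'https://download.openmmlab.com/...'  # placeholder, not the real URL
load_from = None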
I have a question that I hope you can clarify. If you start training from scratch without using a pre-trained model, is the accuracy of the trained model much different from the baseline?
Do you know what caused this? (Training can succeed without the pre-trained model.) Thanks!
@JingweiZhang12 OK, thank you! Looking forward to your reply.
Same problem, but it works on a single GPU and fails with distributed training.
Error message:
Traceback (most recent call last):
Env:
Python: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]
TorchVision: 0.13.1+cu102
I have tried adding BN and lowering the lr.
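Given the grad_norm of ~435 in the first log interval below, gradient clipping may be another knob worth trying. A sketch using mmcv's optimizer hook; max_norm=35 is a common mmdet choice, not a value validated for this issue, and the config may already set something similar:

# mmcv's OptimizerHook clips gradients when grad_clip is set: this caps the
# global L2 norm of all parameter gradients at 35 on every iteration.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))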
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
master branch https://github.com/open-mmlab/mmdetection3d
Environment
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.3
OpenCV: 4.6.0
MMCV: 1.6.2
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.26.0
MMSegmentation: 0.29.1
MMDetection3D: 1.0.0rc5+47285b3
spconv2.0: False
Reproduces the problem - code sample
python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
Reproduces the problem - command or script
python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
Reproduces the problem - error message
2023-01-09 17:50:54,289 - mmdet - INFO - Epoch [1][50/7424] lr: 4.323e-04, eta: 12:27:33, time: 0.151, data_time: 0.044, memory: 2903, loss_cls: 1.2267, loss_bbox: 3.5244, loss_dir: 0.1647, loss: 4.9158, grad_norm: 434.8607
2023-01-09 17:50:59,192 - mmdet - INFO - Epoch [1][100/7424] lr: 5.673e-04, eta: 10:16:19, time: 0.098, data_time: 0.002, memory: 3028, loss_cls: 1.1155, loss_bbox: 2.7561, loss_dir: 0.1583, loss: 4.0298, grad_norm: 66.9439
2023-01-09 17:51:03,828 - mmdet - INFO - Epoch [1][150/7424] lr: 7.023e-04, eta: 9:23:41, time: 0.093, data_time: 0.002, memory: 3351, loss_cls: 1.0779, loss_bbox: 1.9355, loss_dir: 0.1621, loss: 3.1756, grad_norm: 26.9783
2023-01-09 17:51:08,802 - mmdet - INFO - Epoch [1][200/7424] lr: 8.373e-04, eta: 9:05:41, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: 1.0563, loss_bbox: 3.1697, loss_dir: 0.1599, loss: 4.3858, grad_norm: 60.4415
2023-01-09 17:51:13,848 - mmdet - INFO - Epoch [1][250/7424] lr: 9.723e-04, eta: 8:56:17, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: 0.9396, loss_bbox: 1.4337, loss_dir: 0.1429, loss: 2.5162, grad_norm: 12.9785
2023-01-09 17:51:18,777 - mmdet - INFO - Epoch [1][300/7424] lr: 1.107e-03, eta: 8:48:04, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: 0.8953, loss_bbox: 1.5164, loss_dir: 0.1394, loss: 2.5511, grad_norm: 9.2371
2023-01-09 17:51:23,919 - mmdet - INFO - Epoch [1][350/7424] lr: 1.242e-03, eta: 8:45:10, time: 0.103, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:29,004 - mmdet - INFO - Epoch [1][400/7424] lr: 1.377e-03, eta: 8:42:16, time: 0.102, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:34,068 - mmdet - INFO - Epoch [1][450/7424] lr: 1.512e-03, eta: 8:39:46, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:39,022 - mmdet - INFO - Epoch [1][500/7424] lr: 1.647e-03, eta: 8:36:40, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:43,826 - mmdet - INFO - Epoch [1][550/7424] lr: 1.782e-03, eta: 8:32:46, time: 0.096, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:48,940 - mmdet - INFO - Epoch [1][600/7424] lr: 1.917e-03, eta: 8:32:03, time: 0.102, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:53,680 - mmdet - INFO - Epoch [1][650/7424] lr: 2.052e-03, eta: 8:28:36, time: 0.095, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:51:58,653 - mmdet - INFO - Epoch [1][700/7424] lr: 2.187e-03, eta: 8:27:16, time: 0.099, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:03,299 - mmdet - INFO - Epoch [1][750/7424] lr: 2.322e-03, eta: 8:23:57, time: 0.093, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:08,028 - mmdet - INFO - Epoch [1][800/7424] lr: 2.457e-03, eta: 8:21:33, time: 0.095, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:12,853 - mmdet - INFO - Epoch [1][850/7424] lr: 2.592e-03, eta: 8:19:59, time: 0.097, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:17,482 - mmdet - INFO - Epoch [1][900/7424] lr: 2.727e-03, eta: 8:17:30, time: 0.093, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:22,313 - mmdet - INFO - Epoch [1][950/7424] lr: 2.862e-03, eta: 8:16:19, time: 0.097, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:27,538 - mmdet - INFO - Exp name: dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
2023-01-09 17:52:27,538 - mmdet - INFO - Epoch [1][1000/7424] lr: 2.997e-03, eta: 8:17:12, time: 0.105, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:32,338 - mmdet - INFO - Epoch [1][1050/7424] lr: 3.000e-03, eta: 8:15:59, time: 0.096, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:37,369 - mmdet - INFO - Epoch [1][1100/7424] lr: 3.000e-03, eta: 8:15:55, time: 0.101, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
2023-01-09 17:52:42,261 - mmdet - INFO - Epoch [1][1150/7424] lr: 3.000e-03, eta: 8:15:15, time: 0.098, data_time: 0.002, memory: 3423, loss_cls: nan, loss_bbox: nan, loss_dir: nan, loss: nan, grad_norm: nan
Traceback (most recent call last):
File "tools/train.py", line 265, in
main()
File "tools/train.py", line 254, in main
train_model(
File "/home/raghav/mmdetection3d/mmdet3d/apis/train.py", line 347, in train_model
train_detector(
File "/home/raghav/mmdetection3d/mmdet3d/apis/train.py", line 322, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_iter')
File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "/home/raghav/miniconda3/envs/mapping/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 67, in after_train_iter
runner.log_buffer.update({'grad_norm': float(grad_norm)},
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
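As the message suggests, CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the traceback points at the actual failing op. A minimal sketch; the variable must be set before CUDA is initialized, e.g. at the very top of tools/train.py, or equivalently by prefixing the shell command with CUDA_LAUNCH_BLOCKING=1:

import os
# Must run before the first CUDA call, otherwise it has no effect.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'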
Additional information
I am re-training the MVX-Net model on the KITTI dataset with a reduced number of lidar layers; however, the same problem occurs when training on the 'classic' KITTI dataset. After a couple of hundred steps the losses become NaN, and the training run then crashes with the CUDA illegal-memory-access error shown above.
I would be really grateful for any help!