Training PointPillars on nuScenes mini dataset - "... variables needed for gradient computation has been modified by an inplace operation" #1057
Comments
How do you change the batch_size? You just need to change the param |
That's what happens on the most recent master commit when I update |
After upgrading to a GPU with more memory, the inplace operation error remains (with unmodified
This is happening on the most recent master commit, without any modifications. Does training start normally when you execute |
Using
It seems that for some reason |
If you only train with a single GPU, please replace SyncBN with a general BN for nuScenes models, although we do not recommend it. You can try experiments on KITTI first; its size is friendlier to limited computing resources.
Thanks for the tips. I've replaced SyncBN with general BN:
However, the error remains: (Now using
I have been using the nuScenes-mini dataset thus far, but I will see if the problem goes away when using KITTI.
Training works fine on KITTI.
Then I recommend you first train models on KITTI, because training on nuScenes-mini usually cannot achieve decent performance. We will try to reproduce your bug ASAP.
Hi, @nout-kleef I cannot reproduce your error on my machine. Would you please run |
Hi @wHao-Wu , Now, training commences without problems ( New env:
Here is a diff against the env I posted initially:
My guess is that the issue was caused by PyTorch 1.10.0?
The environment in my machine is PyTorch 1.5.0. We will try to reproduce this error in the environment of |
@wHao-Wu we are getting the same error when starting a training run on PointPillars. We cannot lower the CUDA and torch versions since our GPU is new. What causes this error in the up-to-date torch version?
Finally found the reason. The out-of-place form "x = x + y" should be used instead of the in-place "x += y". See https://github.com/open-mmlab/mmdetection/blob/56e42e72cdf516bebb676e586f408b98f854d84c/mmdet/models/necks/fpn.py#L169. Newer versions of mmdetection are OK. Reference: https://discuss.pytorch.org/t/element-0-of-tensors-does-not-require-grad-and-does-not-have-a-grad-fn/32908/112
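For anyone hitting the same message, here is a minimal sketch of the difference between the in-place and out-of-place forms. It is a toy example, not the FPN code linked above: the function name backward_succeeds and the use of torch.exp are my own choices for illustration. exp is used only because autograd keeps its output for the backward pass, so overwriting that output in place invalidates the saved tensor.

```python
import torch


def backward_succeeds(inplace: bool) -> bool:
    """Return True if backward() runs, False if autograd rejects the graph.

    Hypothetical helper for illustration only; not part of mmdetection.
    """
    x = torch.ones(3, requires_grad=True)
    y = torch.exp(x)  # exp keeps its output around to compute the gradient
    if inplace:
        y += 1  # overwrites the tensor that exp saved for backward
    else:
        y = y + 1  # allocates a new tensor; the saved output is untouched
    try:
        y.sum().backward()
    except RuntimeError as err:
        print(f"inplace={inplace}: {err}")
        return False
    return True


if __name__ == "__main__":
    print(backward_succeeds(inplace=False))
    print(backward_succeeds(inplace=True))
```

Whether a given "+=" actually triggers the error depends on whether the tensor being overwritten is needed for the backward pass, which may explain why the same code behaves differently across PyTorch versions.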
Checklist
Describe the issue
I am trying to train PointPillars on nuScenes, but I get
RuntimeError: CUDA out of memory.
errors using the provided implementation.
Reproduction
The OOM error occurs without any modification, i.e. when executing the reproduction command.
I tried to solve the issue by adjusting the batch size to a smaller number (I tried 1 and 2). However, this results in a different error:
(I am not sure whether this is the correct way to decrease the batch size.)
NuScenes (configs/_base_/datasets/nus-3d.py)
Environment
python mmdet3d/utils/collect_env.py
to collect necessary environment information and paste it here.
This is my first time doing deep learning and using CUDA, so I'm not sure what the best way forward is to solve this issue (other than upgrading my VM to use a bigger GPU).