
RuntimeError: CUDA error: out of memory for "DRN-D-105" while testing #49

bahetibhakti opened this issue Sep 14, 2019 · 5 comments


Is anyone able to run the "DRN-D-105" architecture on the test data?
I can train and validate, but testing fails with "RuntimeError: CUDA error: out of memory", even with a small crop size of 256×256 and a batch size of 1.
I checked resources during testing and both GPU memory and system RAM are sufficiently free.
I am using an NVIDIA P100 GPU with 16 GB of memory.

Any thoughts?

(bhakti) user@user:/mnt/komal/bhakti/anue$ python3 segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
Namespace(arch='drn_d_105', batch_size=1, bn_sync=False, classes=26, cmd='test', crop_size=896, data_dir='dataset/', epochs=10, evaluate=False, list_dir=None, load_rel=None, lr=0.01, lr_mode='step', momentum=0.9, ms=False, phase='test', pretrained='', random_rotate=0, random_scale=0, resume='model_best.pth.tar', step=200, test_suffix='', weight_decay=0.0001, with_gt=False, workers=2)
classes : 26
batch_size : 1
pretrained :
momentum : 0.9
with_gt : False
phase : test
list_dir : None
lr_mode : step
weight_decay : 0.0001
epochs : 10
step : 200
bn_sync : False
ms : False
arch : drn_d_105
random_rotate : 0
random_scale : 0
workers : 2
crop_size : 896
lr : 0.01
load_rel : None
resume : model_best.pth.tar
evaluate : False
cmd : test
data_dir : dataset/
test_suffix :
[2019-09-14 19:14:23,173 segment.py:697 test_seg] => loading checkpoint 'model_best.pth.tar'
[2019-09-14 19:14:23,509 segment.py:703 test_seg] => loaded checkpoint 'model_best.pth.tar' (epoch 1)
segment.py:540: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
image_var = Variable(image, requires_grad=False, volatile=True)
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f15eff61160>>
Traceback (most recent call last):
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in del
self._shutdown_workers()
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/queues.py", line 337, in get
return ForkingPickler.loads(res)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
return recvfds(s, 1)[0]
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "segment.py", line 789, in
main()
File "segment.py", line 785, in main
test_seg(args)
File "segment.py", line 720, in test_seg
has_gt=phase != 'test' or args.with_gt, output_dir=out_dir)
File "segment.py", line 544, in test
final = model(image_var)[0]
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "segment.py", line 142, in forward
y = self.up(x)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 691, in forward
output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: out of memory
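
For reference, memory that nvidia-smi reports as free before the run says little about the peak reached inside the forward pass. A minimal sketch for measuring that peak with the CUDA caching-allocator counters, assuming a recent PyTorch and using a small stand-in network instead of DRNSeg:

import torch
import torch.nn as nn

# Stand-in for the real model and test image; in this issue they would be
# DRNSeg (drn_d_105) and a full-resolution test frame.
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1)).cuda()
x = torch.randn(1, 3, 256, 256, device='cuda')

torch.cuda.reset_peak_memory_stats()   # zero the peak counters
y = net(x)                             # forward pass under test
print('peak allocated: %.1f MB' % (torch.cuda.max_memory_allocated() / 1e6))
print('peak reserved:  %.1f MB' % (torch.cuda.max_memory_reserved() / 1e6))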


jwzhi commented Mar 7, 2020

Same issue here.


lyxbyr commented Apr 28, 2020

Is this a bug?


lyxbyr commented Apr 28, 2020

Same issue. Were you able to solve it?


raven38 commented Jul 14, 2020

That is because the crop_size argument is ignored during testing; it is only applied when building the train and val loaders.
Please refer to:

drn/segment.py

Lines 632 to 640 in d75db2e

dataset = SegListMS(data_dir, phase, transforms.Compose([
transforms.ToTensor(),
normalize,
]), scales, list_dir=args.list_dir)
else:
dataset = SegList(data_dir, phase, transforms.Compose([
transforms.ToTensor(),
normalize,
]), list_dir=args.list_dir, out_name=True)

and

drn/segment.py

Lines 360 to 383 in d75db2e

t = []
if args.random_rotate > 0:
t.append(transforms.RandomRotate(args.random_rotate))
if args.random_scale > 0:
t.append(transforms.RandomScale(args.random_scale))
t.extend([transforms.RandomCrop(crop_size),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
normalize])
train_loader = torch.utils.data.DataLoader(
SegList(data_dir, 'train', transforms.Compose(t),
list_dir=args.list_dir),
batch_size=batch_size, shuffle=True, num_workers=num_workers,
pin_memory=True, drop_last=True
)
val_loader = torch.utils.data.DataLoader(
SegList(data_dir, 'val', transforms.Compose([
transforms.RandomCrop(crop_size),
transforms.ToTensor(),
normalize,
]), list_dir=args.list_dir),
batch_size=batch_size, shuffle=False, num_workers=num_workers,
pin_memory=True, drop_last=True
)
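
In other words, the test loader feeds full-resolution images to the network regardless of --crop-size. If test-time memory really had to be bounded by input size, one workaround would be to run the forward pass at a reduced resolution and upsample the logits back afterwards. This is a sketch only, not code from the repository; predict_downscaled is a hypothetical helper, and the [0] indexing follows the call in segment.py's test():

import torch
import torch.nn.functional as F

def predict_downscaled(model, image, scale=0.5):
    # image: 4D tensor (N, C, H, W). Run inference on a downscaled copy,
    # then restore the prediction to the original resolution.
    h, w = image.shape[-2:]
    small = F.interpolate(image, scale_factor=scale, mode='bilinear',
                          align_corners=False)
    with torch.no_grad():            # also avoids building the autograd graph
        logits = model(small)[0]     # [0] as in segment.py's test()
    return F.interpolate(logits, size=(h, w), mode='bilinear',
                         align_corners=False)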

@taesungp

It's because newer PyTorch versions removed volatile, which was used to disable gradient recording, so the forward pass at test time now builds the autograd graph and exhausts memory. The recommended replacement is torch.no_grad().

At the bottom of segment.py, wrap the call to main() in a with torch.no_grad(): block:

if __name__ == "__main__":
    with torch.no_grad():
        main()
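
An equivalent, more targeted option (a sketch of the same idea, not a change that exists in the repository) is to disable autograd only around the forward call inside test() in segment.py:

with torch.no_grad():
    final = model(image_var)[0]

Either way, no autograd graph is built during inference, which is what volatile=True used to achieve.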
