
RuntimeError: CUDA error: out of memory for "DRN-D-105" while testing #49

bahetibhakti opened this issue Sep 14, 2019 · 5 comments


Is anyone able to run the "DRN-D-105" architecture on the test data?
I can train and validate, but testing fails with "RuntimeError: CUDA error: out of memory", even with a small crop size of 256×256 and a batch size of 1.
I checked resources during testing and both GPU memory and system RAM are sufficiently free.
I am using an NVIDIA P100 GPU with 16 GB of memory.

Any thoughts?

(bhakti) user@user:/mnt/komal/bhakti/anue$ python3 segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
Namespace(arch='drn_d_105', batch_size=1, bn_sync=False, classes=26, cmd='test', crop_size=896, data_dir='dataset/', epochs=10, evaluate=False, list_dir=None, load_rel=None, lr=0.01, lr_mode='step', momentum=0.9, ms=False, phase='test', pretrained='', random_rotate=0, random_scale=0, resume='model_best.pth.tar', step=200, test_suffix='', weight_decay=0.0001, with_gt=False, workers=2)
classes : 26
batch_size : 1
pretrained :
momentum : 0.9
with_gt : False
phase : test
list_dir : None
lr_mode : step
weight_decay : 0.0001
epochs : 10
step : 200
bn_sync : False
ms : False
arch : drn_d_105
random_rotate : 0
random_scale : 0
workers : 2
crop_size : 896
lr : 0.01
load_rel : None
resume : model_best.pth.tar
evaluate : False
cmd : test
data_dir : dataset/
test_suffix :
[2019-09-14 19:14:23,173 segment.py:697 test_seg] => loading checkpoint 'model_best.pth.tar'
[2019-09-14 19:14:23,509 segment.py:703 test_seg] => loaded checkpoint 'model_best.pth.tar' (epoch 1)
segment.py:540: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
image_var = Variable(image, requires_grad=False, volatile=True)
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f15eff61160>>
Traceback (most recent call last):
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in del
self._shutdown_workers()
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/queues.py", line 337, in get
return ForkingPickler.loads(res)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
return recvfds(s, 1)[0]
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "segment.py", line 789, in
main()
File "segment.py", line 785, in main
test_seg(args)
File "segment.py", line 720, in test_seg
has_gt=phase != 'test' or args.with_gt, output_dir=out_dir)
File "segment.py", line 544, in test
final = model(image_var)[0]
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "segment.py", line 142, in forward
y = self.up(x)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 691, in forward
output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: out of memory
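
For reference, memory that nvidia-smi reports as free before the run says little about the peak reached inside the forward pass. A minimal sketch for measuring that peak with the CUDA caching-allocator counters, assuming a recent PyTorch and using a small stand-in network instead of DRNSeg:

import torch
import torch.nn as nn

# Stand-in for the real model and test image; in this issue they would be
# DRNSeg (drn_d_105) and a full-resolution test frame.
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1)).cuda()
x = torch.randn(1, 3, 256, 256, device='cuda')

torch.cuda.reset_peak_memory_stats()   # zero the peak counters
y = net(x)                             # forward pass under test
print('peak allocated: %.1f MB' % (torch.cuda.max_memory_allocated() / 1e6))
print('peak reserved:  %.1f MB' % (torch.cuda.max_memory_reserved() / 1e6))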


jwzhi commented Mar 7, 2020

Same issue here.


lyxbyr commented Apr 28, 2020

Is this a bug?


lyxbyr commented Apr 28, 2020

Same issue. Were you able to solve it?


raven38 commented Jul 14, 2020

That is because the crop_size argument is ignored during testing; it is only applied when building the train and val loaders.
Please refer to:

drn/segment.py

Lines 632 to 640 in d75db2e

dataset = SegListMS(data_dir, phase, transforms.Compose([
transforms.ToTensor(),
normalize,
]), scales, list_dir=args.list_dir)
else:
dataset = SegList(data_dir, phase, transforms.Compose([
transforms.ToTensor(),
normalize,
]), list_dir=args.list_dir, out_name=True)

and

drn/segment.py

Lines 360 to 383 in d75db2e

t = []
if args.random_rotate > 0:
t.append(transforms.RandomRotate(args.random_rotate))
if args.random_scale > 0:
t.append(transforms.RandomScale(args.random_scale))
t.extend([transforms.RandomCrop(crop_size),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
normalize])
train_loader = torch.utils.data.DataLoader(
SegList(data_dir, 'train', transforms.Compose(t),
list_dir=args.list_dir),
batch_size=batch_size, shuffle=True, num_workers=num_workers,
pin_memory=True, drop_last=True
)
val_loader = torch.utils.data.DataLoader(
SegList(data_dir, 'val', transforms.Compose([
transforms.RandomCrop(crop_size),
transforms.ToTensor(),
normalize,
]), list_dir=args.list_dir),
batch_size=batch_size, shuffle=False, num_workers=num_workers,
pin_memory=True, drop_last=True
)
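
In other words, the test loader feeds full-resolution images to the network regardless of --crop-size. If test-time memory really had to be bounded by input size, one workaround would be to run the forward pass at a reduced resolution and upsample the logits back afterwards. This is a sketch only, not code from the repository; predict_downscaled is a hypothetical helper, and the [0] indexing follows the call in segment.py's test():

import torch
import torch.nn.functional as F

def predict_downscaled(model, image, scale=0.5):
    # image: 4D tensor (N, C, H, W). Run inference on a downscaled copy,
    # then restore the prediction to the original resolution.
    h, w = image.shape[-2:]
    small = F.interpolate(image, scale_factor=scale, mode='bilinear',
                          align_corners=False)
    with torch.no_grad():            # also avoids building the autograd graph
        logits = model(small)[0]     # [0] as in segment.py's test()
    return F.interpolate(logits, size=(h, w), mode='bilinear',
                         align_corners=False)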

@taesungp

It's because newer PyTorch versions removed volatile, which was used to disable gradient recording, so the forward pass at test time now builds the autograd graph and exhausts memory. The recommended replacement is torch.no_grad().

At the bottom of segment.py, wrap the call to main() in a with torch.no_grad(): block:

if __name__ == "__main__":
    with torch.no_grad():
        main()
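
An equivalent, more targeted option (a sketch of the same idea, not a change that exists in the repository) is to disable autograd only around the forward call inside test() in segment.py:

with torch.no_grad():
    final = model(image_var)[0]

Either way, no autograd graph is built during inference, which is what volatile=True used to achieve.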
