run on different gpu? training images get corrupted during training? #187

Open
Shanshan-Huang opened this issue Dec 16, 2020 · 0 comments

Shanshan-Huang commented Dec 16, 2020

Dear @xinntao,

I have two questions regarding running the EDVR model.

  1. Every time I submit the job to a different GPU (even one of the same type, e.g. Titan X), I have to run rm -r build/ and python setup.py develop again; otherwise I get "error in modulated_deformable_im2col_cuda: no kernel image is available for execution on the device". I suspect that it has something to do with the dynamic installation? Right now I have to keep three copies of the same repo in order to run 3 jobs simultaneously. Is this the way to go, or are there better options?
    Following one of the suggestions in another post, I am using PyTorch 1.4 and torchvision 0.5 with cudatoolkit 10.1.

  2. I always get the following error, and sometimes even an explicit PNG CRC error, when cv2.imdecode() returns None. I realized that the training PNGs are somehow corrupted, even though I verified all images before training. Have you encountered this problem before? Is it related to multi-process data loading? It happens every time, especially when I turn off TSA and set frame to 1.

Traceback (most recent call last):
  File "basicsr/train.py", line 252, in <module>
    main()
  File "basicsr/train.py", line 234, in main
    train_data = prefetcher.next()
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/prefetch_dataloader.py", line 76, in next
    return next(self.loader)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 838, in _next_data
    return self._process_data(data)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/moving_cityscape_dataset.py", line 147, in __getitem__
    img_gt = imfrombytes(img_bytes, float32=True)
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/utils/img_util.py", line 125, in imfrombytes
    img = img.astype(np.float32) / 255.
AttributeError: 'NoneType' object has no attribute 'astype'
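
Since the images passed verification before training but fail to decode during it, one way to narrow this down might be to re-check the raw bytes at read time, before they reach cv2.imdecode(). The sketch below walks the PNG chunks and compares each stored CRC with one recomputed by zlib. The names png_chunks_ok and make_tiny_png are hypothetical helpers for illustration and are not part of BasicSR; the check covers chunk CRCs and truncation only, not whether the pixel data decompresses.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunks_ok(data: bytes) -> bool:
    """Return True if every chunk CRC matches and an IEND chunk is reached.

    Hypothetical helper: validates container integrity only.
    """
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, payload, 4-byte CRC.
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(data):
            return False  # payload or CRC is cut off
        (stored,) = struct.unpack(">I", data[end:end + 4])
        # The CRC covers the chunk type and payload, not the length field.
        if zlib.crc32(ctype + data[pos + 8:end]) & 0xFFFFFFFF != stored:
            return False
        if ctype == b"IEND":
            return True
        pos = end + 4
    return False  # ran out of bytes before IEND


def _chunk(ctype: bytes, payload: bytes) -> bytes:
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def make_tiny_png() -> bytes:
    """Build a 1x1 8-bit grayscale PNG, used here only to exercise the check."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IDAT", idat) + _chunk(b"IEND", b"")
```

Running such a check on img_bytes just before imfrombytes, and logging the file path when it fails, could show whether the same files fail on every read or only intermittently.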

/scratch/slurm/spool/job219938/slurm_script: line 31: 16770 Bus error               python -u basicsr/train.py -opt options/train/EDVR/train_EDVR_DARK_20_frame_window_1_patch_64.yml
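
On the first question, "no kernel image is available" usually indicates that the compiled extension contains no code for the compute capability of the GPU it is running on. Cards sold under the Titan X name exist in more than one generation (Maxwell at 5.2, Pascal at 6.1), so nodes of the "same type" may still differ. A possible alternative to keeping three copies of the repo is to build once for every architecture in use by setting the TORCH_CUDA_ARCH_LIST environment variable, which PyTorch's extension builder reads. The sketch below only formats that value; arch_list and local_capabilities are hypothetical helper names, and it is an assumption that the repo's setup.py goes through PyTorch's extension builder.

```python
def arch_list(capabilities):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def local_capabilities():
    """Compute capabilities of the visible GPUs, or [] without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]
```

Collecting local_capabilities() on each node and passing the combined pairs to arch_list gives a string such as 5.2;6.1, which could then be exported as TORCH_CUDA_ARCH_LIST before running python setup.py develop once.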

Thank you very much for your help :) and best wishes!
