
Out of memory in evaluation after train epochs #18

Closed
caide199212 opened this issue Feb 22, 2023 · 3 comments


@caide199212

When reproducing the results on the REDS dataset, I reduced num_input_frames to 20 to fit into GPU memory, so training runs fine: about 8 GB is used on each 16 GB GPU.
But during evaluation, the GPU memory consumed by the training iterations does not seem to be released, so evaluation runs out of memory:
RuntimeError: CUDA out of memory. Tried to allocate 436.00 MiB (GPU 1; 15.90 GiB total capacity; 10.98 GiB already allocated; 189.81 MiB free; 14.78 GiB reserved in total by PyTorch)

The data config is as follows:

data = dict(
    workers_per_gpu=4,
    train_dataloader=dict(samples_per_gpu=1, drop_last=True),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=1),
)

BTW, a similar problem was reported in mmdet, but no feasible solution was found.
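
For context, a generic mitigation for this kind of OOM (not from this thread, and not guaranteed to help here) is to release PyTorch's cached allocator blocks and disable autograd before the evaluation loop starts. A minimal sketch, where model and data_loader stand in for the actual runner objects:

import gc
import torch

def run_evaluation(model, data_loader):
    # Drop leftover Python references from training and return unused
    # cached blocks to the driver before allocating evaluation tensors.
    gc.collect()
    torch.cuda.empty_cache()

    model.eval()
    results = []
    # Disable autograd so no activation buffers are retained for backward.
    with torch.no_grad():
        for data in data_loader:
            results.append(model(data))
    return results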

@ericzw
Member

ericzw commented Feb 23, 2023

I don't know how to solve this problem.
Maybe you can skip the evaluation during the training process.
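
One way to follow this suggestion in an mmcv 1.x style config (a sketch; the interval value and total_iters number are illustrative, not taken from this issue) is to push the evaluation interval past the total number of training iterations, so the evaluation hook never fires, and run the test script separately after training:

total_iters = 300000
# An interval larger than total_iters means the evaluation hook never triggers.
evaluation = dict(interval=total_iters + 1, save_image=False)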

@caide199212
Author

Good idea, thanks.

@caide199212 closed this as not planned on Feb 27, 2023
@sunyclj

sunyclj commented Jun 2, 2023

> I don't know how to solve this problem.
> Maybe you can skip the evaluation during the training process.

When setting evaluation = dict(interval=1000000, save_image=False, gpu_collect=True), the same error still occurs.
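
If evaluation does have to run during training, one knob that may be worth trying (an assumption, not verified in this thread; the interval value is illustrative) is gpu_collect=False, which makes distributed result collection go through the CPU and a temporary directory instead of GPU memory:

# Collect results across ranks on CPU rather than GPU to reduce GPU memory pressure.
evaluation = dict(interval=5000, save_image=False, gpu_collect=False)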
