
Training speed #11

Open · yongmayer opened this issue May 14, 2024 · 2 comments

@yongmayer commented May 14, 2024

Hello,

Thank you so much for sharing this amazing work!

I am trying to train the model on the NYU dataset. The paper reports about 21 minutes per epoch on 8 A100 GPUs.
I am using a single A100 GPU with batch size 32, and in my case training seems stuck forever at the following step. Meanwhile, I can run the evaluation without issue. I don't know what the problem could be, and I would appreciate any help or hints!

```python
with torch.no_grad():
    # Convert the input image to latent space and scale.
    latents = self.encoder_vq.encode(x).mode().detach() * self.config.model.params.scale_factor
```
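For reference, here is a standalone sketch of the same encode-to-latents step using diffusers' `AutoencoderKL` (my own illustration, not the repo's code; I assume the usual SD scale factor of 0.18215, whereas the repo reads it from `self.config.model.params.scale_factor`):

```python
# Standalone sketch, not the EcoDepth code: encode an image batch to Stable
# Diffusion latent space and scale. 0.18215 is the usual SD scale factor.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
x = torch.randn(4, 3, 512, 512)  # dummy image batch in [-1, 1]

with torch.no_grad():
    latents = vae.encode(x).latent_dist.mode() * 0.18215
print(latents.shape)  # torch.Size([4, 4, 64, 64])
```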

P.S. The evaluation results match the paper well, except for sq_rel:

| d1 | d2 | d3 | abs_rel | sq_rel | rmse | rmse_log | log10 | silog |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.9776 | 0.9973 | 0.9995 | 0.0599 | 0.0194 | 0.2187 | 0.0773 | 0.0259 | 5.7549 |

Again, thanks for the great work!

@Aradhye2002 (Owner) commented

Hi @yongmayer, thanks for appreciating our work. We used a per-device batch size of 4, resulting in a total batch size of 32 across 8 GPUs. The speed issue is probably because you are using a per-device batch size of 32 instead of 4. Could you try once with a batch size of 4 (with a single GPU, i.e., your current setup) and let me know if it works?
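To spell out the arithmetic (an illustrative sketch, not our actual training script): under PyTorch `DistributedDataParallel`, the `DataLoader` batch size applies per process (one process per GPU), so the effective global batch size is the per-device batch size times the number of GPUs.

```python
# Illustrative sketch, not the EcoDepth training script: per-device vs.
# effective batch size under DDP. Each GPU runs its own process, and the
# DataLoader batch_size applies per process.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 480, 640))  # dummy NYU-sized images
loader = DataLoader(dataset, batch_size=4)  # per-device batch of 4

world_size = 8                      # the paper's setup: 8 x A100
effective_batch = 4 * world_size    # = 32, the paper's total batch size
print(effective_batch)              # 32
```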

@yongmayer (Author) commented May 20, 2024

Hi @Aradhye2002, thank you so much! That works!

I have another question, if you don't mind my asking: how should I understand the diffusion process in EcoDepth?
From line 96 in EcoDepth/depth/models/model.py (EcoDepthEncoder.forward), I see that it uses the UNet from Stable Diffusion, but I cannot see the forward (noising) diffusion process. Am I misunderstanding something? I am new to diffusion-based depth estimation, and I would greatly appreciate your explanation!
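To check my understanding, here is a minimal sketch (my own assumptions, not the repo's code) of what I think that line does: a single forward pass through the Stable Diffusion UNet at a fixed timestep, with no noise added, used purely as a feature extractor.

```python
# Hypothetical sketch, not EcoDepth's implementation: one pass through the
# Stable Diffusion UNet at a fixed timestep, with no noising/denoising loop.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
).eval()

latents = torch.randn(1, 4, 64, 64)  # VAE latents of the input image
t = torch.tensor([1])                # one fixed timestep, no noise schedule
cond = torch.randn(1, 77, 768)       # placeholder conditioning embeddings

with torch.no_grad():
    out = unet(latents, t, encoder_hidden_states=cond).sample
print(out.shape)  # torch.Size([1, 4, 64, 64])
```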

Again, thank you!
