
Training speed #11

Open · yongmayer opened this issue May 14, 2024 · 2 comments

@yongmayer commented May 14, 2024

Hello,

Thank you so much for sharing this amazing work!

I am trying to train the model on the NYU dataset. The paper reports about 21 minutes per epoch on 8 A100 GPUs.
I am using a single A100 GPU with batch size 32, and in my case training seems stuck forever at the following step. Meanwhile, I can run the evaluation without issue. I don't know what the problem could be, and I would appreciate any help or hints!

```python
with torch.no_grad():
    # Convert the input image to latent space and scale.
    latents = self.encoder_vq.encode(x).mode().detach() * self.config.model.params.scale_factor
```
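For reference, here is a standalone sketch of the same encode-to-latents step using diffusers' `AutoencoderKL` (my own illustration, not the repo's code; I assume the usual SD scale factor of 0.18215, whereas the repo reads it from `self.config.model.params.scale_factor`):

```python
# Standalone sketch, not the EcoDepth code: encode an image batch to Stable
# Diffusion latent space and scale. 0.18215 is the usual SD scale factor.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
x = torch.randn(4, 3, 512, 512)  # dummy image batch in [-1, 1]

with torch.no_grad():
    latents = vae.encode(x).latent_dist.mode() * 0.18215
print(latents.shape)  # torch.Size([4, 4, 64, 64])
```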

P.S. The evaluation results match the paper well, except for sq_rel:

| d1 | d2 | d3 | abs_rel | sq_rel | rmse | rmse_log | log10 | silog |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.9776 | 0.9973 | 0.9995 | 0.0599 | 0.0194 | 0.2187 | 0.0773 | 0.0259 | 5.7549 |

Again, thanks for the great work!

@Aradhye2002 (Owner) commented

Hi @yongmayer, thanks for appreciating our work. We used a per-device batch size of 4, resulting in a total batch size of 32 across 8 GPUs. The speed issue is probably because you are using a per-device batch size of 32 instead of 4. Could you try once with a batch size of 4 (with a single GPU, i.e., your current setup) and let me know if it works?
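To spell out the arithmetic (an illustrative sketch, not our actual training script): under PyTorch `DistributedDataParallel`, the `DataLoader` batch size applies per process (one process per GPU), so the effective global batch size is the per-device batch size times the number of GPUs.

```python
# Illustrative sketch, not the EcoDepth training script: per-device vs.
# effective batch size under DDP. Each GPU runs its own process, and the
# DataLoader batch_size applies per process.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 480, 640))  # dummy NYU-sized images
loader = DataLoader(dataset, batch_size=4)  # per-device batch of 4

world_size = 8                      # the paper's setup: 8 x A100
effective_batch = 4 * world_size    # = 32, the paper's total batch size
print(effective_batch)              # 32
```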

@yongmayer (Author) commented May 20, 2024

Hi @Aradhye2002, thank you so much! That works!

I have another question, if you don't mind my asking: how should I understand the diffusion process in EcoDepth?
From line 96 in EcoDepth/depth/models/model.py (EcoDepthEncoder.forward), I see that it uses the UNet from Stable Diffusion, but I cannot see the forward (noising) diffusion process. Am I misunderstanding something? I am new to diffusion-based depth estimation, and I would greatly appreciate your explanation!
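To check my understanding, here is a minimal sketch (my own assumptions, not the repo's code) of what I think that line does: a single forward pass through the Stable Diffusion UNet at a fixed timestep, with no noise added, used purely as a feature extractor.

```python
# Hypothetical sketch, not EcoDepth's implementation: one pass through the
# Stable Diffusion UNet at a fixed timestep, with no noising/denoising loop.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
).eval()

latents = torch.randn(1, 4, 64, 64)  # VAE latents of the input image
t = torch.tensor([1])                # one fixed timestep, no noise schedule
cond = torch.randn(1, 77, 768)       # placeholder conditioning embeddings

with torch.no_grad():
    out = unet(latents, t, encoder_hidden_states=cond).sample
print(out.shape)  # torch.Size([1, 4, 64, 64])
```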

Again, thank you!
