
Is it possible to step further by using off-the-shelf mono-depth instead of the features only? #1

Open
JUGGHM opened this issue Oct 18, 2024 · 6 comments

Comments


JUGGHM commented Oct 18, 2024

Thank you for this great work! I was impressed by the design even before you posted it on arXiv (I noticed it on OpenReview, though I am not a reviewer).

Is it possible to employ the monocular depth predictions directly? In your design, only the features are used and the DPT head is dropped. But we know that Depth Anything V2 can produce high-quality depth, and it would be a pity for such prior information to be lost. Have you run any experiments in this direction?

@haofeixu
Member

Hi, thank you for your insightful question. Indeed, we initially considered directly using monocular depth predictions from Depth Anything. However, the monodepth model predicts relative depth values with unknown scale and shift parameters. For our application in Gaussian splatting, we require multi-view consistent depths, which can be combined into a coherent global 3D representation. We found it challenging to convert the relative depth to scale-consistent depths. This issue becomes even more pronounced as the number of views increases.
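To make the scale/shift issue concrete, here is a toy numpy sketch (illustrative data and a hypothetical `align_scale_shift` helper, not the project's code): relative depth is only defined up to an unknown per-view scale and shift, and the least-squares alignment recovered for one view generally differs from another's, so independently predicted views do not line up into one metric 3D representation.

```python
import numpy as np

def align_scale_shift(d_rel, d_metric):
    """Least-squares scale s and shift t so that s * d_rel + t ~= d_metric.
    This per-view alignment is standard when evaluating relative-depth models;
    the point here is that (s, t) differ from view to view."""
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, d_metric, rcond=None)
    return s, t

# toy scene: two views observe the same metric depths, but each relative
# prediction carries its own unknown scale and shift (plus small noise)
rng = np.random.default_rng(0)
d_metric = rng.uniform(1.0, 10.0, size=100)
d_rel_view1 = 0.5 * d_metric + 2.0 + rng.normal(0, 0.01, 100)
d_rel_view2 = 1.5 * d_metric - 0.3 + rng.normal(0, 0.01, 100)

s1, t1 = align_scale_shift(d_rel_view1, d_metric)  # ~ (2.0, -4.0)
s2, t2 = align_scale_shift(d_rel_view2, d_metric)  # ~ (0.67, 0.2)
# the recovered (s, t) differ per view, so without metric ground truth there
# is no single transform that makes both raw predictions mutually consistent
```

With more views, each new image adds its own unknown pair (s, t), which is the thread's point about the problem becoming more pronounced as the number of views grows.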

On the other hand, we explored an alternative approach of feature-level fusion, which we found worked surprisingly well. The method is also very simple, which avoids the complications associated with aligning relative depth scales. As a result, we opted for this design over relying on direct depth predictions.
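A minimal sketch of what feature-level fusion could look like (shapes and names are illustrative assumptions, not the actual implementation): the monocular features are concatenated with the cost volume along the channel dimension, and the combined tensor is what the depth regressor consumes, sidestepping any scale/shift alignment.

```python
import numpy as np

# illustrative shapes only; the real model operates on learned feature maps
H, W = 32, 32                                # e.g. a downsampled feature grid
rng = np.random.default_rng(0)
mono_feat = rng.normal(size=(H, W, 128))     # monocular encoder features
cost_volume = rng.normal(size=(H, W, 64))    # multi-view matching cost volume

# channel-wise concatenation: the fused tensor is the regressor's input
fused = np.concatenate([mono_feat, cost_volume], axis=-1)  # (H, W, 192)
```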

It's also worth noting a related observation: when fine-tuning a pre-trained relative depth model for metric depth predictions, a common strategy is to retain only the pre-trained encoder and introduce a new decoder to predict metric depth. Our design shares similarities with this approach.

I hope this helps and we’re happy to continue the discussion if you have further questions or insights.

@JUGGHM
Author

JUGGHM commented Oct 20, 2024

Thank you for your detailed and insightful answer!

@Soooooda69

Hi, thank you for the great work! I noticed that in the paper you mention the depth is regressed with a UNet from the concatenated monocular features and cost volumes. After skimming the code, however, I found that the depth is regressed with a DPT head. Am I misunderstanding this part?

@haofeixu
Member

Hi, the DPT head is mentioned in the last sentence of section "Feature Fusion and Depth Regression" in our paper. Since the UNet regresses depth maps at the downsampled feature resolution, we use an additional DPT head to upsample depth to the full resolution.

@Soooooda69

Thanks for the answer! I wonder why you don't directly use the DPT head to regress full-resolution depth?

@haofeixu
Member

Hi, the low-resolution depth is predicted with a softmax layer that performs a weighted average over all the candidate depths, which is compatible with the cost volume representation (i.e., feature matching). The importance of the cost volume is ablated in Table 2, and we found it helps significantly. The combination of UNet and DPT head can be understood as "matching + regression", which we found works well, as similarly observed in our unimatch paper.
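The softmax weighted average mentioned above (often called soft-argmin in the stereo/MVS literature) can be sketched in a few lines of numpy; the shapes and score values below are illustrative assumptions, not the project's actual tensors.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# D discrete depth candidates shared by all pixels (illustrative range)
D = 64
depth_candidates = np.linspace(0.5, 10.0, D)                # (D,)

# per-pixel matching scores over the candidates, e.g. from a cost volume
scores = np.random.default_rng(0).normal(size=(4, 4, D))    # (H, W, D)

weights = softmax(scores, axis=-1)                          # sums to 1 per pixel
depth = (weights * depth_candidates).sum(axis=-1)           # (H, W) expected depth
```

Because the output is a convex combination of the candidates, each predicted depth necessarily lies within the candidate range, and the operation stays differentiable, which is what makes it compatible with end-to-end training on a cost volume.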
