Question on DINOv2 Target Layers and Feature Combination in Ablation Studies #15

@1chizhang

Hi authors,

Very interesting work! I've been reading the paper and found the Geometry Forcing method very insightful.

I had a question regarding the ablation study in Table 2, where you compare aligning with VGGT features versus DINOv2 features (and their combination).

1. Which layers of DINOv2 are used?
In Section 4.2, you explicitly mention that for VGGT, the target $y$ includes features from all $L$ layers of the backbone ($L \times N \times P \times D$) to capture both local and global information.
For the DINOv2 experiments, do you also align to all intermediate layers of the DINOv2 ViT backbone, or do you only align to the final/high-level semantic layer?

2. How are the representations combined?
For the "VGGT + DINOv2" entry in Table 2, could you clarify how the joint supervision is implemented? Are the losses for VGGT and DINOv2 simply added together (e.g., $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{VGGT}} + \mathcal{L}_{\text{DINO}}$), or is there a different mechanism for combining these features before alignment?
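
To make both questions concrete, here is a rough NumPy sketch of the options I have in mind. Everything in it is my own assumption rather than something taken from the paper or the repo: the cosine-based `cosine_alignment_loss`, the toy tensor sizes, the random stand-in features, the names `h_vggt` / `h_dino` for the projected model features, and the weights `w_vggt` / `w_dino` are all hypothetical.

```python
import numpy as np

def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity), with features on the last axis.

    Hypothetical stand-in for the alignment loss; the actual loss may differ.
    """
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

rng = np.random.default_rng(0)
L, N, P, D = 4, 2, 16, 8  # toy sizes: layers, frames, patches, feature dim

# Random stand-ins for the frozen target features.
y_vggt = rng.normal(size=(L, N, P, D))      # all L layers, as in Section 4.2
y_dino_all = rng.normal(size=(L, N, P, D))  # option A: all intermediate layers
y_dino_last = y_dino_all[-1]                # option B: final layer only, (N, P, D)

# Random stand-ins for the projected features of the model being trained.
h_vggt = rng.normal(size=(L, N, P, D))
h_dino = rng.normal(size=(L, N, P, D))

loss_vggt = cosine_alignment_loss(h_vggt, y_vggt)
loss_dino_all = cosine_alignment_loss(h_dino, y_dino_all)        # option A
loss_dino_last = cosine_alignment_loss(h_dino[-1], y_dino_last)  # option B

# Question 2: is the joint objective just a (weighted) sum like this?
w_vggt, w_dino = 1.0, 1.0  # hypothetical weights
loss_total = w_vggt * loss_vggt + w_dino * loss_dino_all
print(loss_vggt, loss_dino_all, loss_dino_last, loss_total)
```

If only the final DINOv2 layer is used, `loss_dino_last` would take the place of `loss_dino_all` in the sum. Is either variant close to what was run for Table 2?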

Thank you for your time and for open-sourcing this work!
