Question on DINOv2 Target Layers and Feature Combination in Ablation Studies



Hi authors,

Very interesting work! I've been reading the paper and found the Geometry Forcing method very insightful.

I have two questions about the ablation study in **Table 2**, where you compare aligning with VGGT features versus DINOv2 features (and their combination).

**1. Which layers of DINOv2 are used?**
In Section 4.2, you state that for VGGT the target $y$ includes features from all $L$ layers of the backbone (shape $L \times N \times P \times D$) to capture both local and global information.
For the **DINOv2** experiments, do you also align to **all intermediate layers** of the DINOv2 ViT backbone, or do you only align to the final/high-level semantic layer?

**2. How are the representations combined?**
For the "VGGT + DINOv2" entry in Table 2, could you clarify how the joint supervision is implemented? Are the losses for VGGT and DINOv2 simply added together (e.g., `L_total = L_vggt + L_dino`), or is there a different mechanism for combining these features before alignment?
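
To make the second question concrete, here is a minimal sketch of the additive option I have in mind. Everything in it is my own assumption rather than something taken from the paper or the released code: the helper names `cosine_alignment_loss` and `combined_loss`, the weights `w_vggt` and `w_dino`, and the choice of `1 - cosine similarity` as a stand-in per-target loss.

```python
import math

def cosine_alignment_loss(pred, target):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Hypothetical stand-in for the per-target alignment loss;
    the actual loss used in the paper may differ.
    """
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_t = math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / (norm_p * norm_t)
    return total / len(pred)

def combined_loss(pred_vggt, y_vggt, pred_dino, y_dino, w_vggt=1.0, w_dino=1.0):
    """Assumed additive combination: L_total = w_vggt * L_vggt + w_dino * L_dino."""
    l_vggt = cosine_alignment_loss(pred_vggt, y_vggt)
    l_dino = cosine_alignment_loss(pred_dino, y_dino)
    return w_vggt * l_vggt + w_dino * l_dino
```

With both weights at 1.0 this reduces to `L_total = L_vggt + L_dino`. If the implementation instead concatenates or fuses the two feature sets before a single alignment loss, it would be helpful to know that as well.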

Thank you for your time and for open-sourcing this work!
