Hi authors,
Very interesting work! I've been reading the paper and found the Geometry Forcing method very insightful.
I had a question regarding the ablation study in Table 2, where you compare aligning with VGGT features versus DINOv2 features (and their combination).
1. Which layers of DINOv2 are used?
In Section 4.2, you explicitly mention that for VGGT, the target
For the DINOv2 experiments, do you also align to all intermediate layers of the DINOv2 ViT backbone, or do you only align to the final/high-level semantic layer?
2. How are the representations combined?
For the "VGGT + DINOv2" entry in Table 2, could you clarify how the joint supervision is implemented? Are the losses for VGGT and DINOv2 simply added together (e.g., L_total = L_vggt + L_dino), or is there a different mechanism for combining these features before alignment?
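To make the second question concrete, here is a minimal sketch of the simple additive scheme I have in mind. It is only my guess, not taken from the paper or the repository: the function names, the cosine-based alignment loss, and the weights `w_vggt` / `w_dino` are all hypothetical, and features are plain lists of non-zero vectors for illustration.

```python
import math


def cosine_alignment_loss(pred, target):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    Assumes every vector is non-zero (hypothetical helper, not from the repo).
    """
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / norm
    return total / len(pred)


def combined_loss(pred_vggt, feat_vggt, pred_dino, feat_dino,
                  w_vggt=1.0, w_dino=1.0):
    """Weighted sum of two independent alignment losses (assumed scheme).

    pred_* are the model's projected features for each target encoder;
    feat_* are the frozen target features from VGGT and DINOv2.
    """
    l_vggt = cosine_alignment_loss(pred_vggt, feat_vggt)
    l_dino = cosine_alignment_loss(pred_dino, feat_dino)
    return w_vggt * l_vggt + w_dino * l_dino
```

With `w_vggt = w_dino = 1.0` this reduces to L_total = L_vggt + L_dino. If the actual implementation instead fuses the two feature sets before alignment, or uses separate projection heads or weights, it would be great to know.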
Thank you for your time and for open-sourcing this work!