Efstathios Karypidis1,3, Ioannis Kakogeorgiou1, Spyros Gidaris2, Nikos Komodakis1,4,5
1Archimedes/Athena RC 2valeo.ai
3National Technical University of Athens 4University of Crete 5IACM-Forth
This repository contains the official implementation of the paper: DINO-Foresight: Looking into the Future with DINO
- News
- Installation
- Dataset Preparation
- Precompute PCA
- Dino-Foresight Training
- Downstream Tasks
- Evaluation
- Demo
- Citation
- Acknowledgements
2024-12-17: arXiv preprint and GitHub repository are released!
The code is tested with Python 3.11 and PyTorch 2.2.0+cu121 on Ubuntu 22.04.5 LTS. Create a new conda environment:
conda create -n dinof python=3.11
conda activate dinof
Install PyTorch:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
Clone the repository and install the remaining required packages:
git clone https://github.com/Sta8is/DINO-Foresight
cd DINO-Foresight
pip install -r requirements.txt
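To verify the installation, a quick sanity check (the expected values assume the CUDA 12.1 wheels installed above):

```python
import torch

# Confirm the CUDA build of PyTorch is active.
print(torch.__version__)          # expect 2.2.0+cu121
print(torch.version.cuda)         # expect 12.1
print(torch.cuda.is_available())  # expect True on a GPU machine
```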
We use the Cityscapes dataset for our experiments. Specifically, we use the leftImg8bit_sequence_trainvaltest package
to train the DINO-Foresight feature prediction model. You can download the sequence dataset from the official website.
cityscapes
│
└───leftImg8bit_sequence
    ├───train
    ├───val
    └───test
For semantic/instance segmentation, we use the leftImg8bit
and gtFine
packages. To create targets for depth and surface normals, we use off-the-shelf networks.
For more details on preparing the segmentation, depth, and surface-normal modalities, refer to Preparation of Labels for Library of Heads.
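As a quick sanity check after extracting the sequence package, you can count the frames per split. This snippet is illustrative and not part of the repository; adjust the root path to your setup:

```python
from pathlib import Path

# Count extracted sequence frames per split (Cityscapes frames end in
# "_leftImg8bit.png"). Replace the root with your local dataset path.
root = Path("/path/to/cityscapes/leftImg8bit_sequence")
for split in ["train", "val", "test"]:
    n = sum(1 for _ in (root / split).rglob("*_leftImg8bit.png"))
    print(f"{split}: {n} frames")
```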
To precompute the PCA matrices for different Vision Foundation Models (VFMs) on the Cityscapes leftImg8bit
training set (2,975 images), use the pca.py script. For example, to precompute a PCA with 1152 components for DINOv2 features extracted from layers 3, 6, 9, 12 (passed zero-indexed as 2,5,8,11) at image size 448x896, run the following command:
python pca.py --feature_extractor dinov2 --layers 2,5,8,11 --image_size 448,896 --n_components 1152 --cityscapes_root /path/to/cityscapes/leftImg8bit
However, the implementation is based on scikit-learn (CPU-only) and may require substantial RAM and time to compute the PCA matrices for features extracted from the training set. For this reason, we provide the precomputed PCA checkpoints here. To download the precomputed PCA via command line:
gdown https://drive.google.com/uc?id=1RB_ksbvzN0TGE5HyNVKGrmbLElWu90qt
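For intuition, the sketch below shows the kind of computation pca.py performs (fitting a scikit-learn PCA basis on patch features pooled over the training images) and how a fitted basis projects features down to 1152 dimensions. The shapes and variable names are illustrative assumptions, not the repository's actual interfaces:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative stand-in: `features` represents DINOv2 patch features from the
# chosen layers over the training images, flattened to (n_patches, feat_dim).
# The real pca.py handles feature extraction and checkpointing itself.
rng = np.random.default_rng(0)
features = rng.standard_normal((10_000, 3072)).astype(np.float32)  # e.g. 4 layers x 768 dims (ViT-B)

pca = PCA(n_components=1152)
pca.fit(features)  # CPU-bound, hence the precomputed checkpoints above

z = pca.transform(features[:8])    # project to the 1152-dim PCA space
x_rec = pca.inverse_transform(z)   # reconstruct back to feature space
print(z.shape, x_rec.shape)        # (8, 1152) (8, 3072)
```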
The training of DINO-Foresight is divided into two stages: in the first stage, we train the model at low resolution (224x448), and in the second stage, we fine-tune it at high resolution (448x896).
To train DINO-Foresight at low resolution (224x448) using the default hyperparameters, run the following command:
python train.py --num_workers=16 --num_workers_val=4 --num_gpus=8 --precision 16-mixed --eval_freq 10 --batch_size 8 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1 --max_epochs 800 \
--eval_mode_during_training --evaluate --single_step_sample_train --lr_base 8e-5 --loss_type SmoothL1 --masking "simple_replace" --seperable_attention --random_horizontal_flip \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence" --sequence_length 5 --dinov2_variant "vitb14_reg" --d_layers 2,5,8,11 --train_mask_mode "full_mask" \
--pca_ckpt "/path/to/pca_448_l[2_5_8_11]_1152.pth" \
--dst_path /logdir/dino_foresight_lowres_pca_fullmask
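To make the training objective concrete, here is a minimal, self-contained sketch of masked future-feature prediction with a SmoothL1 loss, in the spirit of the command above. The tiny transformer and the masking scheme are illustrative stand-ins, not the repository's actual model:

```python
import torch
import torch.nn as nn

# Toy setup: PCA-reduced feature tokens for 4 context frames, plus mask
# tokens standing in for the future frame ("simple_replace"-style masking).
d, n_tokens, batch = 1152, 64, 2
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=2,
)
mask_token = nn.Parameter(torch.zeros(1, 1, d))

past = torch.randn(batch, 4 * n_tokens, d)   # features of 4 observed frames
target = torch.randn(batch, n_tokens, d)     # features of the frame to predict

# Replace the future frame's tokens with mask tokens and regress them.
masked_future = mask_token.expand(batch, n_tokens, d)
out = predictor(torch.cat([past, masked_future], dim=1))
loss = nn.SmoothL1Loss()(out[:, -n_tokens:], target)
loss.backward()
print(loss.item())
```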
You can also download the pre-trained low-resolution model from here, or via the command line:
gdown https://drive.google.com/uc?id=1BjSPVdtjFanh9_-Zr2dXBB_AU5pl0J7l
To fine-tune DINO-Foresight at high resolution (448x896) using the default hyperparameters, run the following command:
python train.py --num_workers=16 --num_workers_val=4 --num_gpus=8 --precision 16-mixed --eval_freq 10 --batch_size 1 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1 --max_epochs 20 \
--eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 --masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence" --sequence_length 5 --dinov2_variant "vitb14_reg" --d_layers 2,5,8,11 --img_size 448,896 --train_mask_mode "full_mask" \
--pca_ckpt /path/to/pca_448_l[2_5_8_11]_1152.pth \
--dst_path /logdir/dino_foresight_highres_pca_fullmask \
--ckpt /path/to/lowres/ckpt.pth --high_res_adapt
You can also download the pre-trained high-resolution model from here, or via the command line:
gdown https://drive.google.com/uc?id=1FllscBnxcZOziEcjkdbZwErjD77UdaQr
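Note that this stage drops --batch_size to 1 and adds --accum_iter 8; assuming --accum_iter means standard gradient accumulation, the effective per-GPU batch size stays 8. A generic sketch of that pattern (not the repository's training loop):

```python
import torch

model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
accum_iter = 8

for step, batch in enumerate(torch.randn(32, 4, 16).unbind(0)):
    loss = model(batch).pow(2).mean()
    (loss / accum_iter).backward()    # scale so gradients average over the window
    if (step + 1) % accum_iter == 0:  # update once per `accum_iter` micro-batches
        opt.step()
        opt.zero_grad()
```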
We provide scripts to train and evaluate a DPT head (our implementation is largely based on the DPT head of Depth Anything) on the downstream tasks of semantic segmentation, depth estimation, and surface-normal estimation. For more details about the library of DPT heads, refer to Downstream Tasks.
To evaluate feature prediction on the downstream tasks, run the commands below, one per modality.

Semantic segmentation:
python train.py --num_workers=16 --num_workers_val=4 --num_gpus=4 --precision 16-mixed --eval_freq 10 \
--batch_size 2 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1 --max_epochs 20 \
--eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 \
--masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence" --sequence_length 5 \
--dinov2_variant "vitb14_reg" --d_layers 2,5,8,11 --img_size 448,896 --train_mask_mode "full_mask" \
--pca_ckpt /path/to/dinov2_pca_448_l[2_5_8_11]_1152.pth \
--dst_path /logdir/dino_foresight_highres_pca_fullmask \
--ckpt /path/to/dinof_highres.ckpt --eval_ckpt_only \
--dpt_out_channels 128,256,512,512 --use_bn --nfeats 256 \
--head_ckpt /path/to/head_segm_pca1152.ckpt --eval_modality "segm" --num_classes 19

Depth estimation:
python train.py --num_workers=16 --num_workers_val=4 --num_gpus=4 --precision 16-mixed --eval_freq 10 \
--batch_size 2 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1 --max_epochs 20 \
--eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 \
--masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence" --sequence_length 5 \
--dinov2_variant "vitb14_reg" --d_layers 2,5,8,11 --img_size 448,896 --train_mask_mode "full_mask" \
--pca_ckpt /path/to/dinov2_pca_448_l[2_5_8_11]_1152.pth \
--dst_path /logdir/dino_foresight_highres_pca_fullmask \
--ckpt /path/to/dinof_highres.ckpt --eval_ckpt_only \
--dpt_out_channels 128,256,512,512 --use_bn --nfeats 256 \
--head_ckpt /path/to/head_depth_pca1152.ckpt --eval_modality "depth" --num_classes 256

Surface normal estimation:
python train.py --num_workers=16 --num_workers_val=4 --num_gpus=4 --precision 16-mixed --eval_freq 10 \
--batch_size 2 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1 --max_epochs 20 \
--eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 \
--masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
--random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence" --sequence_length 5 \
--dinov2_variant "vitb14_reg" --d_layers 2,5,8,11 --img_size 448,896 --train_mask_mode "full_mask" \
--pca_ckpt /path/to/dinov2_pca_448_l[2_5_8_11]_1152.pth \
--dst_path /logdir/dino_foresight_highres_pca_fullmask \
--ckpt /path/to/dinof_highres.ckpt --eval_ckpt_only \
--dpt_out_channels 128,256,512,512 --use_bn --nfeats 256 \
--head_ckpt /path/to/head_normals_pca1152.ckpt --eval_modality "surface_normals" --num_classes 3
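For reference, semantic-segmentation results in such evaluations are typically reported as mean IoU over the 19 Cityscapes classes. A minimal sketch of that metric (illustrative, not the repository's evaluation code):

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 19, ignore_index: int = 255) -> float:
    """Mean IoU over classes, ignoring pixels labeled `ignore_index`."""
    valid = gt != ignore_index
    conf = np.bincount(
        num_classes * gt[valid].astype(int) + pred[valid].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)  # rows: ground truth, cols: prediction
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    return float((inter / np.maximum(union, 1))[union > 0].mean())

# Toy usage with random label maps:
rng = np.random.default_rng(0)
print(miou(rng.integers(0, 19, (4, 128, 256)), rng.integers(0, 19, (4, 128, 256))))
```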
We provide two quick demos.
- Demo.
If you find DINO-Foresight useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us!
@article{karypidis2024dino,
title={DINO-Foresight: Looking into the Future with DINO},
author={Karypidis, Efstathios and Kakogeorgiou, Ioannis and Gidaris, Spyros and Komodakis, Nikos},
journal={arXiv preprint arXiv:2412.11673},
year={2024}
}
Our code is partially based on Maskgit-pytorch, a PyTorch implementation of MaskGIT by valeo.ai. We also thank the authors of DINOv2, DPT, Depth Anything V2, and LOTUS for their work and open-source code.