DINO-Foresight: Looking into the Future with DINO

Efstathios Karypidis (1,3), Ioannis Kakogeorgiou (1), Spyros Gidaris (2), Nikos Komodakis (1,4,5)

(1) Archimedes/Athena RC  (2) valeo.ai
(3) National Technical University of Athens  (4) University of Crete  (5) IACM-Forth

This repository contains the official implementation of the paper: DINO-Foresight: Looking into the Future with DINO

Contents

  1. News
  2. Installation
  3. Dataset Preparation
  4. Precompute PCA
  5. DINO-Foresight Training
  6. Downstream Tasks
  7. Evaluation
  8. Demo
  9. Citation
  10. Acknowledgements

News

2024-12-17: arXiv preprint and GitHub repository are released!

Installation

The code is tested with Python 3.11 and PyTorch 2.2.0+cu121 on Ubuntu 22.04.5 LTS. Create a new conda environment:

conda create -n dinof python=3.11
conda activate dinof

Install PyTorch, then clone the repository and install the remaining requirements:

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121   
git clone https://github.com/Sta8is/DINO-Foresight
cd DINO-Foresight
pip install -r requirements.txt
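
To quickly verify the install, you can check that the pinned PyTorch build sees your GPU (a sanity check, not part of the official instructions):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"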

Dataset Preparation

We use the Cityscapes dataset for our experiments. Specifically, we use the leftImg8bit_sequence_trainvaltest package to train the DINO-Foresight feature prediction model. You can download the sequence dataset from the official website.

cityscapes
│
├───leftImg8bit_sequence
│   ├───train
│   ├───val
│   ├───test
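
Each leftImg8bit_sequence snippet is a short video clip (30 frames around each annotated frame) whose files share a (city, sequence-id) prefix and differ only in the frame index, so clips can be recovered by sorting file names. A rough illustration of how 5-frame training clips could be assembled (a sketch, not the repo's actual data loader):

import glob, os
from collections import defaultdict

def build_clips(seq_root, split="train", length=5):
    """Group leftImg8bit_sequence frames into sliding windows of `length` frames.
    File names look like city_seqid_frameid_leftImg8bit.png; frames of one
    snippet share the same (city, seqid) prefix."""
    snippets = defaultdict(list)
    for path in glob.glob(os.path.join(seq_root, split, "*", "*_leftImg8bit.png")):
        city, seq_id, frame_id, _ = os.path.basename(path).split("_")
        snippets[(city, seq_id)].append((int(frame_id), path))
    clips = []
    for frames in snippets.values():
        frames = [p for _, p in sorted(frames)]
        clips += [frames[i:i + length] for i in range(len(frames) - length + 1)]
    return clips

clips = build_clips("/path/to/cityscapes/leftImg8bit_sequence")
print(len(clips), "clips of 5 frames")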

Prepare Labels for Library of Heads

For semantic/instance segmentation, we use the leftImg8bit and gtFine packages. To create targets for depth and surface normals, we use off-the-shelf networks. For more details on preparing the segmentation, depth, and surface-normal modalities, refer to Preparation of Labels for Library of Heads.
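
The exact depth/normal target pipelines are described in the linked preparation doc. Purely to illustrate the "off-the-shelf network" pattern, here is how dense depth targets could be exported with MiDaS via torch.hub; MiDaS is a stand-in here, not the network used in the paper:

import cv2
import numpy as np
import torch

# Stand-in monocular depth network (MiDaS DPT-Large) from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

img = cv2.cvtColor(cv2.imread("aachen_000000_000019_leftImg8bit.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(img))                      # (1, h, w) inverse depth
    pred = torch.nn.functional.interpolate(           # back to input resolution
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False).squeeze()
np.save("aachen_000000_000019_depth.npy", pred.numpy())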

Precompute PCA

To precompute the PCA matrices for different Vision Foundation Models (VFMs) on the Cityscapes leftImg8bit training set (2975 images), use the pca.py script. For example, to precompute a PCA with 1152 components for DINOv2 features extracted from layers 2, 5, 8 and 11 (0-indexed) at image size 448x896, run the following command:

python pca.py --feature_extractor dinov2 --layers 2,5,8,11 --image_size 448,896 --n_components 1152 --cityscapes_root /path/to/cityscapes/leftImg8bit

However, the implementation is based on scikit-learn (CPU-based) and may require a lot of RAM and time to compute the PCA matrices over the full training set. For this reason, we provide precomputed PCA checkpoints here. To download them via the command line:

gdown https://drive.google.com/uc?id=1RB_ksbvzN0TGE5HyNVKGrmbLElWu90qt
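
Conceptually, pca.py extracts the tapped DINOv2 feature maps for every training image, stacks the per-patch vectors, and fits a PCA basis that is later used to compress (and decompress) features. A minimal sketch with scikit-learn's IncrementalPCA; the torch.hub entry point and preprocessing are assumptions, and train_loader stands for any loader yielding ImageNet-normalized 448x896 batches:

import torch
from sklearn.decomposition import IncrementalPCA

# DINOv2 ViT-B/14 with registers, via the official torch.hub entry point.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg").eval().cuda()
layers = [2, 5, 8, 11]                    # 4 layers x 768 dims = 3072 per patch
ipca = IncrementalPCA(n_components=1152)  # reduce 3072 -> 1152

for imgs in train_loader:                 # (B, 3, 448, 896) tensors
    with torch.no_grad():
        feats = model.get_intermediate_layers(imgs.cuda(), n=layers)  # 4 x (B, N, 768)
    feats = torch.cat(feats, dim=-1)      # (B, N, 3072): concat across layers
    ipca.partial_fit(feats.flatten(0, 1).cpu().numpy())

torch.save({"mean": ipca.mean_, "components": ipca.components_}, "pca_1152.pth")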

DINO-Foresight Training

The training of DINO-Foresight is divided into two stages: in the first stage, we train the model at low resolution (224x448); in the second stage, we fine-tune it at high resolution (448x896).

Stage 1: Train at low resolution 224x448

To train DINO-Foresight at low resolution (224x448) with the default hyperparameters, run the following command:

python train.py --num_workers=16 --num_workers_val=4 --num_gpus=8 --precision 16-mixed --eval_freq 10 --batch_size 8 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1  --max_epochs 800 \
    --eval_mode_during_training --evaluate --single_step_sample_train --lr_base 8e-5 --loss_type SmoothL1 --masking "simple_replace" --seperable_attention --random_horizontal_flip \
    --random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence"  --sequence_length 5 --dinov2_variant "vitb14_reg" --d_layers 2,5,8,11  --train_mask_mode "full_mask" \
    --pca_ckpt "/path/to/pca_448_l[2_5_8_11]_1152.pth" \
    --dst_path /logdir/dino_foresight_lowres_pca_fullmask

You can also download the pre-trained model from here, or via the command line:

gdown https://drive.google.com/uc?id=1BjSPVdtjFanh9_-Zr2dXBB_AU5pl0J7l
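
For orientation, the training objective is masked feature prediction: PCA-compressed feature tokens of the past frames go into a transformer, the future frame's tokens are replaced by a learned mask token ("simple_replace" masking), and the model regresses the missing features with a SmoothL1 loss. A toy sketch of that objective (the real model additionally uses separable space-time attention and the full train.py machinery):

import torch
import torch.nn as nn

class TinyForesight(nn.Module):
    """Toy masked feature predictor: past tokens in, future tokens out."""
    def __init__(self, dim=1152, layers=12, heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           dropout=0.1, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)

    def forward(self, past, n_future):
        # past: (B, T_past * N, dim) PCA-compressed feature tokens
        mask = self.mask_token.expand(past.size(0), n_future, -1)
        out = self.backbone(torch.cat([past, mask], dim=1))
        return out[:, -n_future:]                 # predictions for masked slots

model = TinyForesight()
past = torch.randn(2, 4 * 64, 1152)               # 4 context frames, 64 tokens each
target = torch.randn(2, 64, 1152)                 # features of the frame to predict
loss = nn.SmoothL1Loss()(model(past, 64), target)
loss.backward()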

Stage 2: Fine-tune at high resolution 448x896

To fine-tune DINO-Foresight at high resolution (448x896) with the default hyperparameters, run the following command:

python train.py --num_workers=16 --num_workers_val=4 --num_gpus=8 --precision 16-mixed --eval_freq 10 --batch_size 1 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1  --max_epochs 20 \
    --eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 --masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
    --random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence"  --sequence_length 5 --dinov2_variant "vitb14_reg" --d_layers 2,5,8,11  --img_size 448,896  --train_mask_mode "full_mask"  \
    --pca_ckpt /path/to/pca_448_l[2_5_8_11]_1152.pth \
    --dst_path /logdir/dino_foresight_highres_pca_fullmask \
    --ckpt /path/to/lowres/ckpt.pth --high_res_adapt

You can also download the pre-trained model from here, or via the command line:

gdown https://drive.google.com/uc?id=1FllscBnxcZOziEcjkdbZwErjD77UdaQr
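
The --high_res_adapt flag initializes the high-resolution run from the low-resolution checkpoint. Moving a transformer from a 16x32 token grid (224x448, patch size 14) to a 32x64 grid (448x896) typically requires resizing learned positional embeddings; the snippet below shows the common bicubic-interpolation recipe as an illustration of what such adaptation involves, not as the repo's exact code:

import torch
import torch.nn.functional as F

def resize_pos_embed(pos, old_hw, new_hw):
    """Interpolate learned positional embeddings (1, H*W, C) to a new grid."""
    c = pos.shape[-1]
    pos = pos.reshape(1, *old_hw, c).permute(0, 3, 1, 2)   # (1, C, H, W)
    pos = F.interpolate(pos, size=new_hw, mode="bicubic", align_corners=False)
    return pos.permute(0, 2, 3, 1).reshape(1, -1, c)

low_res = torch.randn(1, 16 * 32, 1152)    # grid for 224x448 with patch size 14
high_res = resize_pos_embed(low_res, (16, 32), (32, 64))
print(high_res.shape)                      # torch.Size([1, 2048, 1152])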

Downstream Tasks

We provide scripts to train and evaluate a DPT head (our implementation is largely based on the DPT head of DepthAnything) on the downstream tasks of semantic segmentation, depth estimation and surface normal estimation. For more details about the library of DPT heads, refer to Downstream Tasks.
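
In broad strokes, a DPT-style head projects the tapped feature maps to a common width, fuses them, and ends in a task-specific output convolution whose channel count matches the task (19 Cityscapes classes, 256 depth bins, or 3 normal components; see --num_classes in the commands below). A heavily simplified stand-in, just to fix ideas; the actual head follows DepthAnything's DPT with reassemble/refine stages, --dpt_out_channels 128,256,512,512 and BatchNorm:

import torch
import torch.nn as nn

class MiniDPTHead(nn.Module):
    """Heavily simplified DPT-like head; dimensions are illustrative only."""
    def __init__(self, in_dim=768, nfeats=256, num_classes=19, n_maps=4):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(in_dim, nfeats, 3, padding=1) for _ in range(n_maps))
        self.out = nn.Conv2d(nfeats, num_classes, 1)

    def forward(self, feats):
        # feats: one (B, in_dim, H, W) map per tapped backbone layer
        x = sum(proj(f) for proj, f in zip(self.proj, feats))
        return self.out(x)   # upsampling to full resolution happens outside

head = MiniDPTHead(num_classes=19)                    # semantic segmentation
feats = [torch.randn(1, 768, 32, 64) for _ in range(4)]
print(head(feats).shape)                              # torch.Size([1, 19, 32, 64])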

Evaluation

To evaluate feature prediction on downstream tasks, run the following commands:

Semantic Segmentation

python train.py --num_workers=16 --num_workers_val=4 --num_gpus=4 --precision 16-mixed --eval_freq 10 \
    --batch_size 2 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1  --max_epochs 20 \
    --eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 \
    --masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
    --random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence"  --sequence_length 5 \
    --dinov2_variant "vitb14_reg" --d_layers 2,5,8,11  --img_size 448,896  --train_mask_mode "full_mask"  \
    --pca_ckpt /path/to/dinov2_pca_448_l[2_5_8_11]_1152.pth \
    --dst_path /logdir/dino_foresight_highres_pca_fullmask \
    --ckpt /path/to/dinof_highres.ckpt --eval_ckpt_only \
    --dpt_out_channels 128,256,512,512 --use_bn --nfeats 256 \
    --head_ckpt /path/to/head_segm_pca1152.ckpt --eval_modality "segm" --num_classes 19

Depth Estimation

python train.py --num_workers=16 --num_workers_val=4 --num_gpus=4 --precision 16-mixed --eval_freq 10 \
    --batch_size 2 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1  --max_epochs 20 \
    --eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 \
    --masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
    --random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence"  --sequence_length 5 \
    --dinov2_variant "vitb14_reg" --d_layers 2,5,8,11  --img_size 448,896  --train_mask_mode "full_mask"  \
    --pca_ckpt /path/to/dinov2_pca_448_l[2_5_8_11]_1152.pth \
    --dst_path /logdir/dino_foresight_highres_pca_fullmask \
    --ckpt /path/to/dinof_highres.ckpt --eval_ckpt_only \
    --dpt_out_channels 128,256,512,512 --use_bn --nfeats 256 \
    --head_ckpt /path/to/head_depth_pca1152.ckpt --eval_modality "depth" --num_classes 256

Surface Normal Estimation

python train.py --num_workers=16 --num_workers_val=4 --num_gpus=4 --precision 16-mixed --eval_freq 10 \
    --batch_size 2 --hidden_dim 1152 --heads 8 --layers 12 --dropout 0.1  --max_epochs 20 \
    --eval_mode_during_training --evaluate --single_step_sample_train --lr_base 1e-5 --loss_type SmoothL1 \
    --masking "simple_replace" --seperable_attention --random_horizontal_flip --accum_iter 8 \
    --random_crop --use_fc_bias --data_path="/path/to/cityscapes/leftImg8bit_sequence"  --sequence_length 5 \
    --dinov2_variant "vitb14_reg" --d_layers 2,5,8,11  --img_size 448,896  --train_mask_mode "full_mask"  \
    --pca_ckpt /path/to/dinov2_pca_448_l[2_5_8_11]_1152.pth \
    --dst_path /logdir/dino_foresight_highres_pca_fullmask \
    --ckpt /path/to/dinof_highres.ckpt --eval_ckpt_only \
    --dpt_out_channels 128,256,512,512 --use_bn --nfeats 256 \
    --head_ckpt /path/to/head_normals_pca1152.ckpt --eval_modality "surface_normals" --num_classes 3
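
For reference, the segmentation numbers are mean IoU over the 19 Cityscapes classes; the standard computation from flat label arrays looks like this (generic metric code, independent of this repo):

import numpy as np

def mean_iou(pred, gt, num_classes=19, ignore_index=255):
    """Mean IoU from flat integer label arrays (standard definition)."""
    valid = gt != ignore_index
    # Confusion matrix: rows = ground truth, columns = prediction.
    idx = gt[valid] * num_classes + pred[valid]
    conf = np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = inter / union               # NaN for classes absent from both
    return float(np.nanmean(iou))

pred = np.random.randint(0, 19, 100_000)
gt = np.random.randint(0, 19, 100_000)
print(mean_iou(pred, gt))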

Demo

We provide two quick demos.

Citation

If you found DINO-Foresight useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us!

@article{karypidis2024dino,
  title={DINO-Foresight: Looking into the Future with DINO},
  author={Karypidis, Efstathios and Kakogeorgiou, Ioannis and Gidaris, Spyros and Komodakis, Nikos},
  journal={arXiv preprint arXiv:2412.11673},
  year={2024}
}

Acknowledgements

Our code is partially based on Maskgit-pytorch, a PyTorch implementation of MaskGIT by ValeoAI. We also thank the authors of DINOv2, DPT, DepthAnythingV2 and LOTUS for their work and open-source code.
