Also see the VQ2D Quickstart Colab Notebook that walks through these instructions.

- Clone the repository from here.

  ```bash
  git clone [email protected]:EGO4D/episodic-memory.git
  cd episodic-memory/VQ2D
  export VQ2D_ROOT=$PWD
  ```
- Create conda environment.

  ```bash
  conda create -n ego4d_vq2d python=3.8
  ```
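
  Activate the environment before running the remaining installation steps (standard conda usage):

  ```bash
  conda activate ego4d_vq2d
  ```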
- Install pytorch using conda. We rely on cuda-10.2 and cudnn-7.6.5.32 for our experiments.

  ```bash
  conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cudatoolkit=10.2 -c pytorch
  ```
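
  A quick check that the install picked up CUDA support (standard PyTorch calls, nothing repo-specific):

  ```bash
  python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
  ```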
- Install additional requirements using `pip`.

  ```bash
  pip install -r requirements.txt
  ```
- Install detectron2.

  ```bash
  python -m pip install detectron2 -f \
      https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.8/index.html
  ```
- Install pytracking according to these instructions. Download the pre-trained KYS tracker weights to `$VQ2D_ROOT/pretrained_models/kys.pth`.

  ```bash
  cd $VQ2D_ROOT/dependencies
  git clone [email protected]:visionml/pytracking.git
  cd pytracking
  git checkout de9cb9bb4f8cad98604fe4b51383a1e66f1c45c0
  ```

  Note: For installing the spatial-correlation-sampler dependency for pytracking, follow these steps if the pip install fails.

  ```bash
  cd $VQ2D_ROOT/dependencies
  git clone [email protected]:ClementPinard/Pytorch-Correlation-extension.git
  cd Pytorch-Correlation-extension
  python setup.py install
  ```
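
  After this, a quick check that the tracker weights are in place and that the correlation-sampler extension imports cleanly (the module name `spatial_correlation_sampler` is the one installed by Pytorch-Correlation-extension):

  ```bash
  ls -lh $VQ2D_ROOT/pretrained_models/kys.pth
  python -c "import spatial_correlation_sampler; print('spatial-correlation-sampler OK')"
  ```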
- Create a script `~/enable_em_vq2d.sh` to set necessary environment variables and activate the conda environment.

  ```bash
  #!/usr/bin/bash

  # Add anaconda path
  export PATH="$PATH:<PATH TO anaconda3>/bin"

  # Activate conda environment
  source activate ego4d_vq2d

  CUDA_DIR=<PATH TO cuda-10.2>
  CUDNN_DIR=<PATH TO cudnn-10.2-v8.0.3>

  # Add cuda, cudnn paths
  export CUDA_HOME=$CUDA_DIR
  export CUDNN_PATH=$CUDNN_DIR/cuda/lib64/libcudnn.so
  export CUDNN_INCLUDE_DIR=$CUDNN_DIR/cuda/include
  export CUDNN_LIBRARY=$CUDNN_DIR/cuda/lib64
  export CUDACXX=$CUDA_DIR/bin/nvcc

  export VQ2D_ROOT=<PATH TO episodic-memory repo>/VQ2D
  ```
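
  Since the script activates conda and exports environment variables, it is meant to be sourced in each new shell before working with VQ2D:

  ```bash
  source ~/enable_em_vq2d.sh

  # Quick sanity checks that the variables resolve
  echo $VQ2D_ROOT
  $CUDACXX --version
  ```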
- Download the videos as instructed here to `$VQ2D_ROOT/data`.

  ```bash
  ego4d --output_directory="$VQ2D_ROOT/data" --datasets full_scale

  # Define ego4d videos directory
  export EGO4D_VIDEOS_DIR=$VQ2D_ROOT/data/v1/full_scale
  ```
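
  A quick sanity check that the videos landed in the directory the rest of the pipeline expects (the exact count and size depend on how much of the dataset was downloaded):

  ```bash
  ls "$EGO4D_VIDEOS_DIR" | wc -l
  du -sh "$EGO4D_VIDEOS_DIR"
  ```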
- Download the latest annotations to `$VQ2D_ROOT/data`. We use an updated version (v1.0.5) of the VQ2D annotations which includes fixes to a subset of the data (check details here). These primarily affect the train and val splits (not the test split). In local experiments, we find that this leads to improved baseline performance on the val split.

  ```bash
  # Download the data using the Ego4D CLI.
  ego4d --output_directory="$VQ2D_ROOT/data" --datasets annotations -y --version v2

  # Move out vq annotations to $VQ2D_ROOT/data
  mv $VQ2D_ROOT/data/v2/annotations/vq_*.json $VQ2D_ROOT/data
  ```
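
  To take a quick look at what was downloaded, the snippet below lists the annotation files and prints the top-level structure of the val split; it assumes nothing about the schema beyond the files being valid JSON:

  ```bash
  cd $VQ2D_ROOT
  ls -lh data/vq_*.json
  python -c "import json; d = json.load(open('data/vq_val.json')); print(type(d).__name__, list(d)[:3] if isinstance(d, dict) else len(d))"
  ```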
- Process the VQ dataset.

  ```bash
  python process_vq_dataset.py --annot-root data --save-root data
  ```
- Extract clips for val and test data from videos. Validate the clips once they are extracted. If validation fails, re-run the conversion script and it will correct the errors. You can optionally add a `--clip-uids <clip-uid-1> <clip-uid-2> ...` argument to specify the clips to regenerate, and you can optionally reduce the video frame resolution by specifying `--downscale-height <height>` (see the example after the commands below).

  ```bash
  # Extract clips (should take 12-24 hours on a machine with 80 CPU cores)
  python convert_videos_to_clips.py \
      --annot-paths data/vq_val.json data/vq_test_unannotated.json \
      --save-root data/clips \
      --ego4d-videos-root $EGO4D_VIDEOS_DIR \
      --num-workers 10  # Increase this for speed

  # Validate the extracted clips (should take 30 minutes)
  python tools/validate_extracted_clips.py \
      --annot-paths data/vq_val.json data/vq_test_unannotated.json \
      --clips-root data/clips
  ```
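
  For example, a hypothetical re-run that regenerates only the clips flagged by the validation script at a reduced resolution (the clip UIDs are placeholders and 480 is an illustrative height):

  ```bash
  python convert_videos_to_clips.py \
      --annot-paths data/vq_val.json data/vq_test_unannotated.json \
      --save-root data/clips \
      --ego4d-videos-root $EGO4D_VIDEOS_DIR \
      --num-workers 10 \
      --clip-uids <clip-uid-1> <clip-uid-2> \
      --downscale-height 480
  ```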
- Extract images for train and validation data from videos (only needed for training detection models).

  ```bash
  # Should take <= 6 hours on a machine with 80 CPU cores
  python convert_videos_to_images.py \
      --annot-paths data/vq_train.json data/vq_val.json \
      --save-root data/images \
      --ego4d-videos-root $EGO4D_VIDEOS_DIR \
      --num-workers 10  # Increase this for speed
  ```

Copy `scripts/train_2_gpus.sh` or `scripts/train_8_gpus.sh` to the required experiment directory and execute it.

```bash
EXPT_ROOT=<experiment path>

cp $VQ2D_ROOT/scripts/train_2_gpus.sh $EXPT_ROOT
cd $EXPT_ROOT
chmod +x train_2_gpus.sh && ./train_2_gpus.sh
```
Important note: Our training code currently supports the baseline released with the Ego4D paper. For improved training mechanisms and architectures, we recommend using code from prior challenge winners.

We split the evaluation into two steps: (1) Extracting per-frame bbox proposals and estimating their similarity to the visual query, and (2) Peak detection and bidirectional tracking to infer the response track. There are two key benefits to this separation:
- Rapid hyperparameter searches for step (2): Step (1) is the most expensive operation as it takes ~24 hours on an 8-GPU + 80-core machine. Once the detections are pre-computed, step (2) only takes ~1-2 hours on the same machine. This allowed us to release improved hyperparameters for step (2) and obtain much better results.
- Decoupling detector model from our inference code for step (2): While we support only training the baseline model from the Ego4D paper, we can support inference with arbitrary models as long as the pre-extracted detection scores are available.

Step (1): Extracting per-frame bbox proposals.

```bash
# Note: MODEL_ROOT and DETECTIONS_SAVE_ROOT must be absolute paths
MODEL_ROOT=<path to trained model>  # contains model.pth and config.yaml
DETECTIONS_SAVE_ROOT=<path to save pre-computed detections>

cd $VQ2D_ROOT

# Extract per-frame bbox proposals and visual query similarity scores
chmod +x ./scripts/extract_vq_detections.sh
./scripts/extract_vq_detections.sh val $MODEL_ROOT $DETECTIONS_SAVE_ROOT
./scripts/extract_vq_detections.sh test_unannotated $MODEL_ROOT $DETECTIONS_SAVE_ROOT
```

Step (2): Peak detection and bidirectional tracking.

```bash
./scripts/infer_vq.sh $MODEL_ROOT $DETECTIONS_SAVE_ROOT val 8 0.50 0.20
./scripts/infer_vq.sh $MODEL_ROOT $DETECTIONS_SAVE_ROOT test_unannotated 8 0.50 0.20
```
Notes:
- To reduce GPU / CPU usage, lower the value `8` in the step (2) commands based on your specific system (see the example after these notes).
- To get VQ2D evaluation results:

  ```bash
  python evaluate_vq.py --gt-file data/vq_val.json --pred-file <path to inference json>
  ```
- To participate in the challenge, submit the inference json obtained for the test_unannotated split on evalai.
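
For example, on a machine with fewer GPUs / CPU cores, step (2) could be run with a smaller value in place of `8` (which presumably controls the degree of parallelism); the `4` below is purely illustrative:

```bash
./scripts/infer_vq.sh $MODEL_ROOT $DETECTIONS_SAVE_ROOT val 4 0.50 0.20
```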

For reproducibility and convenience, we provide pre-trained models and corresponding detection scores for the SiamRCNN, ImprovedBaselines, and NegativeFramesMatter models. They can be downloaded using the ego4d CLI as follows:

```bash
python -m ego4d.cli.cli -y --output_directory /path/to/output/ --datasets vq2d_models vq2d_detections
```
The validation results for these models are shown below.

| Method | stAP @ 0.25 | stAP | tAP @ 0.25 | tAP | recall % | success % |
|---|---|---|---|---|---|---|
| SiamRCNN | 0.153 | 0.058 | 0.225 | 0.134 | 32.919 | 43.244 |
| Improved Baselines | 0.195 | 0.078 | 0.258 | 0.157 | 37.882 | 47.903 |
| Negative Frames Matter | 0.189 | 0.075 | 0.255 | 0.154 | 37.666 | 47.681 |

These were generated using the following commands:

```bash
# SiamRCNN
./scripts/infer_vq.sh $PRETRAINED_ROOT/siam_rcnn_residual $DETECTIONS_ROOT/detections_siam_rcnn_residual val 8 0.50 0.20

# Improved Baselines
./scripts/infer_vq.sh $PRETRAINED_ROOT/improved_baselines $DETECTIONS_ROOT/detections_improved_baselines val 8 0.50 0.20

# Negative Frames Matter
./scripts/infer_vq.sh $PRETRAINED_ROOT/negative_frames_matter $DETECTIONS_ROOT/detections_negative_frames_matter val 8 0.25 0.20
```
where `PRETRAINED_ROOT` and `DETECTIONS_ROOT` are the directories where the pretrained models and detections are saved, respectively. Please cite the following papers/reports for the respective models.

SiamRCNN:

```
@inproceedings{grauman2022ego4d,
  title={Ego4d: Around the world in 3,000 hours of egocentric video},
  author={Grauman, Kristen and Westbury, Andrew and Byrne, Eugene and Chavis, Zachary and Furnari, Antonino and Girdhar, Rohit and Hamburger, Jackson and Jiang, Hao and Liu, Miao and Liu, Xingyu and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18995--19012},
  year={2022}
}
```

Improved Baselines:

```
@article{xu2022negative,
  title={Negative Frames Matter in Egocentric Visual Query 2D Localization},
  author={Xu, Mengmeng and Fu, Cheng-Yang and Li, Yanghao and Ghanem, Bernard and Perez-Rua, Juan-Manuel and Xiang, Tao},
  journal={arXiv preprint arXiv:2208.01949},
  year={2022}
}
```

Negative Frames Matter:

```
@article{xu2022where,
  title={Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization},
  author={Xu, Mengmeng and Li, Yanghao and Fu, Cheng-Yang and Ghanem, Bernard and Xiang, Tao and Perez-Rua, Juan-Manuel},
  journal={arXiv preprint arXiv:2211.10528},
  year={2022},
  doi={10.48550/ARXIV.2211.10528},
  url={https://arxiv.org/abs/2211.10528}
}
```

This codebase relies on the detectron2, vq2d_cvpr, PyTracking, pfilter, and ActivityNet repositories.