Official implementation of the ECCV 2022 Oral paper: Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments
[Project Page] [Paper]
This project is modified from the VLN-CE repository starting from this commit.
- Initialize the project
git clone --recurse-submodules git@github.com:jacobkrantz/Sim2Sim-VLNCE.git
cd Sim2Sim-VLNCE
conda env create -f environment.yml
conda activate sim2sim
- Install the latest version of Matterport3DSimulator
If you do not want to run experiments with known subgoal candidates, you can skip this install and remove code references to `MatterSim`.
- Download the Matterport3D scene meshes
# run with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
# Extract to: ./data/scene_datasets/mp3d/{scene}/{scene}.glb
`download_mp.py` must be obtained from the Matterport3D project webpage.
- Download the Room-to-Room episodes in VLN-CE format (link)
gdown https://drive.google.com/uc?id=1T9SjqZWyR2PCLSXYkFckfDeIs6Un0Rjm
# Extract to: ./data/datasets/R2R_VLNCE_v1-3/{split}/{split}.json.gz
- Download the ResNet image encoder
./scripts/download_caffe_models.sh
# this populates ./data/caffe_models/
- Download the MP3D connectivity graphs
./scripts/download_connectivity.sh
# this populates ./connectivity/
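Once the downloads above have finished, a quick check that each item landed where the steps say it should can save a failed run later. The following is a minimal sketch, not part of the repository: the `EXPECTED` table and the `missing_setup_items` helper are hypothetical names, and the glob patterns are derived only from the extraction paths listed in the steps above.

```python
from pathlib import Path

# Glob patterns derived from the extraction paths in the setup steps.
# They only check that at least one matching file exists; they do not
# verify that a scene folder and its .glb file share the same name.
EXPECTED = {
    "scene meshes": "data/scene_datasets/mp3d/*/*.glb",
    "R2R episodes": "data/datasets/R2R_VLNCE_v1-3/*/*.json.gz",
    "Caffe models": "data/caffe_models/*",
    "connectivity graphs": "connectivity/*",
}


def missing_setup_items(root="."):
    """Return the names of setup items with no matching file under root."""
    root = Path(root)
    return [
        name
        for name, pattern in EXPECTED.items()
        if not any(root.glob(pattern))
    ]


if __name__ == "__main__":
    missing = missing_setup_items()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All setup items found.")
```

Run it from the repository root; an empty result means every location has at least one file, not that the downloads are complete.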
We evaluate a discrete VLN agent at various points of transfer to continuous environments. The two model components that enable this are the subgoal generation module and the navigation module, illustrated below:
This repository supports the following evaluations of Recurrent-VLN-BERT. The checkpoint to evaluate can be specified by appending `EVAL_CKPT_PATH_DIR path/to/checkpoint.pth` to the run command.
Known subgoal candidates come from the MP3D-Sim navigation graph, just like discrete VLN. The following experiments consider different policies for navigating to selected subgoals.
Teleportation: the discrete VLN task in Habitat
python run.py --exp-config sim2sim_vlnce/config/graph-teleport.yaml
Oracle policy: an A$^*$-based navigator
python run.py --exp-config sim2sim_vlnce/config/graph-oracle_policy.yaml
Local policy: a realistic map-and-plan navigator
python run.py --exp-config sim2sim_vlnce/config/graph-local_policy.yaml
Predicted subgoals from the subgoal generation module (SGM)
python run.py --exp-config sim2sim_vlnce/config/sgm-local_policy.yaml
Inference for leaderboard submissions
python run.py \
--run-type inference \
--exp-config sim2sim_vlnce/config/sgm-local_policy-inference.yaml
All experiment configs are set for a GPU with 32GB of memory. For smaller cards, consider reducing the fields `RL.POLICY.OBS_TRANSFORMS.RESNET_CANDIDATE_ENCODER.max_batch_size` and `IL.batch_size`.
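The evaluation commands differ only in their config file and in any `KEY VALUE` pairs appended after it, so they are easy to compose in a script. The sketch below is illustrative only: `build_run_command` is a hypothetical helper, the batch-size values are placeholders, and it assumes the batch-size fields can be overridden on the command line the same way `EVAL_CKPT_PATH_DIR` can. If that does not hold, edit the fields in the YAML config instead.

```python
import shlex


def build_run_command(exp_config, run_type=None, overrides=None):
    """Compose a run.py invocation; overrides are appended as KEY VALUE pairs."""
    cmd = ["python", "run.py"]
    if run_type is not None:
        cmd += ["--run-type", run_type]
    cmd += ["--exp-config", exp_config]
    for key, value in (overrides or {}).items():
        cmd += [key, str(value)]
    return cmd


cmd = build_run_command(
    "sim2sim_vlnce/config/sgm-local_policy.yaml",
    overrides={
        "EVAL_CKPT_PATH_DIR": "path/to/checkpoint.pth",
        # Placeholder values for a GPU with less memory (assumption:
        # these fields accept command-line overrides).
        "RL.POLICY.OBS_TRANSFORMS.RESNET_CANDIDATE_ENCODER.max_batch_size": 8,
        "IL.batch_size": 2,
    },
)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.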
Training Recurrent-VLN-BERT should be done in the Recurrent-VLN-BERT repository. Other panorama-based VLN agents could also be transferred with this Sim2Sim method but are not currently supported.
To train with 3D reconstruction image features, either download them from here (`habitat-ResNet-152-places365.tsv`) or generate them yourself:
# ~4.5 hours on a 32GB Tesla V100 GPU.
python scripts/precompute_features.py
[-h]
[--caffe-prototxt CAFFE_PROTOTXT]
[--caffe-model CAFFE_MODEL]
[--save-to SAVE_TO]
[--connectivity CONNECTIVITY]
[--scenes-dir SCENES_DIR]
[--batch-size BATCH_SIZE]
[--gpu-id GPU_ID]
By default, this script uses the same Caffe ResNet as MP3D-Sim. We use these features to train both the VLN agent and the SGM. They are a drop-in replacement for the image features captured in MP3D-Sim under the name `ResNet-152-places365.tsv`, as described in the MP3D-Sim README.
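Because the features are meant as a drop-in replacement, they can presumably be read the same way as the MP3D-Sim precomputed features. The sketch below assumes that layout: a headerless TSV with the columns `scanId`, `viewpointId`, `image_w`, `image_h`, `vfov`, and `features`, where `features` is a base64-encoded buffer of float32 values. The `read_features` helper is a hypothetical name, and the column layout should be confirmed against the MP3D-Sim README before relying on it.

```python
import base64
import csv
from array import array

# Assumed column layout, following the MP3D-Sim precomputed-feature convention.
TSV_FIELDNAMES = ["scanId", "viewpointId", "image_w", "image_h", "vfov", "features"]


def read_features(tsv_path):
    """Map "scanId_viewpointId" to a flat array of float32 feature values."""
    # Each features field is far larger than csv's default field size limit.
    csv.field_size_limit(2**31 - 1)
    features = {}
    with open(tsv_path, "r", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t", fieldnames=TSV_FIELDNAMES)
        for row in reader:
            values = array("f")  # 4-byte floats in native byte order
            values.frombytes(base64.b64decode(row["features"]))
            key = "{}_{}".format(row["scanId"], row["viewpointId"])
            features[key] = values
    return features
```

Each value is returned flat; reshaping it into views by feature dimension (for example with NumPy) is left to the caller, since the per-panorama shape is not stated here.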
- Collect trajectories of optimal SGM selections
python run.py \
--run-type collect \
--exp-config sim2sim_vlnce/config/collect_ftune_data.yaml
- Fine-tune the VLN agent
python run.py \
--run-type train \
--exp-config sim2sim_vlnce/config/train_vln_ftune.yaml
We use the vln-sim2real-envs repository (specifically the `/actions/` folder) to train the SGM. We use the 3D reconstruction image features described above and train with 360$^{\circ}$ vision.
VLN weights [zip]. Extracted format: ./data/models/{Model-Name}
VLN Model | Model Name | Description |
---|---|---|
1 | RecVLNBERT.pth | Published weights from Recurrent-VLN-BERT |
2 | RecVLNBERT_retrained.pth | Weights from our own retraining |
3 | RecVLNBERT-ce_vision.pth | Trained with 3D reconstruction image features |
4 | RecVLNBERT-ce_vision-tuned.pth | Row 3 fine-tuned in VLN-CE (leaderboard model) |
SGM weights [zip]. Extracted format: ./data/sgm_models/{Model-Name}
SGM Model | Model Name | Description |
---|---|---|
1 | sgm_sim2real.pth | Published weights from VLN Sim2Real |
2 | sgm_sim2sim.pth | Trained with 360$^{\circ}$ vision and 3D reconstruction image features |
Our code is MIT licensed. Trained models are considered data derived from the Matterport3D scene dataset and are distributed according to the Matterport3D Terms of Use.
1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition. Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, Jing Shao. arXiv 2022
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation. Yicong Hong, Zun Wang, Qi Wu, Stephen Gould. CVPR 2022
Waypoint Models for Instruction-guided Navigation in Continuous Environments. Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, Oleksandr Maksymets. ICCV 2021
Sim-to-Real Transfer for Vision-and-Language Navigation. Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee. CoRL 2021
@inproceedings{krantz2022sim2sim,
title={Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments},
author={Krantz, Jacob and Lee, Stefan},
booktitle={European Conference on Computer Vision (ECCV)},
year={2022}
}