Jacob Krantz*, Shurjo Banerjee*, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason
[Project Page] [Paper] [IVLN Code]
This is the official implementation of Iterative Vision-and-Language Navigation (IVLN) in continuous environments, a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent’s memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes that consist of up to 100 ordered instruction-following Room-to-Room (R2R) episodes each defined by an individual language instruction and a target path. This repository implements the Iterative Room-to-Room in Continuous Environments (IR2R-CE) benchmark.
This project is modified from the VLN-CE repository starting from this commit.
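To make the episodic vs. iterative distinction concrete, here is a schematic sketch (our own illustration with placeholder agent/env objects, not this repository's actual API; run.py and the trainer classes implement the real loop):
# Schematic contrast between episodic VLN and the IVLN paradigm (placeholder API, not this repo's).
def run_episode(agent, env, episode):
    obs, done = env.load(episode), False   # placeholder: set up the instruction and start pose
    while not done:
        obs, done = env.step(agent.act(obs))

def evaluate_episodic(agent, env, episodes):
    for episode in episodes:
        agent.reset()                      # memory erased before every episode
        run_episode(agent, env, episode)

def evaluate_iterative(agent, env, tours):
    for tour in tours:                     # a tour: up to 100 ordered R2R episodes in one scene
        agent.reset()                      # memory erased only once, at the start of the tour
        for episode in tour:               # memory (e.g., a metric map) persists across episodes
            run_episode(agent, env, episode)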
- Initialize the project
git clone --recurse-submodules [email protected]:jacobkrantz/Iterative-VLNCE.git
cd Iterative-VLNCE
conda env create -f environment.yml
conda activate ivlnce
Note: if you have runtime issues related to torch-scatter, reinstall it with the wheel matching your torch and CUDA versions. For torch 1.10.2 with CUDA 11.3, that is:
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.2+cu113.html
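To confirm the reinstalled wheel matches your local build, a quick version check (our own convenience snippet) can help:
# Sanity-check torch / CUDA / torch-scatter versions after reinstalling.
import torch
import torch_scatter

print("torch:", torch.__version__)        # e.g. 1.10.2
print("CUDA build:", torch.version.cuda)  # e.g. 11.3 -> use the +cu113 wheel
print("torch-scatter:", torch_scatter.__version__)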
- Download the Matterport3D scene meshes
# run with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
# Extract to: ./data/scene_datasets/mp3d/{scene}/{scene}.glb
download_mp.py must be obtained from the Matterport3D project webpage.
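After extraction, a quick check like the one below (a convenience snippet, not part of the repository) confirms the meshes are in the expected layout; the full Matterport3D release contains 90 scenes:
# Verify the MP3D meshes extracted to data/scene_datasets/mp3d/{scene}/{scene}.glb
from pathlib import Path

scene_dir = Path("data/scene_datasets/mp3d")
meshes = sorted(scene_dir.glob("*/*.glb"))
print(f"Found {len(meshes)} scene meshes")  # the full MP3D release has 90 scenes
missing = [d.name for d in scene_dir.iterdir()
           if d.is_dir() and not (d / f"{d.name}.glb").exists()]
print("Scenes missing a .glb:", missing or "none")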
- Download the Room-to-Room episodes in VLN-CE format (link)
gdown https://drive.google.com/uc?id=1T9SjqZWyR2PCLSXYkFckfDeIs6Un0Rjm
# Extract to: ./data/datasets/R2R_VLNCE_v1-3/{split}/{split}.json.gz
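To verify the episode files downloaded correctly, a small inspection script can be used (this assumes the standard VLN-CE layout with a top-level "episodes" list; adjust if the schema differs):
# Inspect one VLN-CE split file (assumes a top-level "episodes" list).
import gzip
import json

path = "data/datasets/R2R_VLNCE_v1-3/val_unseen/val_unseen.json.gz"
with gzip.open(path, "rt") as f:
    data = json.load(f)

episodes = data["episodes"]
print(f"{len(episodes)} episodes in {path}")
print("keys of one episode:", sorted(episodes[0].keys()))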
- Download files that define tours of episodes:
File | Download | Extract Path |
---|---|---|
Tour ordering | Link (1 MB) | data/tours.json |
Target paths for t-nDTW eval | Link (132 MB) | data/gt_ndtw.json |
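The exact structure of these files is easiest to confirm by inspecting them directly; for example (a plain JSON peek, no schema assumed):
# Peek at data/tours.json to see how episodes are grouped into tours.
import json

with open("data/tours.json") as f:
    tours = json.load(f)

print("top-level type:", type(tours).__name__)
if isinstance(tours, dict):
    key = next(iter(tours))
    print("example key:", key)
    print("example value (truncated):", str(tours[key])[:200])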
- [OPTIONAL] To run baseline models, the following weights are required:
Weights | Download | Extract Path |
---|---|---|
ResNet Depth Encoder (DDPPO-trained) | Link (745 MB) | data/ddppo-models/{model}.pth |
Semantics inference (RedNet) | Link (626 MB) | data/rednet_mp3d_best_model.pkl |
Pre-trained MapCMA models | Link (608 MB) | data/checkpoints/{model}.pth |
Pre-computed known maps | Link (78 MB) | data/known_maps/{semantic-src}/{scene}.npz |
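Once downloaded, the MapCMA checkpoints can be sanity-checked with a CPU load (a convenience snippet; no assumption is made about the checkpoint's internal layout):
# Confirm a downloaded MapCMA checkpoint loads on CPU.
import torch

ckpt = torch.load("data/checkpoints/pred_it.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", sorted(ckpt.keys()))
else:
    print("loaded object of type:", type(ckpt).__name__)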
The run.py script controls training and evaluation for all models:
python run.py \
--exp-config path/to/experiment_config.yaml \
--run-type {train | eval}
Config files exist for running each experiment detailed in the paper, both for training and for evaluation. The configs for ground-truth semantics experiments are located in ivlnce_baselines/config/map_cma/gt_semantics, and the configs for predicted semantics experiments are located in ivlnce_baselines/config/map_cma/pred_semantics. Each subfolder {episodic, iterative, known} contains configs for training and evaluating a model with that mapping method. Following the numbered order of the config .yaml files in each directory will train the model and evaluate it on all mapping modes. The unstructured memory models are covered by the configs in the ivlnce_baselines/config/latent_baselines folder.
The naming convention of pre-trained MapCMA models is [semantics]_[training].pth, where semantics is either gt (ground-truth) or pred (predicted from RedNet) and training is the map construction method: episodic (ep), iterative (it), or known (kn). Each can be evaluated with existing config files. For example, consider a model trained on predicted semantics with iterative maps (pred_it.pth). To evaluate this model in the same setting, run:
python run.py \
--run-type eval \
--exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/2_eval_iterative.yaml \
EVAL_CKPT_PATH_DIR data/checkpoints/pred_it.pth
Similarly, this model can be evaluated with known maps: run the same command, but swap 2_eval_iterative.yaml for the corresponding known-map evaluation config (see the numbered configs under ivlnce_baselines/config/map_cma/pred_semantics), keeping the same EVAL_CKPT_PATH_DIR.
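If several pre-trained checkpoints are present, a small helper like this (our own sketch, relying only on the naming convention above) decodes what each was trained with:
# Decode MapCMA checkpoint names of the form [semantics]_[training].pth
from pathlib import Path

SEMANTICS = {"gt": "ground-truth semantics", "pred": "RedNet-predicted semantics"}
TRAINING = {"ep": "episodic maps", "it": "iterative maps", "kn": "known maps"}

for ckpt in sorted(Path("data/checkpoints").glob("*_*.pth")):
    sem, train = ckpt.stem.split("_", 1)
    print(f"{ckpt.name}: {SEMANTICS.get(sem, sem)}, {TRAINING.get(train, train)}")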
You can look through the configs in ivlnce_baselines/config/map_cma to find a particular training or evaluation configuration of interest.
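A quick way to list them all (a plain filesystem walk):
# List every MapCMA experiment config.
from pathlib import Path

for cfg in sorted(Path("ivlnce_baselines/config/map_cma").rglob("*.yaml")):
    print(cfg)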
The DaggerTrainer class is the standard trainer and supports teacher forcing or dataset aggregation (DAgger) of episodic data. We also include the IterativeCollectionDAgger trainer, which builds maps iteratively and then trains agents episodically on those maps. The IterativeDAggerTrainer collects and trains models iteratively and is used to train unstructured memory models on IR2R-CE. All trainers inherit from BaseVLNCETrainer.
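Which trainer a given experiment uses is selected by its config; a quick way to check is to load the yaml and look for the trainer field (we assume a habitat-baselines-style TRAINER_NAME key here; verify the exact field name against the config files):
# Print the trainer a config selects.
# Assumption: the trainer is chosen by a habitat-baselines-style TRAINER_NAME key;
# check the yaml files under ivlnce_baselines/config/ for the exact field.
import yaml  # pip install pyyaml

cfg_path = "ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/0_train_tf.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

print(cfg.get("TRAINER_NAME", "TRAINER_NAME not found at the top level"))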
Suppose you want to train a MapCMA model from scratch with predicted semantics and iterative maps, as was done in the paper. First, train on IR2R-CE plus augmented tour data using teacher forcing:
python run.py \
--run-type train \
--exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/0_train_tf.yaml
Then, swap train for eval to evaluate each checkpoint. Take the best-performing checkpoint and fine-tune it with DAgger on the IR2R-CE tours:
python run.py \
--run-type train \
--exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/1_ftune_dagger.yaml \
IL.ckpt_to_load path/to/best/checkpoint.pth
Finally, evaluate each resulting checkpoint to find the best on the val_unseen split:
python run.py \
--run-type eval \
--exp-config ivlnce_baselines/config/map_cma/pred_semantics/iterative_maps/2_eval_iterative.yaml
While this tutorial walked through a single example, config sequences are provided for all models in the paper (both latent CMA and MapCMA).
If you find this work useful, please consider citing:
@article{krantz2022iterative,
title={Iterative Vision-and-Language Navigation},
author={Krantz, Jacob and Banerjee, Shurjo and Zhu, Wang and Corso, Jason and Anderson, Peter and Lee, Stefan and Thomason, Jesse},
journal={arXiv preprint arXiv:2210.03087},
year={2022},
}
This codebase is MIT licensed. Trained models and task datasets are considered data derived from the Matterport3D (MP3D) scene dataset. Matterport3D-based task datasets and trained models are distributed under the Matterport3D Terms of Use and the CC BY-NC-SA 3.0 US license.