
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes

This is the repository of the paper "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes" (CVPR Workshops '23).

Installation

Code

  1. Install CUDA-enabled PyTorch by following https://pytorch.org/get-started/locally/. Note that this code has been tested with PyTorch 1.9.0 and 1.10.2 + cudatoolkit 11.3 (a quick verification snippet is shown after this list).

  2. Install the remaining necessary dependencies with requirements.txt:

    pip install -r requirements.txt
  3. Compile the CUDA modules for the PointNet++ backbone by running setup.py inside lib/pointnet2/:

    cd lib/pointnet2
    python setup.py install

     (Note that this requires the full CUDA toolkit. If it fails, see Troubleshooting.)
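
Before compiling the extension, you can verify the PyTorch/CUDA setup from step 1 with a quick check like the following (a minimal sketch, not part of the repository):

    import torch

    # Versions reported as tested: PyTorch 1.9.0 / 1.10.2 with cudatoolkit 11.3.
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("Built against CUDA:", torch.version.cuda)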

Data

  1. Download the ScanQA dataset and place it under data/qa/.

  2. Download the ScanRefer dataset and unzip it under data/. To download the ScanRefer dataset, you need to fill out the ScanRefer dataset request form.

  3. Download the ScanNetV2 dataset and put scans/ under data/scannet/. To download the ScanNetV2 dataset, follow https://github.com/daveredrum/ScanRefer/blob/master/data/scannet/README.md.

  4. Generate the top-down image views for all scenes with run_generate.py (generate_top_down.py renders the top-down view for a single scene; an illustrative rendering sketch is shown after this list):

     python run_generate.py
  5. Download the PointNet++ (1x) checkpoint from https://github.com/facebookresearch/DepthContrast and store it under checkpoints/.
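
For reference, generating a top-down view of a single scene with open3d looks roughly like the sketch below; the scene ID, file names, and camera settings are illustrative assumptions, not the actual contents of generate_top_down.py:

    import open3d as o3d

    # Load the reconstructed mesh of one ScanNet scene (file name follows the standard ScanNet layout).
    mesh = o3d.io.read_triangle_mesh("data/scannet/scans/scene0000_00/scene0000_00_vh_clean_2.ply")

    vis = o3d.visualization.Visualizer()
    vis.create_window(visible=False, width=512, height=512)
    vis.add_geometry(mesh)

    # Look straight down the z-axis (ScanNet scenes are z-up) to obtain a top-down view.
    ctr = vis.get_view_control()
    ctr.set_lookat(mesh.get_center())
    ctr.set_front([0.0, 0.0, 1.0])
    ctr.set_up([0.0, 1.0, 0.0])

    vis.poll_events()
    vis.update_renderer()
    vis.capture_screen_image("data/top_imgs/scene0000_00.png")
    vis.destroy_window()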

In the end, the data/ directory should have the following structure:

data/
├── qa/
├── scannet/
│   ├── batch_load_scannet_data.py
│   ├── load_scannet_data.py
│   ├── meta_data/
│   ├── model_util_scannet.py
│   ├── scannet_data
│   ├── scannet_utils.py
│   ├── scans/
│   └── visualize.py
├── ScanRefer_filtered.*
└── top_imgs/
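
A quick way to confirm this layout (plus the checkpoints/ directory from step 5) before training, not part of the repository:

    from pathlib import Path

    # Directories expected by the data steps above; anything not yet in place is flagged.
    expected = ["data/qa", "data/scannet/scans", "data/scannet/meta_data", "data/top_imgs", "checkpoints"]
    for p in expected:
        print(f"{p:25} {'ok' if Path(p).is_dir() else 'MISSING'}")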

Usage

Pretraining

  • Execute scripts/pretrain.py:

    python scripts/pretrain.py --no_height

Training

  • Execute scripts/train.py:

    • Training with pre-trained weights:

      python scripts/train.py --no_height --tokenizer_name clip --pretrain_src <folder_name_of_ckpt_file>

      <folder_name_of_ckpt_file> is the name of the folder under outputs/ that contains the pre-training checkpoint, i.e. the timestamp plus an optional <tag_name>.

    • Training from scratch:

      python scripts/train.py --no_height --tokenizer_name clip

Inference

  • Evaluate trained models on the val split:

    python scripts/eval.py --folder <folder_name> --qa --force --tokenizer_name clip

    <folder_name> is the name of the folder under outputs/ for the trained model, i.e. the timestamp plus <tag_name>.

Troubleshooting

  • Installation of open3d fails:

    user@device:~/3D-VQA-dev$ pip install open3d
    ERROR: Could not find a version that satisfies the requirement open3d (from versions: none)
    ERROR: No matching distribution found for open3d
    • Generate the top-down images on a desktop computer; the device you run training on might not have a prebuilt open3d package available.
    • Comment out open3d in requirements.txt to skip installing it on that device.
  • Execution of lib/pointnet2/setup.py fails:

    user@device:~/3D-VQA-dev/lib/pointnet2$ python setup.py install
    OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
    • Make sure that CUDA_HOME points to your CUDA toolkit installation, e.g. export CUDA_HOME=/usr/local/cuda (a snippet to check which toolkit PyTorch sees is shown after this section).
    user@device:~/3D-VQA-dev$ python lib/pointnet2/setup.py install
    FileNotFoundError: [Errno 2] No such file or directory: '_version.py'
    • Run setup.py from inside lib/pointnet2/ (cd lib/pointnet2 first, as in the installation steps); invoked from the repository root it cannot find its auxiliary files such as _version.py.
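
To check which CUDA toolkit PyTorch's extension builder will actually pick up, a one-off snippet (not part of the repository) is:

    # CUDA_HOME is None if PyTorch's C++/CUDA extension machinery cannot locate a CUDA toolkit.
    from torch.utils.cpp_extension import CUDA_HOME
    print("CUDA_HOME:", CUDA_HOME)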

BibTeX

@inproceedings{Parelli_2023_CVPR, 
	author = {Maria Parelli and Alexandros Delitzas and Nikolas Hars and Georgios Vlassis and Sotirios Anagnostidis and Gregor Bachmann and Thomas Hofmann}, 
	title = {CLIP-Guided Vision-Language Pre-Training for Question Answering in 3D Scenes}, 
	booktitle = {Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)}, 
	year = {2023}
}

Acknowledgements

This project builds upon ATR-DBI/ScanQA and daveredrum/ScanRefer. It also makes use of openai/CLIP.
