This repo is the official implementation of the ECCV 2024 paper "In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation".
conda env create -f environment.yml --prefix $YOURPREFIX
$YOURPREFIX is typically /home/$USER/anaconda3.
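For example, assuming Anaconda lives at /home/$USER/anaconda3 and you keep environments under its envs/ directory (the environment name lavg here is illustrative), creating and activating the environment looks like:

```bash
# Illustrative prefix; any writable path works
conda env create -f environment.yml --prefix /home/$USER/anaconda3/envs/lavg
conda activate /home/$USER/anaconda3/envs/lavg
```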
This repo is built on CLIP, SCLIP, and MMSegmentation.
mim install mmcv==2.0.1 mmengine==0.8.4 mmsegmentation==1.1.1
pip install ftfy regex yapf==0.40.1
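As an optional sanity check, you can confirm that the pinned mm* packages above installed and import cleanly:

```bash
# Optional: verify the installed mmcv / mmengine / mmsegmentation versions
python -c "import mmcv, mmengine, mmseg; print(mmcv.__version__, mmengine.__version__, mmseg.__version__)"
```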
Please prepare Pascal VOC 2012, Pascal Context, COCO-Stuff164k, COCO-Object, ADEChallengeData2016, and Cityscapes following the MMSeg data preparation guide. The COCO-Object dataset can be converted from COCO-Stuff164k by executing the following command:
python datasets/cvt_coco_object.py PATH_TO_COCO_STUFF164K -o PATH_TO_COCO164K
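For instance, with the dataset kept under a single root (the paths below are illustrative; the `-o` argument is where the converted COCO-Object annotations are written):

```bash
# Illustrative paths; point the first argument at your extracted COCO-Stuff164k copy
python datasets/cvt_coco_object.py /data/datasets/coco_stuff164k -o /data/datasets/coco_stuff164k
```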
Place the datasets under the $yourdatasetroot/ directory such that:
$yourdatasetroot/
├── ADEChallengeData2016/
│ ├── annotations/
│ ├── images/
│ ├── ...
├── VOC2012/
│ ├── Annotations/
│ ├── JPEGImages/
│ ├── ...
├── coco_stuff164k/
│ ├── annotations/
│ ├── images/
│ ├── ...
├── Cityscapes/
│ ├── gtFine/
│ ├── leftImg8bit/
│ ├── ...
├── ...
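As a quick check (the root path below is illustrative), the top level of $yourdatasetroot should list exactly the directories in the tree above:

```bash
# Illustrative: set the dataset root and confirm the expected layout
export yourdatasetroot=/data/datasets
ls $yourdatasetroot   # expect ADEChallengeData2016/  Cityscapes/  VOC2012/  coco_stuff164k/
```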
cd panoptic_cut
python predict.py \
--logs panoptic_cut \
--dataset {coco_object, coco_stuff, ade20k, voc21, voc20, context60, context59, cityscapes} \
--datasetroot $yourdatasetroot
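For example, to discover masks on Pascal VOC (voc21) with the dataset root from above (path illustrative):

```bash
# Example invocation for a single benchmark; swap --dataset for any id listed above
python predict.py --logs panoptic_cut --dataset voc21 --datasetroot /data/datasets
```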
The precomputed mask predictions from panoptic mask discovery (stage 1) can be downloaded from the Google Drive links below:
mask prediction root (after stage 1) | benchmark id | Google Drive link |
---|---|---|
coco_stuff164k | coco_object, coco_stuff164k | link to download (84.5 MB) |
VOC2012 | context59, context60, voc20, voc21 | link to download (66.7 MB) |
ADEChallengeData2016 | ade20k | link to download (29.4 MB) |
Cityscapes | cityscapes | link to download (23.1 MB) |
Place them under the lavg/panoptic_cut/pred/ directory such that:
lavg/panoptic_cut/pred/panoptic_cut/
├── ADEChallengeData2016/
│ ├── ADE_val_00000001.pth
│ ├── ADE_val_00000002.pth
│ ├── ...
├── VOC2012/
│ ├── 2007_000033.pth
│ ├── 2007_000042.pth
│ ├── ...
├── coco_stuff164k/
│ ├── 000000000139.pth
│ ├── 000000000285.pth
│ ├── ...
├── Cityscapes/
│ ├── frankfurt_000000_000294_leftImg8bit.pth
│ ├── ...
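A quick sanity check after extracting the downloads (directory names taken from the tree above):

```bash
# Confirm the per-image prediction files are in place for the benchmark you plan to evaluate
ls lavg/panoptic_cut/pred/panoptic_cut/VOC2012/ | head
```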
Update $yourdatasetroot in configs/cfg_*.py.
cd lavg
python eval.py --config ./configs/{cfg_context59/cfg_context60/cfg_voc20/cfg_voc21}.py --maskpred_root VOC2012/panoptic_cut
python eval.py --config ./configs/cfg_ade20k.py --maskpred_root ADEChallengeData2016/panoptic_cut
python eval.py --config ./configs/{cfg_coco_object/cfg_coco_stuff164k}.py --maskpred_root coco_stuff164k/panoptic_cut
python eval.py --config ./configs/cfg_city_scapes.py --maskpred_root Cityscapes/panoptic_cut
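To evaluate several benchmarks that share the same mask prediction root, the per-config commands above can be looped over, e.g.:

```bash
# Run all VOC / Context evaluations against the VOC2012 mask predictions
for cfg in cfg_voc20 cfg_voc21 cfg_context59 cfg_context60; do
  python eval.py --config ./configs/${cfg}.py --maskpred_root VOC2012/panoptic_cut
done
```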
The run is single-GPU compatible.
Method | VOC21 | Context60 | COCO-obj | VOC20 | Context59 | ADE | COCO-stuff | Cityscapes |
---|---|---|---|---|---|---|---|---|
LaVG | 62.1 | 31.6 | 34.2 | 82.5 | 34.7 | 15.8 | 23.2 | 26.2 |

VOC21, Context60, and COCO-obj are evaluated with a background category; VOC20, Context59, ADE, COCO-stuff, and Cityscapes are evaluated without one.
Our project refers to and heavily borrows code from the following repos:
This work was supported by Samsung Electronics (IO201208-07822-01), the NRF grant (NRF-2021R1A2C3012728 (45%)), and the IITP grants (RS-2022-II220959: Few-Shot Learning of Causal Inference in Vision and Language for Decision Making (50%), RS-2019-II191906: AI Graduate School Program at POSTECH (5%)) funded by the Ministry of Science and ICT, Korea. We also thank Sua Choi for her helpful discussion.
If you find our code or paper useful, please consider citing our paper:
@inproceedings{kang2024lazy,
  title={In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation},
  author={Kang, Dahyun and Cho, Minsu},
  booktitle={European Conference on Computer Vision},
  year={2024}
}