Official PyTorch implementation of "Weakly Supervised Semantic Segmentation for Driving Scenes"
Weakly Supervised Semantic Segmentation for Driving Scenes
Dongseob Kim*,1 , Seungho Lee*,1 , Junsuk Choe2 , Hyunjung Shim3
1 Yonsei University, 2 Sogang University, and 3 Korea Advanced Institute of Science & Technology
* indicates an equal contribution.

Abstract

State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS) using image-level labels exhibit severe performance degradation on driving scene datasets such as Cityscapes. To address this challenge, we develop a new WSSS framework tailored to driving scene datasets. Based on extensive analysis of dataset characteristics, we employ Contrastive Language-Image Pre-training (CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key challenges: (1) pseudo-masks from CLIP lack in representing small object classes, and (2) these masks contain notable noise. We propose solutions for each issue as follows. (1) We devise Global-Local View Training that seamlessly incorporates small-scale patches during model training, thereby enhancing the model's capability to handle small-sized yet critical objects in driving scenes (e.g., traffic light). (2) We introduce Consistency-Aware Region Balancing (CARB), a novel technique that discerns reliable and noisy regions through evaluating the consistency between CLIP masks and segmentation predictions. It prioritizes reliable pixels over noisy pixels via adaptive loss weighting. Notably, the proposed method achieves 51.8% mIoU on the Cityscapes test dataset, showcasing its potential as a strong WSSS baseline on driving scene datasets. Experimental results on CamVid and WildDash2 demonstrate the effectiveness of our method across diverse datasets, even with small-scale datasets or visually challenging conditions.
21 Mar, 2024: Initial upload
Step 0. Install PyTorch and torchvision following the official instructions, e.g.,
pip install torch torchvision
# FYI, we're using torch==1.9.1 and torchvision==0.10.1
# We used docker image pytorch:1.9.1-cuda11.1-cudnn8-devel
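For an exactly reproducible environment, the versions above can be pinned explicitly; the CUDA 11.1 wheel suffix below is an assumption based on the docker image we used, so adjust it to your CUDA toolkit.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
# Pinned install matching the versions noted above (CUDA 11.1 wheels).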
Step 1. Install MMCV.
pip install mmcv-full
# FYI, we're using mmcv-full==1.4.0
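If you prefer a prebuilt wheel, MMCV provides per-torch/CUDA wheel indices; the index URL below follows MMCV's standard find-links pattern and assumes torch 1.9.x with CUDA 11.1.
pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
# Prebuilt wheel; installing without -f builds mmcv-full from source, which is much slower.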
Step 2. Install CLIP.
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
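Optionally, you can sanity-check the CLIP installation:
python -c "import clip; print(clip.available_models())"
# Should print the available backbones, e.g., RN50, RN101, ViT-B/16, ...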
Step 3. Install CARB.
git clone https://github.com/k0u-id/CARB.git
cd CARB
pip install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.
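Optionally, verify the editable install; this assumes the package is importable as mmseg, since CARB is built on mmsegmentation.
python -c "import mmseg; print(mmseg.__version__)"
# Should print the installed (editable) version without errors.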
Step 4. Additional dependencies you may need (if errors occur).
sudo apt-get install -y libgl1-mesa-glx libglib2.0-0
sudo apt-get install libmagickwand-dev
pip install yapf==0.40.1
pip install git+https://github.com/lucasb-eyer/pydensecrf.git
In our paper, we experiment with Cityscapes, CamVid, and WildDash2.
- Example directory hierarchy
CARB
|--- data
|    |--- cityscapes
|    |    |--- leftImg8bit
|    |    |--- gtFine
|    |--- camvid11
|    |    |--- img
|    |    |--- mask
|    |--- wilddash2
|    |    |--- img
|    |    |--- mask
|--- work_dirs
|    |--- output_dirs (config_name)
|    |    ...
|    ...
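If the datasets already live elsewhere on disk, symlinking them into data/ is usually enough to reproduce the layout above; the source paths below are placeholders.
mkdir -p data work_dirs
ln -s /path/to/cityscapes data/cityscapes   # expects leftImg8bit/ and gtFine/ inside
ln -s /path/to/camvid11 data/camvid11       # expects img/ and mask/ inside
ln -s /path/to/wilddash2 data/wilddash2     # expects img/ and mask/ inside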
Dataset
Pretrained Checkpoint
CARB trains the segmentation model with either a single or a dual path. You need to prepare fixed masks (pseudo-masks) for single-path training.
Step 0. Download and convert the CLIP models, e.g.,
python tools/maskclip_utils/convert_clip_weights.py --model ViT16
# Other options for model: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT32, ViT16, ViT14
Step 1. Prepare the text embeddings of the target dataset, e.g.,
python tools/maskclip_utils/prompt_engineering.py --model ViT16 --class-set city_carb
# Other options for model: RN50, RN101, RN50x4, RN50x16, ViT32, ViT16
# Other options for class-set: camvid, wilddash2
# Default option is ViT16, city_carb
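For the other datasets, switch the class set accordingly, e.g.,
python tools/maskclip_utils/prompt_engineering.py --model ViT16 --class-set camvid
python tools/maskclip_utils/prompt_engineering.py --model ViT16 --class-set wilddash2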
Train. Here, we give an example of training with multiple GPUs on a single machine.
# Please see this file for the details of execution.
# You can change the detailed configuration by editing the config files (e.g., CARB/configs/carb/cityscapes_carb_dual.py).
bash tools/train.sh
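If you prefer launching training manually, and assuming tools/train.sh wraps mmsegmentation's standard distributed launcher (tools/train.py with --launcher pytorch), an equivalent invocation might look like the following; the GPU count is only an example.
python -m torch.distributed.launch --nproc_per_node=4 tools/train.py configs/carb/cityscapes_carb_dual.py --launcher pytorch
# --nproc_per_node sets the number of GPUs on the machine.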
Test. Here, we give an example of evaluating a trained model.
# Please see this file for the details of execution.
bash tools/test.sh
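Similarly, assuming tools/test.sh wraps mmsegmentation's standard test script, a manual evaluation might look like the following; the checkpoint path is a placeholder.
python tools/test.py configs/carb/cityscapes_carb_dual.py work_dirs/cityscapes_carb_dual/latest.pth --eval mIoU
# --eval mIoU reports mean IoU on the validation split.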
This repository borrows heavily from MaskCLIP and mmsegmentation. Thanks to Chong Zhou.
If you use CARB or this code base in your work, please cite:
@inproceedings{kim2024weakly,
title={Weakly Supervised Semantic Segmentation for Driving Scenes},
author={Kim, Dongseob and Lee, Seungho and Choe, Junsuk and Shim, Hyunjung},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={3},
pages={2741--2749},
year={2024}
}