This is the official repository for "Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning" (ECCV 2024), focusing on the proposed PCM-Net framework.
- Salient Visual Concept Detection: For each input image, salient visual concepts are detected based on image-text similarity in CLIP space.
- Patch-wise Feature Fusion: Selectively fuses patch-wise visual features with textual features of salient concepts, creating a mixed-up feature map with reduced defects.
- Visual-Semantic Encoding: A visual-semantic encoder refines the feature map, which is then used by the sentence decoder for generating captions.
- CLIP-weighted Cross-Entropy Loss: A novel loss function that prioritizes high-quality image-text pairs over low-quality ones, improving training with synthetic data (see the sketch after this list).
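This README does not spell out the loss implementation, so the snippet below is only a minimal sketch of a CLIP-weighted cross-entropy under our own assumptions: per-pair CLIP similarity scores (`clip_scores`) re-weight a standard token-level cross-entropy, and the function name `clip_weighted_ce` as well as the softmax-based weighting are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch (not the repository's code) of a CLIP-weighted
# cross-entropy: token-level CE per caption, re-weighted by the CLIP
# image-text similarity of each (synthetic image, caption) pair.
import torch
import torch.nn.functional as F

def clip_weighted_ce(logits, targets, clip_scores, pad_id=0):
    """
    logits:      (B, T, V) decoder output scores
    targets:     (B, T)    ground-truth token ids
    clip_scores: (B,)      CLIP similarity of each image-text pair
    """
    # Per-token cross-entropy, ignoring padding tokens.
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
        reduction="none",
    ).view_as(targets)  # (B, T)

    mask = (targets != pad_id).float()
    per_sample_ce = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Turn CLIP scores into weights so higher-quality pairs contribute
    # more to the loss (one plausible weighting scheme, assumed here).
    weights = torch.softmax(clip_scores, dim=0) * clip_scores.size(0)

    return (weights * per_sample_ce).mean()
```

Please refer to the paper for the precise weighting function used in PCM-Net.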
- The SynthImgCap dataset is available.
- We use OpenAI-CLIP-Feature to extract visual CLIP features of the synthetic images during training and of the ground-truth real images during inference (a brief extraction sketch follows this list).
- The meta annotation data will be released soon.
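For reference, the sketch below shows one way to pull a global CLIP image embedding with the official openai/CLIP package; the backbone choice (`ViT-B/32`) and the file path are placeholders, and the actual patch-wise grid features used by PCM-Net should be produced with the OpenAI-CLIP-Feature tool mentioned above.

```python
# Minimal sketch of CLIP visual feature extraction with the official
# openai/CLIP package. This only shows the global image embedding and is
# an illustration, not the exact feature-extraction pipeline of this repo.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone is a placeholder

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)             # (1, 512) for ViT-B/32
    image_features /= image_features.norm(dim=-1, keepdim=True)
```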
Please refer to scripts/train.sh.
Please refer to scripts/final_eval_for_paper.sh.
If you use the SynthImgCap dataset, code, or models in your research, please cite:
@inproceedings{luo2024unleashing,
title = {Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning},
author = {Luo, Jianjie and Chen, Jingwen and Li, Yehao and Pan, Yingwei and Feng, Jianlin and Chao, Hongyang and Yao, Ting},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2024}
}
This code uses resources from the X-Modaler codebase and the DenseCLIP code. We thank the authors for open-sourcing their awesome projects.
This project is released under the MIT license.