This repository contains the official implementation of our paper, ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models, which has been accepted to CVPR 2025.
- We introduce Image-Object Cross-Level Trusted Intervention (ICT), a lightweight and training-free method that calculates intervention directions to shift the model's focus toward different levels of visual information, enhancing its attention to both high-level and fine-grained visual details.
- ICT is formulated as follows:
$$\boldsymbol{H}^{(l+1)} = \boldsymbol{H}^{(l)} + \sum_{n=1}^{N} \Big( \mathrm{Attn}_n^{(l)} (\boldsymbol{H}^{(l)}) + \mathbb{I}_{\text{img},n}^{(l)} \alpha \boldsymbol{S}_{n}^{(l)} + \mathbb{I}_{\text{obj},n}^{(l)} \beta \boldsymbol{S}_{\text{obj},n}^{(l)} \Big) \cdot W_o^{(l)}.$$
- ICT effectively reduces the harmful over-reliance on language priors, a major cause of hallucinations in LVLMs, while preserving the benefits of the useful ones.
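To make the formula concrete, below is a minimal PyTorch-style sketch of how a per-head shift could be added to the attention outputs before the output projection $W_o^{(l)}$. The tensor shapes and names (`head_out`, `S_img`, `S_obj`, `img_mask`, `obj_mask`) are illustrative assumptions, not the repository's actual code.

```python
import torch

def apply_ict_shift(head_out, S_img, S_obj, img_mask, obj_mask, alpha, beta):
    """Add the ICT shift to selected attention heads before the W_o projection.

    head_out : (batch, seq, num_heads, head_dim) per-head attention outputs
    S_img    : (num_heads, head_dim) image-level intervention directions
    S_obj    : (num_heads, head_dim) object-level intervention directions
    img_mask : (num_heads,) 1.0 for heads selected for the image-level shift
    obj_mask : (num_heads,) 1.0 for heads selected for the object-level shift
    """
    shift = (alpha * img_mask[:, None] * S_img
             + beta * obj_mask[:, None] * S_obj)   # (num_heads, head_dim)
    return head_out + shift                        # broadcasts over batch and seq

if __name__ == "__main__":
    B, T, H, D = 1, 8, 32, 128
    out = apply_ict_shift(torch.randn(B, T, H, D),
                          torch.randn(H, D), torch.randn(H, D),
                          torch.ones(H), torch.ones(H),
                          alpha=8.0, beta=8.0)
    print(out.shape)  # torch.Size([1, 8, 32, 128])
```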
- Image Datasets: the image datasets (e.g., COCO) required for the POPE benchmark.
- POPE Question-Answer Pairs: ensure you have the necessary question and answer files for POPE.
- Set up the environment by running:
conda env create -f environment.yml
conda activate ict
Run the following scripts to generate the different types of intervention vectors using your model and dataset (a sketch of how the saved vectors might be combined follows the commands).
python get_base_vector.py --model-path path/to/llava-v1.5 \
--question-file path/to/pope/question-file \
--image-folder path/to/your/coco/images \
--seed ${1:-55} --length 1500 \
--output ./base
python get_hallucinated_vector.py --model-path path/to/llava-v1.5 \
--question-file path/to/pope/question-file \
--image-folder path/to/your/coco/images \
--seed ${1:-55} --length 1500 \
--output ./hallucinated
python get_object_vector.py --model-path path/to/llava-v1.5 \
--question-file path/to/pope/question-file \
--image-folder path/to/your/coco/images \
--seed ${1:-55} --length 1500 \
--output ./object
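The three scripts above save head activations for the base, hallucinated, and object-level settings. A common way to turn such statistics into an intervention direction is to take the difference of mean activations; the sketch below assumes this, and the file names, tensor layout, and combination rule are hypothetical rather than the repository's actual output format.

```python
import torch

# Hypothetical combination of the saved activations: the file names, tensor
# layout, and the "base minus hallucinated" rule are assumptions for
# illustration only.
base = torch.load("./base/head_activations.pt")                  # (num_samples, num_heads, head_dim)
hallucinated = torch.load("./hallucinated/head_activations.pt")

# Direction pointing from hallucinated behaviour towards truthful behaviour.
S_img = base.mean(dim=0) - hallucinated.mean(dim=0)              # (num_heads, head_dim)
torch.save(S_img, "./base/image_level_direction.pt")
```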
python val_ict_pope.py --question_file path/to/pope/question-file \
--num_heads 256 --alpha 8 --seed ${1:-55} \
--length 1500 --target_dataset coco \
--type both
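One plausible reading of `--num_heads 256` and `--alpha 8` is that only the top-scoring heads receive the shift, scaled by `alpha` (and the object-level shift as well when `--type both`). The ranking criterion in the sketch below (direction norm) is a placeholder assumption, not the script's actual selection rule.

```python
import torch

def select_heads(S, num_heads):
    """Hypothetical head selection: keep the top `num_heads` heads, mask the rest.

    The score used here (L2 norm of each head's direction) is a placeholder;
    the actual criterion used by val_ict_pope.py may differ.
    """
    scores = S.norm(dim=-1)                     # (total_heads,)
    mask = torch.zeros(S.shape[0])
    mask[scores.topk(num_heads).indices] = 1.0
    return mask                                 # usable as img_mask / obj_mask above
```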
Evaluate the generated answers against ground truth annotations.
python eval_pope.py --gt_files path/to/groundtruth/pope/answers \
--gen_files answer.jsonl
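POPE is scored from yes/no answers, typically reporting accuracy, precision, recall, and F1. Below is a minimal sketch of that computation; the jsonl field names (`text`, `label`) are assumptions about the file format rather than the exact schema `eval_pope.py` expects.

```python
import json

def load_yes_no(path, key):
    # True if the answer contains "yes"; the field name `key` is an assumption.
    with open(path) as f:
        return ["yes" in json.loads(line)[key].lower() for line in f]

gen = load_yes_no("answer.jsonl", "text")
gt = load_yes_no("path/to/groundtruth/pope/answers", "label")

tp = sum(g and p for g, p in zip(gt, gen))
fp = sum((not g) and p for g, p in zip(gt, gen))
fn = sum(g and (not p) for g, p in zip(gt, gen))
tn = sum((not g) and (not p) for g, p in zip(gt, gen))

precision, recall = tp / (tp + fp), tp / (tp + fn)
print("accuracy :", (tp + tn) / len(gt))
print("precision:", precision)
print("recall   :", recall)
print("f1       :", 2 * precision * recall / (precision + recall))
```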
MMMU Benchmark
- To evaluate ICT on the MMMU benchmark, first clone the MMMU repository:
git clone https://github.com/MMMU-Benchmark/MMMU.git
- Then, place the necessary files in the /MMMU directory and run:
python val_ict_MMMU.py
- To evaluate 13B models on MMMU, run:
python val_ict_13b_MMMU.py
PhD Benchmark Evaluation
- To run ICT on the PhD benchmark, execute:
python val_ict_phd.py
If you find our project useful, please consider starring the repo and citing our paper:
@article{chen2024ict,
title={ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models},
author={Chen, Junzhe and Zhang, Tianshu and Huang, Shiyu and Niu, Yuwei and Zhang, Linfeng and Wen, Lijie and Hu, Xuming},
journal={arXiv preprint arXiv:2411.15268},
year={2024}
}