Code for "Boosting Visual Knowledge-Intensive Training for LVLMs through Causality-driven Visual Object Completion" (IJCAI 2025)
- Download the COCO dataset using LAVIS.
- Format the input into a JSON list. Each entry should contain:
{ "image": "image file", "text_input": "image caption" }
- Extract entities for each caption:
python cvc/data_preparation/1-0_entity_extractor.py
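The actual extraction logic lives in the script above; purely as an illustration of the idea, noun phrases can be pulled from a caption with spaCy (the model name is an assumption, and this is not necessarily the method used by the script):

```python
# Illustration only: extract candidate entities (noun chunks) from a caption with spaCy.
# The repo's 1-0_entity_extractor.py may use a different method entirely.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

caption = "A brown dog catching a frisbee on a grassy field"
doc = nlp(caption)
entities = [chunk.text for chunk in doc.noun_chunks]
print(entities)  # e.g. ['A brown dog', 'a frisbee', 'a grassy field']
```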
- Tag the causality for each entity:
python cvc/data_preparation/1-1_causality_tagger.py
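The tagging criterion is implemented by the script above. As a rough, hypothetical sketch of one way such tagging could work (prompting an LLM to judge how strongly an entity is implied by the rest of the caption), consider the following; the model name and prompt wording are assumptions, not the repo's:

```python
# Hypothetical sketch: ask an LLM to rate how strongly an entity is causally implied
# by the rest of the caption. Model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def tag_causality(caption: str, entity: str) -> str:
    prompt = (
        f'Caption: "{caption}"\nEntity: "{entity}"\n'
        "If this entity were hidden in the image, how strongly could it be "
        "inferred from the remaining visual context? Answer with high or low."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()

print(tag_causality("A brown dog catching a frisbee on a grassy field", "frisbee"))
```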
- Use GLIP to detect bounding boxes of high-causality entities. Download the GLIP checkpoint and run the following script within the GLIP repository:
python cvc/data_preparation/2-1_detect_bbox.py
- Use SAM to mask high-causality objects:
python cvc/data_preparation/2-2_segment.py
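A minimal sketch of box-prompted segmentation with the segment-anything API is shown below; the checkpoint path, image path, and box coordinates are placeholders, and the repo's script handles the actual batching and mask application:

```python
# Minimal sketch of box-prompted segmentation with the segment-anything API.
# The checkpoint path, image path, and box coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Bounding box from the GLIP step, in (x0, y0, x1, y1) pixel coordinates.
box = np.array([100, 150, 320, 400])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask and its predicted quality
```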
- Generate the specific instruction for each high-causality entity:
python cvc/data_preparation/3_instruction_generator.py
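What a generated CVC instance might look like is sketched below; the wording and fields are purely hypothetical, and the real instructions come from the script above:

```python
# Purely hypothetical template; the real instructions come from
# 3_instruction_generator.py and may be phrased differently.
def build_cvc_instance(entity: str, masked_image: str) -> dict:
    instruction = (
        "One object in this image is hidden by a mask. Based on the visible "
        "context, reason step by step and identify the masked object."
    )
    return {"image": masked_image, "instruction": instruction, "answer": entity}

print(build_cvc_instance("frisbee", "000000123456_masked.jpg"))
```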
- Sample multiple rationales (trials) for each CVC instance:
python cvc/model_training/1_cot_generator_llava.py
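As an illustration of sampling several rationales at a non-zero temperature, the Hugging Face port of LLaVA-1.5 can be driven as below; the model id, prompt, and sampling settings are assumptions, and the repo's script uses its own setup:

```python
# Illustration only: sample multiple chain-of-thought trials for one CVC instance
# with the Hugging Face LLaVA-1.5 port. Model id, prompt, and sampling settings
# are assumptions; 1_cot_generator_llava.py uses its own configuration.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("000000123456_masked.jpg")
prompt = (
    "USER: <image>\nOne object in this image is hidden by a mask. "
    "Reason step by step and tell me what the masked object is. ASSISTANT:"
)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)

outputs = model.generate(
    **inputs, do_sample=True, temperature=0.7, max_new_tokens=256, num_return_sequences=4
)
trials = processor.batch_decode(outputs, skip_special_tokens=True)
```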
- Extract the final answer from each trial:
python cvc/model_training/2_answer_extractor.py
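A trivial, hypothetical way to pull a short answer out of a free-form rationale is a pattern match on the concluding sentence; the repo's script may use different heuristics:

```python
# Hypothetical illustration of final-answer extraction from a sampled rationale;
# 2_answer_extractor.py may use different heuristics.
import re

def extract_answer(rationale: str) -> str:
    match = re.search(
        r"(?:the answer is|it is)\s+(?:a |an |the )?([\w\s-]+?)[.\n]",
        rationale,
        re.IGNORECASE,
    )
    return match.group(1).strip() if match else rationale.strip().split("\n")[-1]

print(extract_answer("The shape and tail suggest an animal. The answer is a dog."))
```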
- Verify the correctness of each trial's answer using soft matching with the BGE-M3 embedding model:
python cvc/model_training/3_answer_checker.py
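The soft-matching idea is to count an extracted answer as correct when its BGE-M3 embedding is sufficiently close to the ground-truth entity. A minimal sketch with the FlagEmbedding package follows; the 0.8 threshold is illustrative, not the value used by the script:

```python
# Sketch of soft matching: an extracted answer counts as correct when its BGE-M3
# embedding is close enough to the ground-truth entity. The threshold is illustrative.
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")

def soft_match(predicted: str, reference: str, threshold: float = 0.8) -> bool:
    vecs = model.encode([predicted, reference])["dense_vecs"]
    a, b = vecs[0], vecs[1]
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold

print(soft_match("a brown dog", "dog"))
```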
- Collect challenging successful CVC instances and construct the training data using hybrid formats. The resulting dataset is combined with the instruction data of LLaVA-1.5:
python cvc/model_training/4_hybrid_format.py
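A sketch of the final merge step is below; file names are placeholders, and the record schema shown is LLaVA's standard conversation format:

```python
# Sketch of the merge: append the CVC instances to the LLaVA-1.5 instruction data.
# File names are placeholders; both files are lists of LLaVA-style records such as
# {"id": ..., "image": ..., "conversations": [{"from": "human", "value": "<image>\n..."},
#                                             {"from": "gpt", "value": "..."}]}
import json

with open("cvc_train.json") as f:
    cvc_data = json.load(f)
with open("llava_v1_5_mix665k.json") as f:
    llava_data = json.load(f)

with open("mixed_train.json", "w") as f:
    json.dump(llava_data + cvc_data, f)
```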
- Download the pretrained checkpoint of LLaVA-1.5 and use the official LLaVA training script to train the model.
This project builds upon the excellent work of several open-source repositories. We sincerely thank the authors for their contributions:
- LLaVA: for the base LVLM architecture and training pipeline
- LAVIS: for dataset downloading
- GLIP: for object detection
Please make sure to install all required dependencies as specified in the respective repositories.