This repository is the official implementation of Forensics-Bench.
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Jin Wang*, Chenghui Lv*, Xian Li, Shichao Dong, Huadong Li, Kelu Yao, Chao Li, Wenqi Shao, Ping Luo†
* JW and CL contributed equally (primary contact: [email protected]).
† Ping Luo is the corresponding author.
2025-03-24: We released the evaluation code.
2025-03-22: We released the Forensics-Bench dataset.
2025-02-27: Our paper has been accepted to CVPR 2025!
Forensics-Bench is a new forgery detection evaluation benchmark suite designed to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, localization, and reasoning capabilities over diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC.
We use VLMEvalKit as our evaluation framework. It is a highly user-friendly framework that requires only minimal configuration to get started.
Before running the evaluation scripts, you need to configure the VLMs and correctly set the `model_paths` in `vlmeval/config.py`. After that, you can use a single script, `run.py`, to run inference and evaluate VLMs on Forensics-Bench.
Installation.
git clone https://github.com/Forensics-Bench/Forensics-Bench.git
cd Forensics-Bench
pip install -e .
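Optionally, you can sanity-check the editable install before moving on; this assumes the package is importable as `vlmeval` (as suggested by the `vlmeval/config.py` path used below):
# Quick sanity check that the editable install worked (assumes the package name is `vlmeval`)
python -c "import vlmeval; print(vlmeval.__file__)"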
Setup Keys.
To infer with API models (GPT-4o, Gemini-Pro, etc.) or use LLM APIs as the judge or choice extractor, you need to set up API keys first. If you set a key, VLMEvalKit will use a judge LLM to extract the answer from the output; otherwise it uses the exact matching mode (finding "Yes", "No", "A", "B", "C"... in the output strings). Exact matching can only be applied to the Yes-or-No tasks and the Multi-choice tasks.
- You can place the required keys in `$Forensics-Bench/.env` or directly set them as environment variables (an example of the latter is shown after this list). If you choose to create a `.env` file, its content will look like:
# The .env file, place it under $VLMEvalKit
# API Keys of Proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# LMDeploy API
LMDEPLOY_API_BASE=
# You can also set a proxy for calling api models during the evaluation stage
EVAL_PROXY=
- Fill the blanks with your API keys (if necessary). Those API keys will be automatically loaded when doing the inference and evaluation.
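As a minimal sketch of the environment-variable option, you can export the keys in your shell before running the evaluation; the values below are placeholders, not real keys:
# Export keys directly instead of using a .env file (placeholder values)
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.openai.com/v1"  # optional: custom endpoint
export GOOGLE_API_KEY="..."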
VLM Configuration: All VLMs are configured in `vlmeval/config.py`. A few legacy VLMs (like MiniGPT-4 and LLaVA-v1-7B) require additional configuration (setting the code / model_weight roots in the config file). During evaluation, use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. Make sure you can successfully infer with the VLM before starting the evaluation, using the command `vlmutil check {MODEL_NAME}`.
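For example, to check one of the supported models before evaluating it (the model name below is one of the entries used later in this README):
# Verify the VLM can run a test inference before the full evaluation
vlmutil check InternVL-Chat-V1-2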
The following VLMs require the configuration step:
Code Preparation & Installation: InstructBLIP (LAVIS), LLaVA & LLaVA-Next & Yi-VL (LLaVA), mPLUG-Owl2 (mPLUG-Owl2), DeepSeek-VL (DeepSeek-VL).
Manual Weight Preparation: For InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.
Dataset Configuration: Download the dataset ForensicsBench.tsv from Hugging Face and place it in the `Forensics-Bench/LMUData` directory.
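A minimal sketch of the placement step, assuming you have already downloaded ForensicsBench.tsv to the current directory (the source path is illustrative):
# Create the data directory and move the downloaded TSV into place
mkdir -p Forensics-Bench/LMUData
mv ./ForensicsBench.tsv Forensics-Bench/LMUData/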
Transformers Version Recommendation:
Note that some VLMs may not be able to run under certain transformers versions; we recommend the following settings to evaluate each VLM (see the example after this list for switching versions):
- Please use `transformers==4.33.0` for: Qwen series, Monkey series, InternLM-XComposer Series, mPLUG-Owl2, VisualGLM, MMAlaya, InstructBLIP series.
- Please use `transformers==4.37.0` for: LLaVA series, ShareGPT4V series, LLaVA (XTuner), CogVLM Series, Yi-VL Series, DeepSeek-VL series, InternVL series.
- Please use `transformers==latest` for: LLaVA-Next series.
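Switching to the recommended version for a given group is a single pip command run in the evaluation environment, e.g. for the 4.37.0 group:
# Example: pin transformers before evaluating a model from the 4.37.0 group (e.g. the InternVL series)
pip install transformers==4.37.0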
We use `run.py` for evaluation. To use the script, you can call `$Forensics-Bench/run.py` directly or create a soft-link of the script so that you can use it anywhere (see the example below):
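A sketch of the soft-link option, assuming you are in the repository root and `~/.local/bin` is on your PATH (adjust the target directory as needed):
# Link run.py into a directory on PATH so it can be invoked from anywhere
ln -s "$(pwd)/run.py" ~/.local/bin/run.py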
Arguments
- `--data (list[str])`: Set the dataset name `ForensicsBench`.
- `--model (list[str])`: Set the VLM names that are supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", will perform both inference and evaluation; when set to "infer", will only perform the inference.
- `--api-nproc (int, default to 4)`: The number of threads for OpenAI API calling.
- `--work-dir (str, default to '.')`: The directory to save evaluation results.
Command for Evaluating Forensics-Bench: You can run the script with `python`:
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs.
# InternVL-Chat-V1-2 on ForensicsBench, Inference and Evaluation
python run.py --data ForensicsBench --model InternVL-Chat-V1-2 --verbose
# InternVL-Chat-V1-2 on ForensicsBench, Inference only
python run.py --data ForensicsBench --model InternVL-Chat-V1-2 --verbose --mode infer
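The arguments listed above can be combined freely; for instance, a sketch that saves outputs under a custom working directory (the ./outputs path is illustrative):
# InternVL-Chat-V1-2 on ForensicsBench, writing results under ./outputs
python run.py --data ForensicsBench --model InternVL-Chat-V1-2 --work-dir ./outputs --verbose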
The evaluation results will be printed as logs. Besides, result files will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
Summary Scores: After evaluating the models, you can run the following script to view the summary scores.
python summary_scores.py --filename /path/to/your/csv
We express our sincere gratitude to the following projects:
- VLMEvalKit provides useful out-of-the-box tools and implements many advanced LVLMs. Thanks for their selfless dedication.
If you find Forensics-Bench useful in your project or research, please kindly use the following BibTeX entry to cite our paper. Thanks!
@misc{wang2025forensicsbenchcomprehensiveforgerydetection,
title={Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models},
author={Jin Wang and Chenghui Lv and Xian Li and Shichao Dong and Huadong Li and Kelu Yao and Chao Li and Wenqi Shao and Ping Luo},
year={2025},
eprint={2503.15024},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.15024},
}