DOGR

Towards Versatile Visual Document Grounding and Referring

Yinan Zhou*, Yuxin Chen*, Haokun Lin, Yichen Wu, Shuyu Yang, Li Zhu‡, Zhongang Qi‡, Chen Ma‡, Ying Shan

*Equal Contribution †Project Lead ‡Corresponding Authors

📖 arXiv 📽️ Demo Video 🐶 Project Page

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data that strengthens fundamental text localization and recognition, and instruction-tuning data that activates MLLMs' grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGR-Bench, which encompasses 7 grounding and referring tasks across 3 document types (charts, posters, and PDF documents), providing a comprehensive evaluation of fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGR. This pioneering MLLM can accurately refer to and ground text at multiple granularities within document images.

📢 News

2025-08-06: Training and inference code (this repository) released.
2025-08-06: Accepted to ICCV 2025.

📽️ Video

Watch the introduction video here!

⚙️ Installation

Our code is built on LLaVA-NeXT; please follow the installation instructions in the LLaVA-NeXT repository.

🏃🏻‍♂️‍➡️ Inference

bash scripts/eval/dogr_bench_inference_ddp.sh
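For quick single-image experiments, the underlying LLaVA-NeXT API can also be called directly. The sketch below is illustrative rather than this repository's official entry point: the weight path, the model name llava_qwen, the qwen_1_5 conversation template, and the prompt are all assumptions, so adjust them to match the released checkpoint; for batched evaluation, use the DDP script above.

import torch
from PIL import Image
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

# Hypothetical checkpoint path; point this at the released DOGR weights.
model_path = "path/to/DOGR-weights"
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, None, "llava_qwen", device_map="auto"  # model name is an assumption
)

image = Image.open("document.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(model.device, dtype=torch.float16) for t in image_tensor]

# Build a prompt containing the image placeholder token.
conv = conv_templates["qwen_1_5"].copy()  # template choice is an assumption
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nGround the title of this document.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    out = model.generate(input_ids, images=image_tensor,
                         image_sizes=[image.size], max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))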

🏃🏻‍♂️‍➡️ Eval on DOGR-Bench

bash scripts/eval/dogr_evaluation.sh
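Grounding outputs are typically scored by box IoU against the annotated text regions (with task-specific thresholds); the helper below is an illustrative sketch of that metric only, not the benchmark's official scoring, which lives in the evaluation script above.

# Illustrative IoU for axis-aligned boxes given as (x1, y1, x2, y2).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Overlap of unit-offset 2x2 boxes: intersection 1, union 7.
assert abs(box_iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-9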

😃 DOGR Weights

The model weights are available at MODEL WEIGHTS.

🖥️ Demo

python inference/demo_gradio.py
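Gradio serves locally at http://127.0.0.1:7860 by default; any host, port, or sharing options this demo exposes are defined in inference/demo_gradio.py.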

🖌️ Citation

@misc{zhou2025dogrversatilevisualdocument,
      title={DOGR: Towards Versatile Visual Document Grounding and Referring}, 
      author={Yinan Zhou and Yuxin Chen and Haokun Lin and Yichen Wu and Shuyu Yang and Zhongang Qi and Chen Ma and Li Zhu and Ying Shan},
      year={2025},
      eprint={2411.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17125}, 
}

📑 LICENSE

Please refer to our LICENSE for license details.
