- 2025/09/26: 🔥🔥🔥 We release our VideoChat-R1.5 model on Hugging Face, along with the paper and evaluation code.
- 2025/09/22: 🎉🎉🎉 Our VideoChat-R1.5 has been accepted to NeurIPS 2025.
- 2025/04/22: 🔥🔥🔥 We release our VideoChat-R1-caption on Hugging Face.
- 2025/04/14: 🔥🔥🔥 We release our VideoChat-R1 and VideoChat-R1-thinking on Hugging Face.
- 2025/04/10: 🔥🔥🔥 We release our VideoChat-R1 paper and code.
Across short-form & long-form videos, temporal grounding, video reasoning, and spatio-temporal perception, the model delivers consistently stronger results.
We adopt multi-task joint RL to strengthen the model’s spatio-temporal perception and reasoning capabilities.
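For intuition, here is a minimal sketch of the kind of rule-based rewards used in multi-task RL fine-tuning (temporal IoU for grounding, exact match for QA, plus a format reward). The function names and exact definitions are illustrative assumptions, not the repository's implementation; see training_scripts and the paper for the actual reward design.

```python
# Illustrative rule-based rewards for multi-task RL (assumptions, not the repo's actual code).

def grounding_reward(pred_span, gt_span):
    """Temporal IoU between a predicted (start, end) span and the ground truth, in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def qa_reward(pred_answer, gt_answer):
    """Exact-match accuracy reward for multiple-choice or short-answer QA."""
    return 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0

def format_reward(response):
    """Reward for following the required output format, e.g. wrapping the answer in <answer> tags."""
    return 1.0 if "<answer>" in response and "</answer>" in response else 0.0
```

In GRPO-style training, rewards like these would be computed for each sampled response and turned into group-relative advantages, with the different tasks mixed so the model is optimized jointly.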
During inference, we simulate hierarchical human attention, enabling the model to progressively localize the Region of Interest (ROI) in the input video, so that performance improves with each perception step.
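Conceptually, this inference-time loop looks like the sketch below; `localize` and `answer` are placeholder callables wrapping the underlying model, not the repository's API (see the Hugging Face README for the real pipeline).

```python
# Conceptual sketch of iterative perception at inference time.
# `localize` and `answer` are placeholders for the underlying VLM calls.

def iterative_perception(video_path, question, localize, answer, duration, num_steps=3):
    """Progressively narrow the temporal Region of Interest before answering."""
    roi = (0.0, duration)                           # step 0: attend to the whole video
    for _ in range(num_steps):
        roi = localize(video_path, question, roi)   # zoom in on the segment relevant to the question
    return answer(video_path, question, roi)        # answer from the final, focused clip
```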
Please refer to the Hugging Face model README for the steps required to perform inference.
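If the released checkpoint follows the standard Qwen2.5-VL interface in `transformers` (the interface of the base model), inference looks roughly like the sketch below. The model id is a placeholder; take the exact id and prompt format from the Hugging Face model card.

```python
# Minimal inference sketch, assuming a Qwen2.5-VL-style checkpoint.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "OpenGVLab/VideoChat-R1_7B"  # placeholder: use the exact id from the model card

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "demo.mp4", "fps": 1.0},
        {"type": "text",
         "text": "When does the person open the door? Give the start and end time in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```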
See eval_scripts and lmms-eval_videochat.
See training_scripts.
If you find this project useful in your research, please consider citing:
@article{li2025videochatr1,
  title   = {VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author  = {Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal = {arXiv preprint arXiv:2504.06958},
  year    = {2025}
}
@article{yan2025videochatr15,
  title   = {VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception},
  author  = {Yan, Ziang and Li, Xinhao and He, Yinan and Yue, Zhengrong and Zeng, Xiangyu and Wang, Yali and Qiao, Yu and Wang, Limin and Wang, Yi},
  journal = {arXiv preprint arXiv:2509.21100},
  year    = {2025}
}