git clone https://github.com/XLearning-SCU/LLaVA-ReID.git
cd LLaVA-ReID
conda create -n llava-reid python=3.10 -y
conda activate llava-reid
pip install --upgrade pip # enable PEP 660 support
pip install -r requirements.txt
# Install FlashAttention (limit jobs to avoid memory issues)
MAX_JOBS=4 pip install flash-attn --no-build-isolation
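Optionally, you can sanity-check the environment before training. The short Python sketch below only assumes that PyTorch and flash-attn were installed by the steps above:

# Optional sanity check: confirm PyTorch sees the GPUs and flash-attn imports cleanly.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn not importable:", err)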
Download the annotations of Interactive-PEDES and organize your dataset as follows:
tree -L 3
.
├── CUHK-PEDES
│   ├── caption_all.json
│   ├── imgs
│   │   ├── cam_a
│   │   ├── cam_b
│   │   ├── CUHK01
│   │   ├── CUHK03
│   │   ├── Market
│   │   ├── test_query
│   │   └── train_query
│   └── readme.txt
├── ICFG-PEDES
│   ├── ICFG-PEDES.json
│   └── imgs
│       ├── test
│       └── train
└── Interactive-PEDES_interactive_annos.json

13 directories, 4 files
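Optionally, a small Python sketch can verify the layout; the data_dir value below is a placeholder and should match the data_dir you later set in Interactive-PEDES.yaml:

# Optional sanity check of the dataset layout.
import os

data_dir = "/path/to/your/data"  # placeholder; match data_dir in Interactive-PEDES.yaml
expected = [
    "CUHK-PEDES/caption_all.json",
    "CUHK-PEDES/imgs",
    "ICFG-PEDES/ICFG-PEDES.json",
    "ICFG-PEDES/imgs",
    "Interactive-PEDES_interactive_annos.json",
]
for rel in expected:
    path = os.path.join(data_dir, rel)
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)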
If your server cannot connect to the internet, pre-download the following models to your project directory:
- LLaVA-OneVision-Qwen2-7B-ov
- its vision encoder SigLIP

so that the project folder looks like:
├── LLaVA-ReID
│   ├── llava-onevision-qwen2-7b-ov
│   ├── siglip-so400m-patch14-384
...
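If you prefer to script the download, a minimal sketch with huggingface_hub is shown below; the Hub repo IDs are assumptions (verify lmms-lab/llava-onevision-qwen2-7b-ov and google/siglip-so400m-patch14-384 on the Hugging Face Hub before running):

# Sketch: download both models into the project folder.
from huggingface_hub import snapshot_download

# Assumed Hub repo IDs -- double-check them before downloading.
snapshot_download(repo_id="lmms-lab/llava-onevision-qwen2-7b-ov",
                  local_dir="llava-onevision-qwen2-7b-ov")
snapshot_download(repo_id="google/siglip-so400m-patch14-384",
                  local_dir="siglip-so400m-patch14-384")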
Update the following fields in llava-onevision-qwen2-7b-ov/config.json so that the paths point to your local copies (keep the other fields unchanged):
{
  "_name_or_path": "/public/home/pengxi_lab/project/LLaVA-ReID/llava-onevision-qwen2-7b-ov",
  "image_aspect_ratio": "pad",
  "mm_vision_tower": "/public/home/pengxi_lab/project/LLaVA-ReID/siglip-so400m-patch14-384",
  "mm_spatial_pool_ratio": 2.0
}
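If you would rather patch the file programmatically, a minimal sketch is below; project_root is a placeholder for your own absolute path, and it assumes config.json sits next to the downloaded weights as shown above:

# Sketch: point the model and vision-tower paths in config.json to local copies.
import json, os

project_root = "/path/to/LLaVA-ReID"  # placeholder; use your absolute project path
cfg_path = os.path.join(project_root, "llava-onevision-qwen2-7b-ov", "config.json")

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["_name_or_path"] = os.path.join(project_root, "llava-onevision-qwen2-7b-ov")
cfg["mm_vision_tower"] = os.path.join(project_root, "siglip-so400m-patch14-384")
cfg["image_aspect_ratio"] = "pad"
cfg["mm_spatial_pool_ratio"] = 2.0

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)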
- Set the stage in ./config/Interactive-PEDES.yaml to train_retriever.
- Update the data_dir in Interactive-PEDES.yaml accordingly (see the snippet below for editing the config programmatically).
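Since the stage key is switched several times in the steps that follow, a small helper like the sketch below can edit the config in place. It assumes stage and data_dir are top-level keys in the YAML and that PyYAML is installed; note that safe_dump drops YAML comments, so hand-editing the file is equally fine.

# Sketch: switch the training stage (and data_dir) in the YAML config.
import yaml

cfg_path = "config/Interactive-PEDES.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# Later stages: prepare_data, train_questioner, warmup_selector, train_selector.
cfg["stage"] = "train_retriever"
cfg["data_dir"] = "/path/to/your/data"  # placeholder

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)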
Launch with SLURM:
sbatch slurm_launch.sh
Or directly:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 --master_port=9000 \
main_train.py --config_file=config/Interactive-PEDES.yaml
Set the stage in ./config/Interactive-PEDES.yaml to prepare_data and run again:
sbatch slurm_launch.sh
Set the stage in ./config/Interactive-PEDES.yaml to train_questioner, then:
srun -p gpu4090_EU --gres=gpu:4 --cpus-per-task 16 -n1 --ntasks-per-node=1 --job-name=llava-reid \
torchrun --nnodes=1 --nproc_per_node=4 --master_port=9000 train_llava_reid.py --config_file=config/Interactive-PEDES.yaml
Set the stage in ./config/Interactive-PEDES.yaml to warmup_selector and run again:
sbatch slurm_launch.sh
Set the stage in ./config/Interactive-PEDES.yaml to train_selector, update the model_path in Interactive-PEDES.yaml, then:
srun -p gpu4090_EU --gres=gpu:4 --cpus-per-task 16 -n1 --ntasks-per-node=1 --job-name=llava-reid \
torchrun --nproc_per_node=4 --master_port=9000 train_llava_reid.py --config_file=config/Interactive-PEDES.yaml
We use SGLang to accelerate the inference of the Answerer. Install SGLang and launch a Qwen2.5-7B-Instruct server, for example:
CUDA_VISIBLE_DEVICES=4 python -m sglang.launch_server --model-path Qwen2.5/Qwen2.5-7B-Instruct --port 10500 \
--host "0.0.0.0" --mem-fraction-static 0.8 --api-key Qwen-7B
Replace the base_url in model/llava_reid.py with the address of your SGLang server:
self.answer_model = AnswerGeneratorSGLang("http://192.168.49.58:10500/v1", "Qwen-7B")
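SGLang serves an OpenAI-compatible API, so you can smoke-test the Answerer server before evaluation. The sketch below reuses the host, port, and API key from the example above; the model name is assumed to match the --model-path passed to launch_server:

# Sketch: smoke-test the Answerer server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://192.168.49.58:10500/v1", api_key="Qwen-7B")
resp = client.chat.completions.create(
    model="Qwen2.5/Qwen2.5-7B-Instruct",  # assumed to match --model-path
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)

Then run the evaluation: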
CUDA_VISIBLE_DEVICES=0,1 python main_eval.py --config_file=config/Interactive-PEDES.yaml
We have uploaded the checkpoint of LLaVA-ReID; you can download it from BaiduCloud with the extraction code by2a.
If this codebase is useful for your work, please cite the following paper:
@inproceedings{lullava2025,
  title={LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification},
  author={Lu, Yiding and Yang, Mouxing and Peng, Dezhong and Hu, Peng and Lin, Yijie and Peng, Xi},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
}
Some components of this code implementation are adapted from:
- IRRA: Learning Fine-grained Relation for Text-to-Image Person Retrieval (CVPR 2023)
- LLaVA-NeXT: Excellent Open Large Multimodal Models
- SGLang: A fast serving framework for large language models and vision language models