PoseLLaVA is a multimodal framework for fine-grained human pose manipulation driven by natural language. By integrating an SMPL-based pose representation into the LLaVA architecture, PoseLLaVA aligns the language, pose, and image modalities and supports pose estimation, pose generation, and pose adjustment within a unified framework, all driven by rich textual instructions. We also release PosePart, a new dataset of paired poses and fine-grained adjustment prompts that simulate the guidance a human instructor might provide.
- Python >= 3.10
- PyTorch == 2.1.0
- CUDA Version >= 11.7
- Install the required packages:
pip install -r requirements.txt
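To quickly verify that the environment meets these requirements, you can run a small check like the one below (a convenience sketch, not part of the repository):

```python
# Sanity-check the environment against the requirements above
# (convenience sketch, not part of the repository).
import sys
import torch

assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
print("PyTorch version:", torch.__version__)      # expected: 2.1.0
print("CUDA available :", torch.cuda.is_available())
print("CUDA runtime   :", torch.version.cuda)     # expected: >= 11.7
```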
| Model Descriptions | 🤗 HF Link |
| --- | --- |
| PoseLLaVA | Huggingface |
| Dataset Descriptions | Link |
| --- | --- |
| PosePart | GoogleDrive |
The multimodal datasets for each pose sub-task are organized according to the examples below, and every part of the training dataset follows this format. In it, <image> and <pose> serve as placeholders for the two input modalities: the image and the SMPL parameters. A short script for assembling samples in this format is sketched after the examples.
Example for the Pose Generation task
[
{
"id": "135000",
"target_pose": ["Ground truth SMPL parameters.... "],
"conversations": [
{
"from": "human",
"value": "There is a person: torso is slightly leaning to the left, right arm is straight forward and hand is pointed down, left arm is down and bent up with the hand in front of chin Please output this person's SMPL pose."
},
{
"from": "gpt",
"value": "The SMPL pose of the person is <POSE>."
}
]
}
...
]
Example with image input for the Pose Estimation or Pose Adjustment task
[
{
"id": "135000",
"image": "S8_WalkDog_1.55011271_000026.jpg",
"target_pose": ["Ground truth SMPL parameters.... "],
"conversations": [
{
"from": "human",
"value": "<image> + pose estimation/adjustment instruction"
},
{
"from": "gpt",
"value": "Sure, it is <POSE>."
}
]
}
...
]
Example with SMPL-parameter input for the Pose Adjustment task
[
{
"id": "135000",
"input_pose": ["Initialized SMPL parameters.... "],
"target_pose": ["Ground truth SMPL parameters.... "],
"conversations": [
{
"from": "human",
"value": "<pose> Please peruse the description below. Extend your right arm to the right and back while keeping your right upper arm flat, with straight knees and feet shoulder-width apart. Start with the pose and use the textual description to generate the corresponding adjusted SMPL pose."
},
{
"from": "gpt",
"value": "The SMPL pose of the person is <POSE>."
}
]
}
...
]
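For reference, the sketch below shows one way to assemble a pose-adjustment sample in the format above. It is not part of the released code: the helper name and the way the 72 SMPL pose values (24 joints × 3 axis-angle components) are serialized into the pose strings are assumptions and should be matched to your own preprocessing.

```python
# Minimal sketch (not the authors' tooling) for building one pose-adjustment
# sample in the format shown above. The serialization of the 72 SMPL pose
# values (24 joints x 3 axis-angle components) into strings is an assumption.
import json
import numpy as np

def make_adjustment_sample(sample_id, input_pose, target_pose, instruction):
    """input_pose / target_pose: arrays of 72 SMPL pose parameters."""
    to_str = lambda pose: " ".join(f"{v:.4f}" for v in pose)
    return {
        "id": str(sample_id),
        "input_pose": [to_str(input_pose)],
        "target_pose": [to_str(target_pose)],
        "conversations": [
            {"from": "human", "value": "<pose> " + instruction},
            {"from": "gpt", "value": "The SMPL pose of the person is <POSE>."},
        ],
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = make_adjustment_sample(
        135000,
        rng.standard_normal(72),
        rng.standard_normal(72),
        "Extend your right arm to the right and back while keeping your right upper arm flat.",
    )
    print(json.dumps([sample], indent=2))
```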
The training datasets for all tasks are configured in a single JSON file, ds_config/train_ds.json. The root field specifies the directory of the image files, while SMPLpose is left empty. The annotation field gives the path to the training files, and repeat_time and length are used together to control the ratio of training data for each part.
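As a concrete illustration, the snippet below writes a hypothetical ds_config/train_ds.json with the fields described above. Only the field names (root, SMPLpose, annotation, repeat_time, length) come from this README; the per-task keys, paths, and numbers are placeholders to replace with your own data.

```python
# Hypothetical example of ds_config/train_ds.json (a sketch, not the shipped file).
# Field names follow the description above; task keys, paths, and numbers are
# placeholders.
import json
import os

train_ds = {
    "pose_generation": {
        "root": "",                                       # no images for this sub-task
        "SMPLpose": "",                                    # left empty
        "annotation": "data/pose_generation_train.json",   # path to the training file
        "repeat_time": 1,                                  # with `length`, sets the share
        "length": 50000,                                   # of training data for this part
    },
    "pose_estimation": {
        "root": "data/h36m_images/",                       # directory of the image files
        "SMPLpose": "",
        "annotation": "data/pose_estimation_train.json",
        "repeat_time": 2,
        "length": 25000,
    },
}

os.makedirs("ds_config", exist_ok=True)
with open("ds_config/train_ds.json", "w") as f:
    json.dump(train_ds, f, indent=2)
```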
Set the parameters in the script shell/posellava_sft_lora.sh, where meta_path is the path to the training dataset configuration file (e.g., ds_config/train_ds.json). To run the training script, use the following command:
bash shell/posellava_sft_lora.sh
Similar to the training setup, configure the evaluation datasets for each sub-task in ds_config/eval_ds.json. In the evaluation script, set the path to the fine-tuned model weights. To run the evaluation script, use the following command:
bash shell/run_inference.sh
We provide a Gradio-based web demo (a demo video is located at papers/posellava_demo.mp4). The demo loads the trained model and visually demonstrates its performance. Note that the web demo does not accept SMPL parameter input, since entering 72 SMPL parameters in the frontend is not practical; however, the underlying model does support it, and you can test SMPL input using the evaluation script. Configure the IP and port in the config file (gradio_demos/api/config.py; a sketch of this file is given after the commands below) and run the following command:
python gradio_demos/api/posechat_server.py
To run the web UI, use the following command:
bash shell/posellava_demo.sh
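For reference, the server config only needs to expose the address the demo binds to. A minimal sketch of gradio_demos/api/config.py is shown below; the variable names and values are assumptions, so check the shipped file for the exact ones.

```python
# gradio_demos/api/config.py -- minimal sketch; the actual variable names in the
# shipped file may differ. Set these to the host and port the web UI should reach.
IP = "0.0.0.0"   # interface the PoseChat server listens on (hypothetical name/value)
PORT = 8080      # port the Gradio web UI connects to (hypothetical name/value)
```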
If you find our work useful for your research, please consider citing the paper:
@inproceedings{feng2025posellava,
title={PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation},
author={Feng, Dong and Guo, Ping and Peng, Encheng and Zhu, Mingmin and Yu, Wenhao and Wang, Peng},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={3},
pages={2951--2959},
year={2025}
}
This project was built with reference to the code of the following projects: lmms-finetune, PoseGPT, LLaVA, and InternVL. We are grateful to their authors for releasing their models and code as open-source contributions.