PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation


PoseLLaVA teaser image showing fine-grained 3D pose manipulation

PoseLLaVA is a multimodal framework for fine-grained human pose manipulation driven by natural language. By integrating SMPL-based pose representations into the LLaVA architecture, it enables detailed pose control and precise alignment across the language, pose, and image modalities. PoseLLaVA supports pose estimation, pose generation, and pose adjustment within a unified framework, all guided by rich textual instructions. We also release PosePart, a new dataset of paired poses and fine-grained adjustment prompts that simulate the guidance a human instructor might provide.

Getting Started

Requirements and Installation

  • Python >= 3.10
  • PyTorch == 2.1.0
  • CUDA Version >= 11.7
  • Install required packages:
pip install -r requirements.txt

Model Download

The PoseLLaVA model weights are available on 🤗 Hugging Face.

PosePart Dataset

The PosePart dataset is available on Google Drive.

Training

Dataset Preparation

Organize the multimodal datasets for each pose sub-task according to the examples below, following this format for each part of the training dataset. In this format, <image> and <pose> serve as placeholders for the two input modalities: the image and the SMPL parameters.

Example for Pose Generation task
[
  {
    "id": "135000",
    "target_pose": ["Ground truth SMPL parameters.... "],
    "conversations": [
      {
        "from": "human",
        "value": "There is a person: torso is slightly leaning to the left, right arm is straight forward and hand is pointed down, left arm is down and bent up with the hand in front of chin Please output this person's SMPL pose."
      },
      {
        "from": "gpt",
        "value": "The SMPL pose of the person is <POSE>."
      }
    ]
  }
  ...
]
Example with image input for the Pose Estimation or Pose Adjustment task
[
  {
    "id": "135000",
    "image": "S8_WalkDog_1.55011271_000026.jpg",
    "target_pose": ["Ground truth SMPL parameters.... "],
    "conversations": [
      {
        "from": "human",
        "value": "<image> + pose estimation/adjustment instruction"
      },
      {
        "from": "gpt",
        "value": "Sure, it is <POSE>."
      }
    ]
  }
  ...
]
Example with SMPL-parameter input for the Pose Adjustment task
[
  {
    "id": "135000",
    "input_pose": ["Initialized SMPL parameters.... "],
    "target_pose": ["Ground truth SMPL parameters.... "],
    "conversations": [
      {
        "from": "human",
        "value": "<pose> Please peruse the description below. Extend your right arm to the right and back while keeping your right upper arm flat, with straight knees and feet shoulder-width apart. Start with the pose and use the textual description to generate the corresponding adjusted SMPL pose."
      },
      {
        "from": "gpt",
        "value": "The SMPL pose of the person is <POSE>."
      }
    ]
  }
  ...
]
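
Before training, these annotation files can be spot-checked with a short Python snippet. The sketch below only assumes the field names shown in the examples above; the file name is a placeholder, and the image and input_pose fields appear only in the corresponding sub-task variants.

import json

# Load one annotation file (placeholder path; substitute your own file).
with open("pose_generation_train.json") as f:
    samples = json.load(f)

for sample in samples:
    # Every sample carries an id, the ground-truth SMPL parameters, and a dialogue.
    assert "id" in sample and "target_pose" in sample and "conversations" in sample
    # "image" and "input_pose" are optional and depend on the sub-task.
    for turn in sample["conversations"]:
        assert turn["from"] in ("human", "gpt")
        assert isinstance(turn["value"], str)

print(f"{len(samples)} samples look well-formed")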

Training Dataset Configuration

We configure the training datasets for all tasks in a single JSON file, ds_config/train_ds.json. The root field specifies the directory containing the image files, while SMPLpose is left empty. The annotation field gives the path to the training annotation file, and repeat_time and length can be used together to control the ratio of training data drawn from each part.
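
As a minimal illustration (the task name and paths below are placeholders; only the field names root, SMPLpose, annotation, repeat_time, and length come from the description above), such a configuration could be generated from Python like this:

import json
import os

# Hypothetical entry for one sub-task; add one entry per pose sub-task.
train_ds = {
    "pose_generation": {                                   # placeholder task name
        "root": "data/images/",                            # directory of the image files
        "SMPLpose": "",                                     # left empty
        "annotation": "data/pose_generation_train.json",   # path to the training file
        "repeat_time": 1,                                   # together with "length", sets
        "length": 135000                                    # this part's share of the data
    }
}

os.makedirs("ds_config", exist_ok=True)
with open("ds_config/train_ds.json", "w") as f:
    json.dump(train_ds, f, indent=2)

The evaluation configuration in ds_config/eval_ds.json (see Evaluation below) is set up in the same per-task fashion.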

Finetune with LoRA

Set the parameters in the script shell/posellava_sft_lora.sh, where meta_path is the path to the training dataset configuration file. To run the training script, use the following command:

bash shell/posellava_sft_lora.sh

Evaluation

Similar to the training setup, configure the evaluation datasets for each sub-task in the ds_config/eval_ds.json JSON file. In the evaluation script, set the path to the fine-tuned model weights. To run the evaluation script, use the following command:

bash shell/run_inference.sh

Demo

We provide a Gradio-based web demo (demo video located at papers/posellava_demo.mp4). The demo loads the trained model and visually demonstrates its performance. Note that the web demo does not accept SMPL parameters as input, since entering 72 SMPL parameters in the frontend is not practical; however, our VLLM does support this input, and you can test the SMPL-input setting with the evaluation script.
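
For context, a single SMPL pose in the 72-parameter convention is 24 joints × 3 axis-angle values (including the global root orientation), which is why entering one by hand in a web form is impractical. A tiny illustrative sketch (NumPy assumed; the joint index chosen is only an example):

import numpy as np

N_JOINTS = 24
pose = np.zeros(N_JOINTS * 3)              # 72 SMPL pose parameters, rest pose
pose[0:3] = [0.0, 0.0, 0.0]                # joint 0: global (root) orientation
pose[19 * 3:19 * 3 + 3] = [0.0, 0.0, 0.5]  # rotate one joint (index illustrative)
print(pose.shape)                          # (72,)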

Launch a model server

Configure the IP and port in the config file (gradio_demos/api/config.py), then run the following command:

python gradio_demos/api/posechat_server.py

Gradio Web UI

To run the web UI, use the following command:

bash shell/posellava_demo.sh

Citation

If you find our work useful for your research, please consider citing the paper:

@inproceedings{feng2025posellava,
  title={PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation},
  author={Feng, Dong and Guo, Ping and Peng, Encheng and Zhu, Mingmin and Yu, Wenhao and Wang, Peng},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={3},
  pages={2951--2959},
  year={2025}
}

Acknowledgments

This project was built with reference to the code of the following projects: lmms-finetune, PoseGPT, LLaVA, and InternVL. We are grateful to their authors for releasing their models and code as open source.
