Kyungmin Lee*, Sibeen Kim*, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, and Jaegul Choo
DAVIAN Robotics, KAIST AI
arXiv 2025. (* indicates equal contribution)
PHUMA leverages large-scale human motion data while overcoming physical artifacts through careful data curation and physics-constrained retargeting to create a high-quality humanoid locomotion dataset.
- Python 3.9
- CUDA 12.4 (recommended)
- Conda package manager
- Clone the repository:

  ```bash
  git clone https://github.com/DAVIAN-Robotics/PHUMA.git
  cd PHUMA
  ```

- Set up the environment:

  ```bash
  conda create -n phuma python=3.9 -y
  conda activate phuma
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  pip install -e .
  ```
Our physics-aware curation pipeline filters out problematic motions from human motion data to ensure physical plausibility.
We begin with the Humanoid-X collection as described in our paper. For more details, refer to the Humanoid-X repository. If you want to reproduce the PHUMA dataset, a practical starting point is Motion-X, which provides excellent documentation on SMPL-X pose data collection.
1) Preprocess SMPL-X Data Format
Motion-X produces SMPL-X data in (N, 322) format, but PHUMA requires the (N, 69) format, which keeps only the body pose and drops the face, hands, etc. If you're processing Motion-X data, you can convert it using our preprocessing script:
This script will:
- Recursively find all `.npy` files in the input folder
- Convert Motion-X format (N, 322) to PHUMA format (N, 69) by extracting `[transl, global_orient, body_pose]`
- Preserve the directory structure (e.g., `aist/subset_0008/`) in the output folder
```bash
# Point --human_pose_folder at a Motion-X subfolder, e.g., motionx_folder_path/humanml
python src/curation/preprocess_motionx_format.py \
  --human_pose_folder /path_to_motionx_folder/subfolder \
  --output_dir data/human_pose
```
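If you want to adapt the conversion to your own pipeline, the core operation is a slice-and-concatenate over the Motion-X channel layout. The sketch below is a minimal illustration assuming Motion-X's documented 322-dim ordering (root orientation, body pose, hands, jaw, face, translation, betas); the actual `preprocess_motionx_format.py` may differ in details.

```python
import numpy as np

def motionx_to_phuma(motion_322: np.ndarray) -> np.ndarray:
    """Convert an (N, 322) Motion-X SMPL-X array into the (N, 69) PHUMA body-only format.

    Index layout assumed from Motion-X's documentation:
      [0:3]     root_orient, [3:66]    pose_body (21 joints x 3),
      [66:156]  pose_hand,   [156:159] pose_jaw,
      [159:209] face_expr,   [209:309] face_shape,
      [309:312] trans,       [312:322] betas
    PHUMA keeps [transl, global_orient, body_pose] in that order (3 + 3 + 63 = 69).
    """
    global_orient = motion_322[:, 0:3]
    body_pose = motion_322[:, 3:66]
    transl = motion_322[:, 309:312]
    return np.concatenate([transl, global_orient, body_pose], axis=1)

# Example: convert a single Motion-X clip (hypothetical file name)
# pose_69 = motionx_to_phuma(np.load("/path_to_motionx_folder/humanml/clip.npy"))
```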
2) Download SMPL-X Models
Before running the curation pipeline, you need to download the SMPL-X model files:
- Visit the SMPL-X official website
- Register and download the following files:
  - `SMPLX_FEMALE.npz` and `SMPLX_FEMALE.pkl`
  - `SMPLX_MALE.npz` and `SMPLX_MALE.pkl`
  - `SMPLX_NEUTRAL.npz` and `SMPLX_NEUTRAL.pkl`
- Place all downloaded files in the `asset/human_model/smplx/` directory
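To confirm the files above are in the expected location before running the curation scripts, you can try loading the neutral model with the `smplx` Python package. This is only a sanity-check sketch; PHUMA's own loading code may pass different options.

```python
import smplx  # pip install smplx

# Sanity check: load the neutral SMPL-X model from asset/human_model/smplx/.
# Assumes the standard `smplx` package API; this is not PHUMA's internal loader.
model = smplx.create(
    model_path="asset/human_model",  # parent folder that contains the smplx/ subdirectory
    model_type="smplx",
    gender="neutral",
    ext="npz",
)
print(model)  # prints the model summary if the files were found
```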
The default thresholds are tuned to preserve motions with airborne phases (e.g., jumping) while filtering out physically implausible motions. This means some motions in PHUMA may contain minor penetration or floating artifacts. If you need stricter filtering for specific locomotion types (e.g., walking only), you can adjust the thresholds:
Single File Version
- For single file:
```bash
# Set your project directory
PROJECT_DIR="[REPLACE_WITH_YOUR_WORKING_DIRECTORY]/PHUMA"
cd $PROJECT_DIR

# We provide an example clip: data/human_pose/example/kick.npy
human_pose_file="example/kick"

# --foot_contact_threshold defaults to 0.6; increase it to filter out more floating/penetration
python src/curation/preprocess_smplx.py \
  --project_dir $PROJECT_DIR \
  --human_pose_file $human_pose_file \
  --foot_contact_threshold 0.8 \
  --visualize 0
```
- For folder:
```bash
# Set your project directory
PROJECT_DIR="[REPLACE_WITH_YOUR_WORKING_DIRECTORY]/PHUMA"
cd $PROJECT_DIR

human_pose_folder='data/human_pose/example'

# --foot_contact_threshold defaults to 0.6; increase it to filter out more floating/penetration
python src/curation/preprocess_smplx_folder.py \
  --project_dir $PROJECT_DIR \
  --human_pose_folder $human_pose_folder \
  --foot_contact_threshold 0.8 \
  --visualize 0
```
Output Details
- Preprocessed motion chunks: `example/kick_chunk_0000.npy` and `example/kick_chunk_0001.npy` under `data/human_pose_preprocessed/`
- If you set `--visualize 1`, the script will also save `example/kick_chunk_0000.mp4` and `example/kick_chunk_0001.mp4` under `data/video/human_pose_preprocessed/`
For a complete list of tunable parameters, see src/curation/preprocess_smplx.py.
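If you want to inspect the preprocessed chunks programmatically, a minimal sketch is shown below. It does not assume a specific internal layout; it just loads the file and reports what it holds, since the exact contents are defined by `preprocess_smplx.py`.

```python
import numpy as np

# Inspect one preprocessed chunk produced by the curation step.
# The exact contents are defined by preprocess_smplx.py; this sketch only
# loads the file and reports its structure before you build on top of it.
chunk = np.load(
    "data/human_pose_preprocessed/example/kick_chunk_0000.npy", allow_pickle=True
)
if chunk.dtype == object:  # saved as a dict of arrays
    data = chunk.item()
    print({k: getattr(v, "shape", v) for k, v in data.items()})
else:  # saved as a plain pose array
    print(chunk.shape)
```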
To address artifacts introduced during the retargeting process, we employ PhySINK, our physics-constrained retargeting method that adapts curated human motion to humanoid robots while enforcing physical plausibility.
```bash
# Find the SMPL-X shape that best fits a given humanoid robot
# This process only needs to be done once and can be reused for all motion files
python src/retarget/shape_adaptation.py \
  --project_dir $PROJECT_DIR \
  --robot_name g1
```
Output: Shape parameters saved to `asset/humanoid_model/g1/betas.npy`
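The saved betas are just a small vector of SMPL-X shape parameters, so inspecting or reusing them is a one-liner. A minimal sketch, assuming the fixed output path above:

```python
import numpy as np

# Load the SMPL-X shape parameters fitted to the G1 robot (written by shape_adaptation.py).
betas = np.load("asset/humanoid_model/g1/betas.npy")
print(betas.shape, betas)  # a small vector (or 1 x num_betas array) of shape coefficients
```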
This step retargets human motion to robot motion using PhySINK optimization. You can process either a single file or an entire folder.
Single File Version
- For single file:
```bash
# Using the curated data from the previous step for Unitree G1 humanoid robot
human_pose_preprocessed_file="example/kick_chunk_0000"
python src/retarget/motion_adaptation.py \
  --project_dir $PROJECT_DIR \
  --robot_name g1 \
  --human_pose_file $human_pose_preprocessed_file
```
- For folder (with multiprocessing support):
```bash
human_pose_preprocessed_folder="data/human_pose_preprocessed/example"
python src/retarget/motion_adaptation_multiprocess.py \
  --project_dir $PROJECT_DIR \
  --robot_name g1 \
  --human_pose_folder $human_pose_preprocessed_folder \
  --gpu_ids 0,1,2,3 \
  --processes_per_gpu 2
```
Details
Multiprocessing Parameters:
- `--gpu_ids`: Comma-separated GPU IDs (e.g., `0,1,2,3`). If not specified, uses `--device` (default: `cuda:0`).
- `--processes_per_gpu`: Number of parallel processes per GPU (default: 1).
  - Recommended: 1-2 for RTX 3090 (24GB), 2-4 for A100 (40GB+)
  - Total workers = `len(gpu_ids) × processes_per_gpu`
  - Example: `--gpu_ids 0,1,2,3 --processes_per_gpu 2` → 8 workers total
- `--num_workers`: Manual override for total number of workers (default: auto-calculated from GPU settings)
  - Use `-1` to use all available CPU cores (for CPU-only processing)
Additional Options:
- `--visualize`: Set to `1` to generate visualization videos (default: `0`)
- `--fps`: Frame rate for output videos (default: `30`)
- `--num_iter_dof`: Number of optimization iterations (default: `3001`)
- `--lr_dof`: Learning rate for DOF optimization (default: `0.005`)
- See `python src/retarget/motion_adaptation_multiprocess.py --help` for all available options
Output:
- Retargeted humanoid motion data: `data/humanoid_pose/g1/kick_chunk_0000.npy`
  - Format: Dictionary containing `root_trans`, `root_ori`, `dof_pos`, and `fps`
- If you set `--visualize 1`, the script will also save `data/video/humanoid_pose/g1/kick_chunk_0000.mp4`
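A minimal sketch for reading one retargeted clip back into Python. The key names come from the format description above; the array shapes noted in the comments are assumptions, so verify them against your own output.

```python
import numpy as np

# Load a retargeted clip saved by motion_adaptation.py (a pickled dictionary).
motion = np.load(
    "data/humanoid_pose/g1/kick_chunk_0000.npy", allow_pickle=True
).item()

root_trans = motion["root_trans"]  # per-frame root position, e.g. (num_frames, 3)
root_ori   = motion["root_ori"]    # per-frame root orientation
dof_pos    = motion["dof_pos"]     # per-frame joint positions, e.g. (num_frames, num_dofs)
fps        = motion["fps"]         # playback frame rate

print(f"{len(root_trans)} frames at {fps} fps, {dof_pos.shape[-1]} DoFs")
```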
We support Unitree G1 and H1-2, but you can also retarget to custom humanoid robots. See our Custom Robot Integration Guide for details.
To reproduce our reported quantitative results, use the provided data splits located in `data/split/`:
- `phuma_train.txt`
- `phuma_test.txt`
- `unseen_video.txt`
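A small sketch for consuming a split file, assuming each line names one motion clip (verify against the released files):

```python
from pathlib import Path

# Read one of the provided splits; we assume one clip name per line.
split_path = Path("data/split/phuma_test.txt")
clips = [line.strip() for line in split_path.read_text().splitlines() if line.strip()]
print(f"{len(clips)} clips in {split_path.name}; first entry: {clips[0]}")
```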
LAFAN1 Retargeted Data: Available here.
LocoMuJoCo Retargeted Data: Available here.
For motion tracking and path following tasks, we utilize the codebase from MaskedMimic.
FAQ List
**Q: Are you planning to release either the original or preprocessed human pose files?**

A: Unfortunately, we cannot release the human pose files of PHUMA Train/Test (phuma_train.txt and phuma_test.txt) due to license issues. However, we will release the SMPL human pose files for PHUMA Video (unseen_video.txt) soon! We haven't released them yet because our current codebase only supports SMPL-X. We will update the SMPL processing code soon and release the data at the same time.
Q: I want to process custom SMPL-X files with your code, but the orientation processing seems different.
A: For SMPL-X processing, we mainly follow the code of Motion-X. Taking AMASS as an example, we follow this code (except the face motion augmentation, since we focus on locomotion).
Q: Some motions in PHUMA seem to have minor penetration or floating. Am I doing something wrong?
A: The default threshold values in the curation stage are tuned to preserve motions with airborne phases (e.g., jumping) while filtering out physically implausible motions. This trade-off means some motions may contain minor artifacts. If you need stricter filtering for specific locomotion types (e.g., walking only), you can adjust the curation thresholds such as --foot_contact_threshold. See the Tuning Curation Thresholds section for details.
Q: Can I retarget motions to custom humanoid robots using the PHUMA pipeline?
A: Yes! While the PHUMA dataset is provided for Unitree G1 and H1-2, you can use our PhySINK retargeting pipeline with custom robots by following our Custom Robot Integration Guide. The guide covers adding heel/toe keypoints, creating configuration files, and tuning the retargeting process for your robot.
If you use this dataset or code in your research, please cite our paper:
```bibtex
@article{lee2025phuma,
  title={PHUMA: Physically-Grounded Humanoid Locomotion Dataset},
  author={Kyungmin Lee and Sibeen Kim and Minho Park and Hyunseung Kim and Dongyoon Hwang and Hojoon Lee and Jaegul Choo},
  journal={arXiv preprint arXiv:2510.26236},
  year={2025}
}
```