This dataset is part of our work AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild, which was published in IJCV. The paper is available at arXiv:2010.13302.
Fig. 1: Human models used in Occlusion-Person.
The previous benchmarks do not provide occlusion labels for the joints in images, which prevents numerical evaluation on the occluded joints. In addition, the amount of occlusion in these benchmarks is limited. To address these limitations, we construct the synthetic dataset Occlusion-Person. We adopt UnrealCV to render multiview images and depth maps from 3D models.
In particular, thirteen human models with different clothing are placed in nine different scenes such as living rooms, bedrooms, and offices. The human models are driven by poses selected from the CMU Motion Capture database. We purposely use objects such as sofas and desks to occlude some body joints. Eight cameras are placed in each scene to render the multiview images and depth maps. We provide the 3D locations of 15 joints as ground truth.
The occlusion label for each joint in an image is obtained by comparing the joint's depth value (available in the depth map) to the depth of the 3D joint in the camera coordinate system. If the difference between the two depth values is smaller than 30 cm, the joint is not occluded; otherwise, it is occluded (a sketch of this check is shown after the table). The table below compares this dataset to the existing benchmarks. In particular, about 20% of the body joints are occluded in our dataset.
Dataset | Frames | Cameras | Occluded Joints |
---|---|---|---|
Human3.6M | 784k | 4 | - |
Total Capture | 236k | 8 | - |
Panoptic | 36k | 31 | - |
Occlusion-Person | 73k | 8 | 20.3% |
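As referenced above, here is a minimal sketch of the depth-comparison check. The function name, argument names, and the unit (meters) are illustrative assumptions, not the released rendering code; only the 30 cm rule comes from the description above.

```python
import numpy as np

def occlusion_labels(depth_map, joints_2d, joints_cam, threshold=0.3):
    """Return a boolean array that is True where a joint is occluded.

    A joint counts as occluded when the depth stored in the rendered depth map
    at its pixel differs from the joint's depth in the camera frame by 30 cm
    or more. The threshold of 0.3 assumes depths in meters; the names and
    units here are illustrative, not the released rendering code.

    depth_map  : (H, W) rendered depth map
    joints_2d  : (15, 2) pixel coordinates (x, y)
    joints_cam : (15, 3) joint locations in the camera frame (z is depth)
    """
    occluded = np.zeros(len(joints_2d), dtype=bool)
    h, w = depth_map.shape
    for i, (x, y) in enumerate(joints_2d):
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= xi < w and 0 <= yi < h):
            continue  # joint outside the image: no depth value to compare against
        surface_depth = depth_map[yi, xi]  # depth of the first visible surface at that pixel
        joint_depth = joints_cam[i, 2]     # depth of the 3D joint itself
        occluded[i] = abs(joint_depth - surface_depth) >= threshold
    return occluded
```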
Fig. 2: Typical images, ground-truth 2D joint locations, and the corresponding depth maps. A joint marked with a red x is occluded.
(We now provide a script to automatically download the data. Please see the next section; this section can be skipped.)
Please manually download the data from OneDrive to a folder, e.g. `./data`. The archive is split into 53 parts due to the per-file size limit of OneDrive; each part is about 1 GB.
After all parts are fully downloaded, you should have files like this:
```
data
├── occlusion_person.zip.001
├── occlusion_person.zip.002
├── ...
├── occlusion_person.zip.053
```
You can run `find ./data -type f | xargs md5sum > downloaded_checksum.txt` to generate the MD5 checksums for all files (this may take a long time). Then compare the result to our pre-generated checksum file `checksum.txt` with `diff checksum.txt downloaded_checksum.txt`.
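If `md5sum` is not available (for example on Windows), the following Python sketch produces a similar listing; the exact line format of `checksum.txt` is an assumption, so you may need to adjust the output format before diffing.

```python
import hashlib
from pathlib import Path

# Hash every downloaded file under ./data and write "checksum  path" lines,
# roughly mirroring md5sum output (two spaces between hash and path).
with open("downloaded_checksum.txt", "w") as out:
    for path in sorted(Path("./data").rglob("*")):
        if not path.is_file():
            continue
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        out.write(f"{md5.hexdigest()}  {path}\n")
```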
Then extract the archive with `7z x ./data/occlusion_person.zip.001` (7-Zip picks up the remaining parts automatically). You should now have `images.zip` in the current directory.
To download the data automatically instead, run the script with Python 3:

```
pip install python3-wget
python download.py
```
We also provide the train/val annotation files used in our experiments at OneDrive/annot.
Finally, organize the images and annotations into the structure below:

```
unrealcv
├── images.zip
└── annot
    ├── unrealcv_train.pkl
    └── unrealcv_validation.pkl
```
All done.
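As a quick sanity check of the layout above (purely illustrative; the paths assume the root folder is named `unrealcv` as shown):

```python
from pathlib import Path

# Files expected by the directory structure above.
expected = [
    "unrealcv/images.zip",
    "unrealcv/annot/unrealcv_train.pkl",
    "unrealcv/annot/unrealcv_validation.pkl",
]
missing = [p for p in expected if not Path(p).is_file()]
print("all files in place" if not missing else f"missing: {missing}")
```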
An annotation file (`.pkl`) contains a list of items. Each item is associated with an image by the `"image"` attribute. All attributes are listed below with their meanings.
| Attribute | Description |
| --- | --- |
| `image`: str, e.g. `a05_sa01_s14_sce00/00_000000.jpg` | the path to the associated image file |
| `joints_2d`: ndarray (15, 2) | 2D ground-truth joint locations (x, y) in the image frame |
| `joints_3d`: ndarray (15, 3) | 3D ground-truth joint locations (x, y, z) in the camera frame |
| `joints_gt`: ndarray (15, 3) | 3D ground-truth joint locations (x, y, z) in the global frame |
| `joints_vis`: ndarray (15, 1) | indicates whether the joint is within the image boundary |
| `joints_vis_2d`: ndarray (15, 1) | indicates whether the joint is within the image boundary and not occluded |
| `center`: (2,) | ground-truth bounding box center in the image frame |
| `scale`: (2,) | ground-truth bounding box size in the image frame (multiply the value by 200) |
| `box`: (4,) | ground-truth bounding box (top-left and bottom-right coordinates; can also be inferred from `center` and `scale`) |
| `video_id`: str, e.g. `a05_sa01_s14_sce00` | |
| `image_id`: int | `video_id` and `image_id` can be used to differentiate frames |
| `subject`: int | |
| `action`: int | |
| `subaction`: int | |
| `camera_id`: int | 0-7 in this dataset |
| `camera`: dict | camera extrinsic and intrinsic parameters; note that the `T` definition is different. For detailed information please refer to our released code (TODO) |
| `source`: str | an alias for the dataset name; the same across a dataset |
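For illustration, here is a minimal sketch of loading and inspecting an annotation file with Python's `pickle` module. The path follows the directory structure above and the attribute names follow the table; treating each item as a dict is an assumption about the on-disk format.

```python
import pickle

# Load the training annotations (path follows the directory structure above).
with open("unrealcv/annot/unrealcv_train.pkl", "rb") as f:
    annotations = pickle.load(f)

item = annotations[0]           # each item describes one image
print(item["image"])            # e.g. a05_sa01_s14_sce00/00_000000.jpg
print(item["joints_2d"].shape)  # (15, 2) pixel coordinates
print(item["joints_3d"].shape)  # (15, 3) camera-frame coordinates

# Joints that are inside the image but occluded, assuming 0/1 visibility flags.
vis = item["joints_vis"].reshape(-1)
vis_2d = item["joints_vis_2d"].reshape(-1)
print("occluded joints in this view:", int(((vis > 0) & (vis_2d == 0)).sum()))
```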
If you use this dataset, please consider citing our work.
```
@article{zhang2020adafuse,
  title={AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild},
  author={Zhe Zhang and Chunyu Wang and Weichao Qiu and Wenhu Qin and Wenjun Zeng},
  year={2020},
  journal={IJCV},
  publisher={Springer},
  pages={1--16},
}
```