Google MediaPipe for Pose Estimation

MediaPipe is a cross-platform framework for building multimodal applied machine learning pipelines including inference models and media processing functions.

The main purpose of this repo is to:

  • Customize the output of MediaPipe solutions
  • Customize the visualization of 2D & 3D outputs
  • Demo some simple applications in Python (refer to Demo Overview)
  • Demo some simple applications in JavaScript (refer to the java folder)

Pose Estimation with Input Color Image

Advantages of Google MediaPipe compared to other state-of-the-art (SOTA) methods (e.g. FrankMocap, CMU OpenPose, DeepPoseKit, DeepLabCut, MinimalHand):

  • Fast: Runs at near real-time rates on CPU and even on mobile devices
  • Open-source: Code is freely available on GitHub (although details of the network models are not released)
  • User-friendly: For the Python API, a simple pip install mediapipe works (the C++ API is much more troublesome to build and use)
  • Cross-platform: Works across Android, iOS, desktop, JavaScript and web (Note: this repo mainly focuses on using the Python API for desktop usage)
  • ML Solutions: Apart from face, hand, body and object pose estimation, MediaPipe offers an array of machine learning applications (refer to their GitHub for more details)

Features

The latest MediaPipe Python API version 0.8.9.1 (released 14 Dec 2021) features:

Face Detect (2D face detection)

Face Mesh (468/478 3D face landmarks)

Hands (21 3D landmarks, supports multiple hands, 2 levels of model complexity) (NEW: world coordinates)

Body Pose (33 3D landmarks for whole body, 3 levels of model complexity)

Holistic (Face + Hands + Body) (A total of 543/535 landmarks: 468 face + 2 x 21 hands + 33/25 pose)

Objectron (3D object detection and tracking) (4 possible objects: Shoe / Chair / Camera / Cup)

Selfie Segmentation (Segments humans for selfie effects and video conferencing)

Note: The above videos were presented at the CVPR 2020 Fourth Workshop on Computer Vision for AR/VR; interested readers can refer to the link for other related works.

Installation

The simplest way to run our implementation is to use Anaconda.

You can create an Anaconda environment called mp with:

conda env create -f environment.yaml
conda activate mp

Demo Overview

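  • Single Image (00_image.py)
  • Video Input (01_video.py)
  • Gesture Recognition (02_gesture.py)
  • Rock Paper Scissors Game (03_game_rps.py)
  • Measure Hand ROM (04_hand_rom.py)
  • Measure Wrist and Forearm ROM (05_wrist_rom.py)
  • Face Mask (06_face_mask.py)
  • Triangulate Points for 3D Pose (07_triangulate.py)
  • 3D Skeleton (08_skeleton_3D.py)
  • 3D Object Detection (09_objectron.py)
  • Selfie Segmentation (10_segmentation.py)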

Usage

Single Image

Five different modes are available, and sample images are located in the data/sample/ folder:

python 00_image.py --mode face_detect
python 00_image.py --mode face
python 00_image.py --mode hand
python 00_image.py --mode body
python 00_image.py --mode holistic
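Customizing the 2D visualization usually starts by mapping landmarks back to pixels: MediaPipe reports landmark x and y normalized to [0, 1] by the image width and height. A minimal sketch, assuming any object with .x and .y attributes; the Landmark dataclass below is a hypothetical stand-in for MediaPipe's landmark messages.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Landmark:
    """Hypothetical stand-in for a MediaPipe normalized landmark."""

    x: float  # normalized by image width, nominally in [0, 1]
    y: float  # normalized by image height, nominally in [0, 1]


def to_pixel(lm: Landmark, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Convert a normalized landmark to integer pixel coordinates.

    Returns None when the landmark lies outside the image, which can happen
    because predicted points are not guaranteed to stay inside the frame.
    """
    if not (0.0 <= lm.x <= 1.0 and 0.0 <= lm.y <= 1.0):
        return None
    # Clamp so that a value of exactly 1.0 maps to the last valid pixel.
    px = min(int(lm.x * width), width - 1)
    py = min(int(lm.y * height), height - 1)
    return px, py


if __name__ == "__main__":
    print(to_pixel(Landmark(0.5, 0.25), 640, 480))  # (320, 120)
```

Because the function only reads .x and .y, it can be called directly on a landmark from a MediaPipe result, with width and height taken from the image (for an OpenCV image, height, width = image.shape[:2]).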

Note: The sample images of the subject with body markers are adapted from An Asian-centric human movement database capturing activities of daily living, and the image of the Mona Lisa is adapted from Wiki.

Video Input

Five different modes are available, and video can be captured online through a webcam or offline from your own .mp4 file:

python 01_video.py --mode face_detect
python 01_video.py --mode face
python 01_video.py --mode hand
python 01_video.py --mode body
python 01_video.py --mode holistic

Note: It runs at around 10 to 30 FPS on CPU, depending on the selected mode. The video demonstrating supported mini-squats is adapted from the National Stroke Association.
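To check frame rates like these on your own machine, a rolling-average FPS counter is enough. A minimal sketch, assuming it is called once per processed frame inside the capture loop; the FpsCounter class and its window parameter are hypothetical, not part of 01_video.py.

```python
import time
from collections import deque
from typing import Deque, Optional


class FpsCounter:
    """Rolling-average frames-per-second estimate over the last `window` frames."""

    def __init__(self, window: int = 30) -> None:
        # Keep window + 1 timestamps so that there are `window` intervals.
        self._stamps: Deque[float] = deque(maxlen=window + 1)

    def tick(self, now: Optional[float] = None) -> float:
        """Record one processed frame and return the current FPS estimate.

        `now` can be passed explicitly (e.g. for testing); by default the
        monotonic time.perf_counter() clock is used.
        """
        self._stamps.append(time.perf_counter() if now is None else now)
        if len(self._stamps) < 2:
            return 0.0
        elapsed = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / elapsed if elapsed > 0 else 0.0
```

Averaging over a window smooths out frame-to-frame jitter; the returned value can then be drawn onto the frame, for example with cv2.putText.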

Gesture Recognition

Two modes are available: use evaluation mode (--mode eval) to recognize 11 gestures, and use train mode (--mode train) to log your own training data:

python 02_gesture.py --mode eval
python 02_gesture.py --mode train

Note: A simple but effective K-nearest neighbor (KNN) algorithm is used as the classifier. For the hand gesture recognition demo, since 3D hand joints are available, we can compute flexion joint angles as the feature vector and use them to classify different hand poses. For body poses, where the 3D body joints may not yet be reliable, the normalized pairwise distances between predefined lists of joints, as described in MediaPipe Pose Classification, could be used as the KNN feature vector instead.
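The joint-angle plus KNN idea can be sketched in a few lines. This is a minimal illustration, assuming 21 hand landmarks in MediaPipe's numbering (0 = wrist, then four landmarks per finger) and a feature vector of 15 bending angles (three per finger); the exact features, k and labels used by 02_gesture.py may differ.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]

# MediaPipe hand landmark numbering: 0 is the wrist, then 4 landmarks per finger.
FINGER_CHAINS = [
    [0, 1, 2, 3, 4],      # thumb
    [0, 5, 6, 7, 8],      # index
    [0, 9, 10, 11, 12],   # middle
    [0, 13, 14, 15, 16],  # ring
    [0, 17, 18, 19, 20],  # pinky
]


def angle_between(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in degrees between two 3D vectors (0 if either has zero length)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding


def flexion_angles(joints: Sequence[Point3]) -> List[float]:
    """15-D feature: bending angle at the 3 inner joints of each finger chain.

    Measured between consecutive bone vectors, so a straight finger gives
    about 0 degrees and a curled finger gives larger angles.
    """
    feats = []
    for chain in FINGER_CHAINS:
        for i in range(1, len(chain) - 1):
            p0, p1, p2 = (joints[chain[i + d]] for d in (-1, 0, 1))
            bone_in = [b - a for a, b in zip(p0, p1)]
            bone_out = [b - a for a, b in zip(p1, p2)]
            feats.append(angle_between(bone_in, bone_out))
    return feats


def knn_predict(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[str],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Majority label among the k training vectors nearest to query (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda xy: math.dist(xy[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]
```

In train mode, each logged hand pose would be converted with flexion_angles and stored alongside its gesture label; in eval mode, the live feature vector is passed to knn_predict. Angles have the advantage of being independent of hand size and distance to the camera.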

Rock Paper Scissors Game

This simple game of rock paper scissors requires a pair of hands facing the camera:

python 03_game_rps.py
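Once each hand's gesture has been classified, deciding the round is a small lookup. A minimal sketch; the label strings "rock", "paper", "scissors" and the "left"/"right" return values are assumptions for illustration, not necessarily what 03_game_rps.py uses.

```python
from typing import Optional

# Each gesture beats exactly one other gesture.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def rps_winner(left: str, right: str) -> Optional[str]:
    """Return "left", "right", or None for a draw, given the two hands' gestures.

    Raises ValueError for labels outside the three game gestures (e.g. when
    the gesture classifier outputs some other hand pose).
    """
    if left not in BEATS or right not in BEATS:
        raise ValueError(f"unsupported gesture: {left!r} vs {right!r}")
    if left == right:
        return None
    return "left" if BEATS[left] == right else "right"
```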

For another game, Flappy Bird, refer to this GitHub repository.

Measure Hand ROM

Two modes are available: use evaluation mode (--mode eval) to perform hand range of motion (ROM) recognition, and use train mode (--mode train) to log your own training data:

python 04_hand_rom.py --mode eval
python 04_hand_rom.py --mode train
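The ROM of a single finger joint can be expressed as the spread of its angle across frames. A minimal sketch, assuming the per-frame angle is the interior angle at a joint computed from three 3D landmarks (180 degrees = fully straight); whether 04_hand_rom.py measures it this way, or reports flexion as 180 degrees minus this value, is an assumption.

```python
import math
from typing import Iterable, Tuple

Point3 = Tuple[float, float, float]


def joint_angle(a: Point3, b: Point3, c: Point3) -> float:
    """Interior angle ABC in degrees at joint b (180 = fully straight)."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    norm = math.hypot(*ba) * math.hypot(*bc)
    if norm == 0:
        raise ValueError("degenerate joint: coincident points")
    cos = sum(x * y for x, y in zip(ba, bc)) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def range_of_motion(angles: Iterable[float]) -> Tuple[float, float, float]:
    """Return (min, max, max - min) of a joint angle tracked over frames."""
    values = list(angles)
    if not values:
        raise ValueError("no angles recorded")
    lo, hi = min(values), max(values)
    return lo, hi, hi - lo
```

Taking the minimum and maximum over a recording means a single mis-detected frame can inflate the ROM; filtering or median smoothing of the angle sequence first would make the estimate more robust.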

Measure Wrist and Forearm ROM

Three modes are available, and the user has to specify the side of the hand to be measured with --side:

  • 0: Wrist flexion/extension
  • 1: Wrist radial/ulnar deviation
  • 2: Forearm pronation/supination

python 05_wrist_rom.py --mode 0 --side right
python 05_wrist_rom.py --mode 1 --side right
python 05_wrist_rom.py --mode 2 --side right
python 05_wrist_rom.py --mode 0 --side left
python 05_wrist_rom.py --mode 1 --side left
python 05_wrist_rom.py --mode 2 --side left

Note: For measuring forearm pronation/supination, the camera has to be placed at the same level as the hand such that the palmar side of the hand directly faces the camera. For measuring wrist ROM, the camera has to be placed such that the upper body of the subject is visible; refer to the wrist_XXX.png example images in the data/sample/ folder. The wrist images are adapted from Goni Wrist Flexion, Extension, Radial & Ulnar Deviation.
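For modes 0 and 1, the wrist angle can be thought of as the deviation of the hand axis from the forearm axis. A minimal 2D sketch, assuming the elbow comes from the body model and the wrist and a knuckle (e.g. the middle-finger MCP) come from the hand model, all in pixel coordinates; the sign convention and the mirroring for the left side are illustrative assumptions, not the method used in 05_wrist_rom.py.

```python
import math
from typing import Tuple

Point2 = Tuple[float, float]


def signed_angle_2d(u: Point2, v: Point2) -> float:
    """Signed angle in degrees that rotates vector u onto vector v, in (-180, 180]."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return math.degrees(math.atan2(cross, dot))


def wrist_angle(elbow: Point2, wrist: Point2, knuckle: Point2, side: str) -> float:
    """Deviation of the hand from the forearm axis, in degrees.

    0 means the hand is in line with the forearm. The sign is mirrored for the
    left side so the same anatomical motion gives the same sign on both hands;
    which sign means flexion vs extension depends on camera placement and on
    the image y-axis pointing down (an assumption to verify per setup).
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    forearm = (wrist[0] - elbow[0], wrist[1] - elbow[1])
    hand = (knuckle[0] - wrist[0], knuckle[1] - wrist[1])
    angle = signed_angle_2d(forearm, hand)
    return angle if side == "right" else -angle
```

Using atan2 of the cross and dot products keeps the result signed and well defined over the full circle, unlike acos, which only returns unsigned angles.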

Face Mask

Overlay a 3D face mask on the detected face in the image plane:

python 06_face_mask.py

Note: The face image is adapted from MediaPipe 3D Face Transform
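Overlaying a 3D mask comes down to placing the mask's vertices in the camera frame (using the estimated face pose) and projecting them onto the image. A minimal sketch of the projection step only, assuming a distortion-free pinhole camera with hypothetical intrinsics fx, fy, cx, cy; estimating the face pose itself is not covered here.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def project_points(
    points: Sequence[Point3], fx: float, fy: float, cx: float, cy: float
) -> List[Tuple[float, float]]:
    """Project camera-frame 3D points (Z > 0, in front of the camera) to pixels.

    Pinhole model without lens distortion:
        u = fx * X / Z + cx,   v = fy * Y / Z + cy
    """
    out = []
    for x, y, z in points:
        if z <= 0:
            raise ValueError("point is behind the camera")
        out.append((fx * x / z + cx, fy * y / z + cy))
    return out
```

With OpenCV, cv2.projectPoints performs the same projection while also applying a rotation, translation and lens distortion coefficients.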

Triangulate Points for 3D Pose

Estimating 3D body pose from a single 2D image is an ill-posed and extremely challenging problem. One way to reconstruct 3D body pose is to use a multiview setup and perform triangulation. For offline testing, use the CMU Panoptic Dataset: follow the instructions on the PanopticStudio Toolbox to download the sample dataset 171204_pose1_sample into the data/ folder:

python 07_triangulate.py --mode body --use_panoptic_dataset
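Triangulation can be posed as a small linear least-squares problem: each calibrated view with a 3x4 projection matrix P and an observed pixel (u, v) contributes the two equations u*P[2] - P[0] and v*P[2] - P[1] acting on the homogeneous point (X, Y, Z, 1). A minimal pure-Python sketch of this linear (DLT-style) method, assuming the projection matrices are already known (for the Panoptic data they would come from the dataset's calibration). It fixes the homogeneous scale to 1 and solves the normal equations, which is simpler but less numerically robust than the usual SVD formulation; it is not necessarily what 07_triangulate.py does.

```python
from typing import List, Sequence, Tuple

Matrix34 = Sequence[Sequence[float]]  # 3x4 camera projection matrix


def _solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate camera configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0] * 3
    for r in range(2, -1, -1):  # back substitution
        s = m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / m[r][r]
    return x


def triangulate(
    projections: Sequence[Matrix34], pixels: Sequence[Tuple[float, float]]
) -> Tuple[float, float, float]:
    """Linear triangulation of one point observed in two or more views."""
    if len(projections) < 2 or len(projections) != len(pixels):
        raise ValueError("need at least two views and one pixel per view")
    rows, rhs = [], []
    for p, (u, v) in zip(projections, pixels):
        for coord, k in ((u, 0), (v, 1)):
            row = [coord * p[2][j] - p[k][j] for j in range(4)]
            rows.append(row[:3])  # coefficients of X, Y, Z
            rhs.append(-row[3])   # homogeneous W = 1 moved to the right side
    # Normal equations: (A^T A) x = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    x, y, z = _solve3(ata, atb)
    return x, y, z
```

For two views, OpenCV's cv2.triangulatePoints(P1, P2, pts1, pts2) performs linear triangulation and returns homogeneous 4xN points; with more cameras, as in the Panoptic setup, adding views simply adds rows to the system above.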

3D Skeleton

3D pose estimation is available in full-body mode, and this demo displays the estimated 3D skeleton of the hand and/or body. Three different modes are available, and video can be captured online through a webcam or offline from your own .mp4 file:

python 08_skeleton_3D.py --mode hand
python 08_skeleton_3D.py --mode body
python 08_skeleton_3D.py --mode holistic
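Drawing a 3D skeleton amounts to drawing one line segment per connection between landmarks. A minimal sketch for the hand, with the 21-connection topology written out by hand following MediaPipe's landmark numbering (mediapipe also ships this as mp.solutions.hands.HAND_CONNECTIONS); the helper names are hypothetical. Bone lengths are included because checking that they stay roughly constant across frames is a cheap sanity check on the 3D estimate.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point3 = Tuple[float, float, float]

# Hand topology following MediaPipe's 21-landmark numbering (0 = wrist).
HAND_CONNECTIONS: List[Tuple[int, int]] = [
    (0, 1), (1, 2), (2, 3), (3, 4),         # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),         # index
    (9, 10), (10, 11), (11, 12),            # middle
    (13, 14), (14, 15), (15, 16),           # ring
    (0, 17), (17, 18), (18, 19), (19, 20),  # pinky
    (5, 9), (9, 13), (13, 17),              # palm
]


def skeleton_segments(
    joints: Sequence[Point3], connections: Sequence[Tuple[int, int]] = HAND_CONNECTIONS
) -> List[Tuple[Point3, Point3]]:
    """Return the (start, end) 3D points of every bone to draw."""
    return [(joints[a], joints[b]) for a, b in connections]


def bone_lengths(
    joints: Sequence[Point3], connections: Sequence[Tuple[int, int]] = HAND_CONNECTIONS
) -> Dict[Tuple[int, int], float]:
    """Euclidean length of every bone, keyed by its (start, end) landmark indices."""
    return {(a, b): math.dist(joints[a], joints[b]) for a, b in connections}
```

Each segment can then be plotted, for example with matplotlib's 3D axes by passing the segment's x, y and z coordinates to Axes3D.plot.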

3D Object Detection

Four different modes are available, and a sample image is located in the data/sample/ folder. Objectron currently supports 4 classes: Shoe / Chair / Cup / Camera.

python 09_objectron.py --mode shoe
python 09_objectron.py --mode chair
python 09_objectron.py --mode cup
python 09_objectron.py --mode camera
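Objectron represents each detection as a 3D bounding box, and such a box is fully determined by a center, an orientation and a size (MediaPipe's Objectron results also expose per-object rotation, translation and scale; check the solution docs for their exact conventions). A minimal sketch of recovering the 8 box corners from those quantities, assuming a 3x3 rotation matrix and a per-axis box size; the corner ordering is hypothetical and does not follow MediaPipe's keypoint order.

```python
import itertools
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def box_corners(
    center: Point3, rotation: Sequence[Sequence[float]], size: Point3
) -> List[Point3]:
    """8 corners of an oriented 3D box.

    Each corner is center + R @ (sign * size / 2) for every sign pattern in
    {-1, +1}^3. The order is itertools.product order, for illustration only.
    """
    corners = []
    for signs in itertools.product((-1.0, 1.0), repeat=3):
        local = [sg * s / 2.0 for sg, s in zip(signs, size)]
        world = tuple(
            c + sum(rotation[i][j] * local[j] for j in range(3))
            for i, c in enumerate(center)
        )
        corners.append(world)
    return corners
```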

Selfie Segmentation

Two modes are available. The landscape model has fewer FLOPs than the general model and may run faster. Selfie segmentation works best for selfie effects and video conferencing, where the person is close (< 2 m) to the camera:

python 10_segmentation.py --mode general
python 10_segmentation.py --mode landscape
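The segmentation output is a per-pixel mask whose values indicate how likely each pixel belongs to the person, so a background replacement effect reduces to a per-pixel choice between the camera frame and a background image. A minimal dependency-free sketch on nested lists; the default threshold of 0.5 is an arbitrary illustrative value, not one taken from 10_segmentation.py.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def composite(
    image: Sequence[Sequence[Pixel]],
    background: Sequence[Sequence[Pixel]],
    mask: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[List[Pixel]]:
    """Per pixel: keep the camera image where mask > threshold, else the background."""
    if not (len(image) == len(background) == len(mask)):
        raise ValueError("image, background and mask must have the same height")
    out = []
    for img_row, bg_row, m_row in zip(image, background, mask):
        if not (len(img_row) == len(bg_row) == len(m_row)):
            raise ValueError("rows must have the same width")
        out.append([px if m > threshold else bg for px, bg, m in zip(img_row, bg_row, m_row)])
    return out
```

With NumPy arrays of shape (H, W, 3) for the images and (H, W) for the mask, the same selection is np.where(mask[..., None] > threshold, image, background); using the mask as an alpha weight instead of thresholding gives softer edges.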

Limitations:

Estimating 3D pose from a single 2D image is an ill-posed and extremely challenging problem, so the ROM measurements may not be accurate. Please refer to the respective model cards for details on other limitations, such as lighting, motion blur, occlusion and image resolution.