A system that can detect pose, face, and hands simultaneously in real time opens the door to novel applications such as sports and fitness analytics, gesture control, sign language recognition, and augmented reality effects like those on Instagram. Building such a system is uniquely difficult because it requires simultaneous inference of multiple, dependent neural networks.
Google previously developed MediaPipe, an open-source framework designed for complex perception pipelines that use accelerated inference, with separate solutions for face, hands, and pose. Google AI has now taken this a step further by combining them in "MediaPipe Holistic", a solution for real-time, simultaneous perception of human pose, face landmarks, and hands.
MediaPipe Holistic is a state-of-the-art topology that perceives face, hands, and pose simultaneously in a novel way. It consists of new face, hand, and pose perception components that are optimized for real-time use by minimizing memory transfer between inference backends, which lets the pipeline run consistently in real time.
The pipeline supports interchangeable components, so each one can be swapped depending on the desired tradeoff between quality and speed. Combining all three components yields a unified topology that perceives 540+ keypoints in real time (33 for pose, 21 per hand, and 468 facial landmarks), a groundbreaking result, and it achieves near real-time performance on mobile devices.
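The "540+" figure follows directly from the per-component counts quoted above. The short sketch below only does that arithmetic; the flat array layout and the component order are assumptions made for illustration, not something the release specifies.

```python
# Landmark counts as stated in the text above.
POSE_LANDMARKS = 33
HAND_LANDMARKS = 21   # per hand
FACE_LANDMARKS = 468
NUM_HANDS = 2


def total_keypoints() -> int:
    """Total landmarks when pose, both hands, and face are all detected."""
    return POSE_LANDMARKS + NUM_HANDS * HAND_LANDMARKS + FACE_LANDMARKS


def flat_layout() -> dict:
    """Hypothetical layout: (start, end) slice of each component in one merged array."""
    components = [
        ("pose", POSE_LANDMARKS),
        ("left_hand", HAND_LANDMARKS),
        ("right_hand", HAND_LANDMARKS),
        ("face", FACE_LANDMARKS),
    ]
    layout = {}
    start = 0
    for name, count in components:
        layout[name] = (start, start + count)
        start += count
    return layout


print(total_keypoints())  # 543
```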
How it works
The MediaPipe Holistic pipeline integrates three separate models for face, pose, and hand perception, each optimized for its own domain. Because the components are specialized, the input that suits one may not suit the others. For example, the pose estimation model takes a low-resolution video frame as input, but if the hand and face regions are cropped from that same frame and passed to their respective models, the image resolution may be too low for accurate perception. Google AI therefore designed MediaPipe Holistic as a multi-stage pipeline that processes each region at an appropriate image resolution.
First, MediaPipe Holistic estimates the human pose with the pose detector and the subsequent pose landmark model. Using the inferred pose keypoints, it derives three regions of interest (ROIs): one for each hand and one for the face. A re-crop model then refines each ROI. The pipeline crops the full-resolution input frame to these ROIs and applies the dedicated hand and face models to estimate their keypoints. Finally, it merges these with the pose keypoints, yielding the full 540+ keypoints.
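The crop-then-merge idea in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not MediaPipe's implementation: the function names, the 25 percent margin, and the representation of a frame as a list of pixel rows are all hypothetical. It shows how a region of interest could be derived from a few pose keypoints, cut out of the full-resolution frame, and how keypoints found inside the crop could be mapped back to frame coordinates for merging.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]        # normalized (x, y) in [0, 1]
Box = Tuple[int, int, int, int]    # pixel box: (left, top, right, bottom)


def roi_from_keypoints(points: Sequence[Point], width: int, height: int,
                       margin: float = 0.25) -> Box:
    """Pixel box around normalized keypoints, padded by `margin`, clamped to the frame."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    left = max(0, math.floor((min(xs) - pad_x) * width))
    top = max(0, math.floor((min(ys) - pad_y) * height))
    right = min(width, math.ceil((max(xs) + pad_x) * width))
    bottom = min(height, math.ceil((max(ys) + pad_y) * height))
    return left, top, right, bottom


def crop(frame: List[list], box: Box) -> List[list]:
    """Cut the box out of a frame stored as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


def to_frame_coords(points: Sequence[Point], box: Box,
                    width: int, height: int) -> List[Point]:
    """Map keypoints normalized to the crop back to full-frame normalized coordinates."""
    left, top, right, bottom = box
    return [((left + x * (right - left)) / width,
             (top + y * (bottom - top)) / height) for x, y in points]


# Demo on a blank 64x64 "full-resolution" frame with two pose keypoints near a hand.
frame = [[0] * 64 for _ in range(64)]
box = roi_from_keypoints([(0.25, 0.25), (0.75, 0.5)], width=64, height=64)
patch = crop(frame, box)
merged = to_frame_coords([(0.5, 0.5)], box, width=64, height=64)
print(box)  # (8, 12, 56, 36)
```

In a real pipeline the cropped patch would be passed to the hand or face model, and the returned keypoints would go through a mapping like `to_frame_coords` before being merged with the pose keypoints.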
To identify ROIs efficiently, the pipeline uses the same tracking approach as the standalone face and hand models. This approach assumes that the object does not move significantly between frames, so the estimate from the previous frame guides the object region in the current frame. During fast movements, however, the tracker can lose the target, and the detector must then re-localize it in the image. To reduce the response time in such cases, the pipeline predicts the pose on every frame and uses it as an additional ROI prior. This approach also keeps the model consistent across the body and prevents mix-ups between the left and right hands, or between body parts.
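One way to picture the interplay between frame-to-frame tracking and the per-frame pose prior is the sketch below. The overlap test and its 0.5 threshold are assumptions chosen for illustration; the source does not say how MediaPipe Holistic decides that the tracker has lost its target.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # normalized (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def choose_roi(previous: Optional[Box], pose_prior: Box,
               min_overlap: float = 0.5) -> Box:
    """Keep the tracked ROI while it agrees with the pose prior; otherwise use the prior."""
    if previous is None:
        return pose_prior              # nothing tracked yet
    if iou(previous, pose_prior) < min_overlap:
        return pose_prior              # fast movement: tracked ROI is stale
    return previous                    # slow movement: tracking is still valid


slow = choose_roi((0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.625))
fast = choose_roi((0.0, 0.0, 0.25, 0.25), (0.5, 0.5, 0.75, 0.75))
print(slow)  # (0.0, 0.0, 0.5, 0.5)
print(fast)  # (0.5, 0.5, 0.75, 0.75)
```

Because the pose-derived box is available on every frame, the fallback needs no separate hand or face detector pass, which is the response-time benefit the paragraph above describes.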
The input frame for the pose estimation model has a low resolution, so the face and hand ROIs derived from it are too imprecise to guide the cropping of those regions, whose models need an accurate input crop. To close this gap, lightweight but precise re-crop models for the face and hands act as spatial transformers and cost only about 10 percent of the corresponding model's inference time.
MediaPipe Holistic’s performance
MediaPipe Holistic must coordinate up to 8 models per frame: 1 pose detector, 1 pose landmark model, 3 re-crop models, and 3 keypoint models for the hands and face. While building the solution, Google optimized not only the machine learning models but also the pre- and post-processing algorithms, which take a significant amount of compute time because of the pipeline's complexity. Moving all the pre-processing computations to the GPU sped up the overall pipeline by about 1.5 times, depending on the device. As a result, MediaPipe Holistic offers near real-time performance even on mid-tier devices and in the browser.
The multi-stage design gives MediaPipe Holistic at least two performance benefits. First, because each model works independently, any of them can be replaced with a lighter or heavier version depending on the performance and accuracy required, or turned off completely. Second, because the pose is inferred first, the pipeline knows whether the hands and face are within the frame bounds and can skip inference on those parts when they are not.
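Both benefits can be sketched in a few lines. The code below is an assumption-laden illustration rather than MediaPipe's API: `run_region_models`, the anchor dictionary, and `dummy_model` are hypothetical names. Each region model is an interchangeable callable, `None` stands for a component that has been turned off, and a region is skipped when its pose-derived anchor point falls outside the frame.

```python
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[float, float]              # normalized (x, y); [0, 1] is inside the frame
Model = Callable[[Point], List[Point]]   # stand-in signature for a landmark model


def in_frame(point: Point) -> bool:
    """True when a normalized point lies within the frame bounds."""
    x, y = point
    return 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0


def run_region_models(anchors: Dict[str, Point],
                      models: Dict[str, Optional[Model]]) -> Dict[str, List[Point]]:
    """Run each region model only if it is enabled and its pose anchor is in frame."""
    results: Dict[str, List[Point]] = {}
    for region, model in models.items():
        if model is None:
            continue                     # component turned off
        anchor = anchors.get(region)
        if anchor is None or not in_frame(anchor):
            continue                     # body part not visible: skip inference
        results[region] = model(anchor)
    return results


def dummy_model(anchor: Point) -> List[Point]:
    """Placeholder that echoes its anchor; a real model would return landmarks."""
    return [anchor]


# The left hand is outside the frame and the right-hand model is switched off.
anchors = {"face": (0.5, 0.25), "left_hand": (1.25, 0.5), "right_hand": (0.75, 0.5)}
models = {"face": dummy_model, "left_hand": dummy_model, "right_hand": None}
results = run_region_models(anchors, models)
print(sorted(results))  # ['face']
```

Swapping in a lighter or heavier model amounts to replacing one entry in `models`, which is why independent stages make the quality and speed tradeoff easy to adjust.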
MediaPipe Holistic’s applications
Google expects that real-time, simultaneous perception of body language, gestures, and facial expressions with MediaPipe Holistic will enable effective remote gesture interfaces, as well as full-body AR, sports analytics, and sign language recognition.
Google also built a remote control interface that runs locally in the user's browser. It lets users manipulate objects on the screen and type on a virtual keyboard while sitting on the sofa, with no mouse or physical keyboard required. Users can also touch a specific region to turn off their camera or mute their microphone. Under the hood, hand detection drives a virtual "trackpad" anchored to the user's shoulder, enabling remote control from up to 4 meters away.
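The shoulder-anchored trackpad can be pictured as a rectangle that moves with the user's shoulder, with the hand position inside that rectangle mapped to screen pixels. The sketch below illustrates the idea only; the pad offset, pad size, and screen size are assumed values, and the function is hypothetical rather than taken from Google's demo.

```python
from typing import Tuple

Point = Tuple[float, float]  # normalized image coordinates (x, y)


def clamp01(value: float) -> float:
    """Clamp a value to the range [0, 1]."""
    return min(1.0, max(0.0, value))


def trackpad_to_screen(hand: Point, shoulder: Point,
                       pad_offset: Point = (-0.25, -0.25),
                       pad_size: Point = (0.5, 0.5),
                       screen_size: Tuple[int, int] = (1920, 1080)) -> Tuple[int, int]:
    """Map a hand position inside a shoulder-anchored pad to screen pixel coordinates."""
    # The pad's top-left corner is defined relative to the shoulder, so the pad
    # follows the user as they move around the room.
    origin_x = shoulder[0] + pad_offset[0]
    origin_y = shoulder[1] + pad_offset[1]
    # Position of the hand inside the pad, clamped so the cursor stays on screen.
    u = clamp01((hand[0] - origin_x) / pad_size[0])
    v = clamp01((hand[1] - origin_y) / pad_size[1])
    px = min(screen_size[0] - 1, int(u * screen_size[0]))
    py = min(screen_size[1] - 1, int(v * screen_size[1]))
    return px, py


print(trackpad_to_screen(hand=(0.5, 0.5), shoulder=(0.5, 0.5)))  # (960, 540)
print(trackpad_to_screen(hand=(1.0, 1.0), shoulder=(0.5, 0.5)))  # (1919, 1079)
```

Anchoring the pad to a body landmark instead of a fixed image region means the mapping stays stable when the user shifts position, which is one plausible reason for the design described above.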
This gesture control technique can unlock novel use cases where other forms of human-computer interaction are impractical. Try Google's MediaPipe Holistic remote control demo at: https://mediapipe.dev/demo/holistic_remote/
MediaPipe for research and web
Google hopes that the MediaPipe Holistic release will inspire researchers to publish more work and will open new horizons for future research and innovation in areas such as sign language recognition, touchless control, and other complex applications.