Drone control via gestures using MediaPipe Hands

A guest post by Neurons Lab

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, Neurons Lab, and not necessarily those of Google.

How the idea emerged

With the advancement of technology, drones have become not only smaller, but also have more compute. There are many examples of iPhone-sized quadcopters in the consumer drone market and the computing power to do live tracking while recording 4K video. However, the most important element has not changed much – the controller. It is still bulky and not intuitive for beginners to use. There is a smartphone with on-display control as an option; however, the control principle is still the same. 

That is how the idea for this project emerged: a more personalised approach to control the drone using gestures. ML Engineer Nikita Kiselov (me) together with consultation from my colleagues at Neurons Lab undertook this project. 

Demonstration of drone flight control via gestures using MediaPipe Hands

Figure 1: [GIF] Demonstration of drone flight control via gestures using MediaPipe Hands

Why use gesture recognition?

Gestures are the most natural way for people to express information in a non-verbal way.  Gesture control is an entire topic in computer science that aims to interpret human gestures using algorithms. Users can simply control devices or interact without physically touching them. Nowadays, such types of control can be found from smart TV to surgery robots, and UAVs are not the exception.

Although gesture control for drones have not been widely explored lately, the approach has some advantages:

  • No additional equipment needed.
  • More human-friendly controls.
  • All you need is a camera that is already on all drones.

With all these features, such a control method has many applications.

Flying action camera. In extreme sports, drones are a trendy video recording tool. However, they tend to have a very cumbersome control panel. The ability to use basic gestures to control the drone (while in action)  without reaching for the remote control would make it easier to use the drone as a selfie camera. And the ability to customise gestures would completely cover all the necessary actions.

This type of control as an alternative would be helpful in an industrial environment like, for example, construction conditions when there may be several drone operators (gesture can be used as a stop-signal in case of losing primary source of control).

The Emergencies and Rescue Services could use this system for mini-drones indoors or in hard-to-reach places where one of the hands is busy. Together with the obstacle avoidance system, this would make the drone fully autonomous, but still manageable when needed without additional equipment.

Another area of application is FPV (first-person view) drones. Here the camera on the headset could be used instead of one on the drone to recognise gestures. Because hand movement can be impressively precise, this type of control, together with hand position in space, can simplify the FPV drone control principles for new users. 

However, all these applications need a reliable and fast (really fast) recognition system. Existing gesture recognition systems can be fundamentally divided into two main categories: first – where special physical devices are used, such as smart gloves or other on-body sensors; second – visual recognition using various types of cameras. Most of those solutions need additional hardware or rely on classical computer vision techniques. Hence, that is the fast solution, but it’s pretty hard to add custom gestures or even motion ones. The answer we found is MediaPipe Hands that was used for this project.

Overall project structure

To create the proof of concept for the stated idea, a Ryze Tello quadcopter was used as a UAV. This drone has an open Python SDK, which greatly simplified the development of the program. However, it also has technical limitations that do not allow it to run gesture recognition on the drone itself (yet). For this purpose a regular PC or Mac was used. The video stream from the drone and commands to the drone are transmitted via regular WiFi, so no additional equipment was needed. 

To make the program structure as plain as possible and add the opportunity for easily adding gestures, the program architecture is modular, with a control module and a gesture recognition module. 

Scheme that shows overall project structure and how videostream data from the drone is processed

Figure 2: Scheme that shows overall project structure and how videostream data from the drone is processed

The application is divided into two main parts: gesture recognition and drone controller. Those are independent instances that can be easily modified. For example, to add new gestures or change the movement speed of the drone.

Video stream is passed to the main program, which is a simple script with module initialisation, connections, and typical for the hardware while-true cycle. Frame for the videostream is passed to the gesture recognition module. After getting the ID of the recognised gesture, it is passed to the control module, where the command is sent to the UAV. Alternatively, the user can control a drone from the keyboard in a more classical manner.

So, you can see that the gesture recognition module is divided into keypoint detection and gesture classifier. Exactly the bunch of the MediaPipe key point detector along with the custom gesture classification model distinguishes this gesture recognition system from most others.

Gesture recognition with MediaPipe

Utilizing MediaPipe Hands is a winning strategy not only in terms of speed, but also in flexibility. MediaPipe already has a simple gesture recognition calculator that can be inserted into the pipeline. However, we needed a more powerful solution with the ability to quickly change the structure and behaviour of the recognizer. To do so and classify gestures, the custom neural network was created with 4 Fully-Connected layers and 1 Softmax layer for classification.

Figure 3: Scheme that shows the structure of classification neural network

Figure 3: Scheme that shows the structure of classification neural network

This simple structure gets a vector of 2D coordinates as an input and gives the ID of the classified gesture. 

Instead of using cumbersome segmentation models with a more algorithmic recognition process, a simple neural network can easily handle such tasks. Recognising gestures by keypoints, which is a simple vector with 21 points` coordinates, takes much less data and time. What is more critical, new gestures can be easily added because model retraining tasks take much less time than the algorithmic approach.

To train the classification model, dataset with keypoints` normalised coordinates and ID of a gesture was used. The numerical characteristic of the dataset was that:

  • 3 gestures with 300+ examples (basic gestures)
  • 5 gestures with 40 -150 examples 

All data is a vector of x, y coordinates that contain small tilt and different shapes of hand during data collection.

Figure 4: Confusion matrix and classification report for classification

Figure 4: Confusion matrix and classification report for classification

We can see from the classification report that the precision of the model on the test dataset (this is 30% of all data) demonstrated almost error-free for most classes, precision > 97% for any class. Due to the simple structure of the model, excellent accuracy can be obtained with a small number of examples for training each class. After conducting several experiments, it turned out that we just needed the dataset with less than 100 new examples for good recognition of new gestures. What is more important, we don’t need to retrain the model for each motion in different illumination because MediaPipe takes over all the detection work.

Figure 5: [GIF] Test that demonstrates how fast classification network can distinguish newly trained gestures using the information from MediaPipe hand detector

Figure 5: [GIF] Test that demonstrates how fast classification network can distinguish newly trained gestures using the information from MediaPipe hand detector

From gestures to movements

To control a drone, each gesture should represent a command for a drone. Well, the most excellent part about Tello is that it has a ready-made Python API to help us do that without explicitly controlling motors hardware. We just need to set each gesture ID to a command.

Figure 6: Command-gesture pairs representation

Figure 6: Command-gesture pairs representation

Each gesture sets the speed for one of the axes; that’s why the drone’s movement is smooth, without jitter. To remove unnecessary movements due to false detection, even with such a precise model, a special buffer was created, which is saving the last N gestures. This helps to remove glitches or inconsistent recognition.

The fundamental goal of this project is to demonstrate the superiority of the keypoint-based gesture recognition approach compared to classical methods. To demonstrate all the potential of this recognition model and its flexibility, there is an ability to create the dataset on the fly … on the drone`s flight! You can create your own combinations of gestures or rewrite an existing one without collecting massive datasets or manually setting a recognition algorithm. By pressing the button and ID key, the vector of detected points is instantly saved to the overall dataset. This new dataset can be used to retrain classification network to add new gestures for the detection. For now, there is a notebook that can be run on Google Colab or locally. Retraining the network-classifier takes about 1-2 minutes on a standard CPU instance. The new binary file of the model can be used instead of the old one. It is as simple as that. But for the future, there is a plan to do retraining right on the mobile device or even on the drone.

Figure 7: Notebook for model retraining in action

Figure 7: Notebook for model retraining in action

Summary 

This project is created to make a push in the area of the gesture-controlled drones. The novelty of the approach lies in the ability to add new gestures or change old ones quickly. This is made possible thanks to MediaPipe Hands. It works incredibly fast, reliably, and ready out of the box, making gesture recognition very fast and flexible to changes. Our Neuron Lab`s team is excited about the demonstrated results and going to try other incredible solutions that MediaPipe provides. 

We will also keep track of MediaPipe updates, especially about adding more flexibility in creating custom calculators for our own models and reducing barriers to entry when creating them. Since at the moment our classifier model is outside the graph, such improvements would make it possible to quickly implement a custom calculator with our model into reality.

Another highly anticipated feature is Flutter support (especially for iOS). In the original plans, the inference and visualisation were supposed to be on a smartphone with NPUGPU utilisation, but at the moment support quality does not satisfy our requests. Flutter is a very powerful tool for rapid prototyping and concept checking. It allows us to throw and test an idea cross-platform without involving a dedicated mobile developer, so such support is highly demanded. 

Nevertheless, the development of this demo project continues with available functionality, and there are already several plans for the future. Like using the MediaPipe Holistic for face recognition and subsequent authorisation. The drone will be able to authorise the operator and give permission for gesture control. It also opens the way to personalisation. Since the classifier network is straightforward, each user will be able to customise gestures for themselves (simply by using another version of the classifier model). Depending on the authorised user, one or another saved model will be applied. Also in the plans to add the usage of Z-axis. For example, tilt the palm of your hand to control the speed of movement or height more precisely. We encourage developers to innovate responsibly in this area, and to consider responsible AI practices such as testing for unfair biases and designing with safety and privacy in mind.

We highly believe that this project will motivate even small teams to do projects in the field of ML computer vision for the UAV, and MediaPipe will help to cope with the limitations and difficulties on their way (such as scalability, cross-platform support and GPU inference).

If you want to contribute, have ideas or comments about this project, please reach out to [email protected], or visit the GitHub page of the project.

This blog post is curated by Igor Kibalchich, ML Research Product Manager at Google AI.

Video-Touch: Multi-User Remote Robot Control in Google Meet call by DNN-based Gesture Recognition

A guest post by the Engineering team at Video-Touch

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, Video-Touch.

A guest post by Video-Touch

You may have watched some science fiction movies where people could control robots with the movements of their bodies. Modern computer vision and robotics approaches allow us to make such an experience real, but no less exciting and incredible.

Inspired by the idea to make remote control and teleoperation practical and accessible during such a hard period of coronavirus, we came up with a VideoTouch project.

Video-Touch is the first robot-human interaction system that allows multi-user control via video calls application (e.g. Google Meet, Zoom, Skype) from anywhere in the world.

The Video-Touch system in action.

Figure 1: The Video-Touch system in action: single user controls a robot during a Video-Touch call. Note the sensors’ feedback when grasping a tube [source video].

We were wondering if it is even possible to control a robot remotely using only your own hands – without any additional devices like gloves or a joystick – not suffering from a significant delay. We decided to use computer vision to recognize movements in real-time and instantly pass them to the robot. Thanks to MediaPipe now it is possible.

Our system looks as follows:

  1. Video conference application gets a webcam video on the user device and sends it to the robot computer (“server”);
  2. User webcam video stream is being captured on the robot’s computer display via OBS virtual camera tool;
  3. The recognition module reads user movements and gestures with the help of MediaPipe and sends it to the next module via ZeroMQ;
  4. The robotic arm and its gripper are being controlled from Python, given the motion capture data.

Figure 2: Overall scheme of the Video-Touch system: from users webcam to the robot control module [source video].

As it clearly follows from the scheme, all the user needs to operate a robot is a stable internet connection and a video conferencing app. All the computation, such as screen capture, hand tracking, gesture recognition, and robot control, is being carried on a separate device (just another laptop) connected to the robot via Wi-Fi. Next, we describe each part of the pipeline in detail.

Video stream and screen capture

One can use any software that sends a video from one computer to another. In our experiments, we used the video conference desktop application. A user calls from its device to a computer with a display connected to the robot. Thus it can see the video stream from the user’s webcam.

Now we need some mechanism to pass the user’s video from the video conference to the Recognition module. We use Open Broadcaster Software (OBS) and its virtual camera tool to capture the open video conference window. We get a virtual camera that now has frames from the users’ webcam and its own unique device index that can be further used in the Recognition module.

Recognition module

The role of the Recognition module is to capture a users’ movements and pass them to the Robot control module. Here is where the MediaPipe comes in. We searched for the most efficient and precise computer vision software for hand motion capture. We found many exciting solutions, but MediaPipe turned out to be the only suitable tool for such a challenging task – real-time on-device fine-grained hand movement recognition.

We made two key modifications to the MediaPipe Hand Tracking module: added gesture recognition calculator and integrated ZeroMQ message passing mechanism.

At the moment of our previous publication we had two versions of the gesture recognition implementation. The first version is depicted in Figure 3 below and does all the computation inside the Hand Gesture Recognition calculator. The calculator has scaled landmarks as its input, i.e. these landmarks are normalized on the size of the hand bounding box, not on the whole image size. Next it recognizes one of 4 gestures (see also Figure 4): “move”, “angle”, “grab” and “no gesture” (“finger distance” gesture from the paper was an experimental one and was not included in the final demo) and outputs the gesture class name. Despite this version being quite robust and useful, it is based only on simple heuristic rules like: “if this landmark[i].x < landmark[j].x then it is a `move` gesture”, and is failing for some real-life cases like hand rotation.

Modified MediaPipe Hand Landmark CPU subgraph.

Figure 3: Modified MediaPipe Hand Landmark CPU subgraph. Note the HandGestureRecognition calculator

To alleviate the problem of bad generalization, we implemented the second version. We trained the Gradient Boosting classifier from scikit-learn on a manually collected and labeled dataset of 1000 keypoints: 200 per “move”, “angle” and “grab” classes, and 400 for “no gesture” class. By the way, today such a dataset could be easily obtained using the recently released Jesture AI SDK repo (note: another project of some of our team members).

We used scaled landmarks, angles between joints, and pairwise landmark distances as an input to the model to predict the gesture class. Next, we tried to pass only the scaled landmarks without any angles and distances, and it resulted in similar multi-class accuracy of 91% on a local validation set of 200 keypoints. One more point about this version of gesture classifier is that we were not able to run the scikit-learn model from C++ directly, so we implemented it in Python as a part of the Robot control module.

Figure 4: Gesture classes recognized by our model (“no gesture” class is not shown).

Figure 4: Gesture classes recognized by our model (“no gesture” class is not shown).

Right after the publication, we came up with a fully-connected neural network trained in Keras on just the same dataset as the Gradient Boosting model, and it gave an even better result of 93%. We converted this model to the TensorFlow Lite format, and now we are able to run the gesture recognition ML model right inside the Hand Gesture Recognition calculator.

Figure 5: Fully-connected network for gesture recognition converted to TFLite model format.

Figure 5: Fully-connected network for gesture recognition converted to TFLite model format.

When we get the current hand location and current gesture class, we need to pass it to the Robot control module. We do this with the help of the high-performance asynchronous messaging library ZeroMQ. To implement this in C++, we used the libzmq library and the cppzmq headers. We utilized the request-reply scenario: REP (server) in C++ code of the Recognition module and REQ (client) in Python code of the Robot control module.

So using the hand tracking module with our modifications, we are now able to pass the motion capture information to the robot in real-time.

Robot control module

A robot control module is a Python script that takes hand landmarks and gesture class as its input and outputs a robot movement command (on each frame). The script runs on a computer connected to the robot via Wi-Fi. In our experiments we used MSI laptop with the Nvidia GTX 1050 Ti GPU. We tried to run the whole system on Intel Core i7 CPU and it was also real-time with a negligible delay, thanks to the highly optimized MediaPipe compute graph implementation.

We use the 6DoF UR10 robot by Universal Robotics in our current pipeline. Since the gripper we are using is a two-finger one, we do not need a complete mapping of each landmark to the robots’ finger keypoint, but only the location of the hands’ center. Using this center coordinates and python-urx package, we are now able to change the robots’ velocity in a desired direction and orientation: on each frame, we calculate the difference between the current hand center coordinate and the one from the previous frame, which gives us a velocity change vector or angle. Finally, all this mechanism looks very similar to how one would control a robot with a joystick.

Hand-robot control logic follows the idea of a joystick with pre-defined movement directions

Figure 6: Hand-robot control logic follows the idea of a joystick with pre-defined movement directions [source video].

Tactile perception with high-density tactile sensors

Dexterous manipulation requires a high spatial resolution and high-fidelity tactile perception of objects and environments. The newest sensor arrays are well suited for robotic manipulation as they can be easily attached to any robotic end effector and adapted at any contact surface.

High fidelity tactile sensor array

Figure 7: High fidelity tactile sensor array: a) Array placement on the gripper. b) Sensor data when the gripper takes a pipette. c) Sensor data when the gripper takes a tube [source publication].

Video-Touch is embedded with a kind of high-density tactile sensor array. They are installed in the two-fingers robotic gripper. One sensor array is attached to each fingertip. A single electrode array can sense a frame area of 5.8 [cm2] with a resolution of 100 points per frame. The sensing frequency equals 120 [Hz]. The range of force detection per point is of 1-9 [N]. Thus, the robot detects the pressure applied to solid or flexible objects grasped by the robotic fingers with a resolution of 200 points (100 points per finger).

The data collected from the sensor arrays are processed and displayed to the user as dynamic finger-contact maps. The pressure sensor arrays allow the user to perceive the grasped object’s physical properties such as compliance, hardness, roughness, shape, and orientation.

Multi-user robotic arm control feature.

Figure 8: Multi-user robotic arm control feature. The users are able to perform a COVID-19 test during a regular video call [source video].

Endnotes

Thus by using MediaPipe and a robot we built an effective, multi-user robot teleoperation system. Potential future uses of teleoperation systems include medical testing and experiments in difficult-to-access environments like outer space. Multi-user functionality of the system addresses an actual problem of effective remote collaboration, allowing to work on projects which need manual remote control in a group of several people.

Another nice feature of our pipeline is that one could control the robot using any device with a camera, e.g. a mobile phone. One also could operate another hardware form factor, such as edge devices, mobile robots, or drones instead of a robotic arm. Of course, the current solution has some limitations: latency, the utilization of z-coordinate (depth), and the convenience of the gesture types could be improved. We can’t wait for the updates from the MediaPipe team to try them out, and looking forward to trying new types of the gripper (with fingers), two-hand control, or even a whole-body control (hello, “Real Steel”!).

We hope the post was useful for you and your work. Keep coding and stay healthy. Thank you very much for your attention!

This blog post is curated by Igor Kibalchich, ML Research Product Manager at Google AI

Bringing artworks to life with AR

Posted by Richard Adem, UX Engineer at Google Arts & Culture

What is Art Filter?

One of the best ways to learn about global culture is by trying on famous art pieces using Google’s Augmented Reality technology on your mobile device. What does it feel like to wear a three thousand year old necklace, put on a sixteenth century Japanese helmet or don pearl earrings and pose in a Vermeer?

Google Arts & Culture have created a new feature called Art Filter allowing everyone to learn about culturally significant art pieces from around the world and put themselves inside famous paintings, normally safely displayed in a museum.

We teamed up with the MediaPipe team, which offers cross-platform, customizable ML solutions to combine ML with rendering to generate stunning visuals.

Working closely with the MediaPipe team to utilize their face mesh and 3D face transform allowed us to create custom effects for each of the artifacts we had chosen, and easily display them on as part of the Google Arts & Culture iOS and Android app.

gif of the art filter feature

Figure 1. The Art Filter feature.

The Challenges

We selected five iconic cultural treasures from around the world:

Given their diverse formats or textures each artwork or object required special approaches to bring it to life in AR.

gif of user wearing jewelry on art filter feature

Figure 2. User wearing the jewelry from Johannes Vermeer’s “Girl with a Pearl Earring” – Mauritshuis museum, Hague.

Creating 3D objects that can be viewed from all sides, using 2D references.

Some of the artwork we selected are 2D paintings and we wanted everyone to immerse themselves in the paintings. Our team of 3D artists and designers took high resolution gigapixel images from Google Arts & Culture and projected them onto 3D meshes to texture them. We also extended the 2D textures all the way around the 3D meshes while maintaining the style of the original artist. This means that when you turn your head the previously hidden parts of the piece are viewable from every angle, mimicking how the object would look in real-life.

Gif of the Van Gogh Self-Portrait filter

Figure 3. The Van Gogh Self-Portrait filter – Musée d’Orsay, Paris.

Our cultural partners were immensely helpful during the creation of Art Filter. They have sourced a huge amount of reference images allowing us to reproduce the pieces accurately using photographs from different angles, to help them appear to fit into the “real world” in AR (using size comparisons).

Layering elements of the effect along with the image of the user.

Art Filter takes an image of the user from their device’s camera and uses that to generate a 3D mesh of the user’s face. All processing of user images or video feeds is run entirely on device. We do not use this feature to identify or collect any personal biometric data; the feature cannot be used to identify an individual.

The image is then reused to texture the face mesh, generated in real-time on-device with MediaPipe Face Mesh, representing it in the virtual 3D world within the device. We then add virtual 2D and 3D layers around the face to complete the effect. The Tengu Helmet, for example, sits on top of the face mesh in 3D and is “attached” to the face mesh so it moves around when the user moves their head around. The Vermeer earrings with a headscarf and Frida Kahlo’s necklace are attached to the user’s image in a similar way. The Van Gogh effect works slightly differently since we still use a mesh of the user’s face but this time we apply a texture from the painting.

We use 2D elements to complete the scene as well, such as the backgrounds in the Kahlo and Van Gogh paintings. These were created by carefully separating the painting subjects from the background then placing them behind the user in 3D. You may notice that Van Gogh’s body is also 2D, shown as a “billboard” so that it always faces the camera.

Figure 4. Creating the 3D mesh showing layers and masks.

Figure 4. Creating the 3D mesh showing layers and masks.

Using shaders for different materials such as the metal helmet.

To create a realistic looking material we used “Physically Based” Rendering shaders. You can see this on the Tengu helmet, it has a bumpy surface that is affected by the real life light captured by the device. This requires creating extra textures, texture maps, for the effect that uses colors to represent how bumpy or shiny the 3D object should appear. Texture maps look like bright pink and blue images but tell the renderer about tiny details on the surface of the object without creating any extra polygons, which can slow down the frame rate of the feature.

Figure 5. User wearing Helmet with Tengu Mask and Crows - The Metropolitan Museum of Art

Figure 5. User wearing Helmet with Tengu Mask and Crows – The Metropolitan Museum of Art.

Conclusion

We hope you enjoy the collection we have created in Art Filter. Please visit and try for yourself! You can also explore more amazing ML features with Google Arts & Culture such as Art Selfie and Art Transfer.

We hope to bring many more filters to the feature and are looking forward to new features from MediaPipe.

Control your Mirru prosthesis with MediaPipe hand tracking


Guest post by the Engineering teams at Mirru and Tweag

What is the Mirru App?

Mirru App logo

Mirru is a free and open source Android app under development with which one can control robotic prosthetic hands via hand tracking. With our app, a user can instantly mirror grips from their sound hand onto a robotic one, which could be 3D-printed and self-assembled at low cost. With Mirru, we want to provide a cheap, intuitive and open end-to-end alternative to existing, costly, cumbersome and proprietary technology.

A demonstration of using MediaPipe hand tracking to move a robotic hand’s fingers with the Mirru app.

Figure 1: A demonstration of using MediaPipe hand tracking to move a robotic hand’s fingers with the Mirru app.

The Mirru team is a collaboration between Violeta López and Vladimir Hermand, two independent designers and technologists currently based in Paris. To kickstart the project, the team took part in Tweag’s Open Source Fellowship program which provided funding, mentorship and data engineering expertise from one of their engineers, Dorran Howell. The fellowship helped get Mirru launched from the ground-up.

Our goal for the 3-month fellowship was to develop an initial version of the Android app that can control any bluetooth open source hand using computer vision techniques, and make the app available for free on the Google Play store so anyone can print their own hand, assemble it, and download the app. With the help of MediaPipe, we were able to quickly prototype our app without having to build our own machine learning model, as we didn’t have the resources or training data to do so.

Why use hand tracking?

Using your phone and a front-facing camera with hand tracking opens up a new, affordable, accessible, and versatile way to control prosthetics.

Let’s say I’m a left hand amputee who owns a robotic prosthesis. Every day, I need my prosthetic hand to actuate a lot of different grip patterns. For example, I need to use a pinch or tripod grip to pick up small objects, or a fist grip to pick up objects like a fruit or a cup. I change and execute these grip patterns via myoelectric muscle sensors that allow me to, for example, open and close a grip by flexing and unflexing my upper limb muscles. These myoelectric muscle sensors are the main interface between my body and the prosthesis.

However, living with them is not as easy as it seems. Controlling the myoelectric sensors can take a lot of time to get used to, and many never do. It can also be quite expensive to get these sensors fitted by a prosthetist, especially for people in developing countries or anyone without health insurance. Finally, the number of grips on many devices currently on the market is limited to less than ten, and only few models come with ways to create custom grips, which are often cumbersome.

Mirru provides an alternative interface. Using just their phone, a tool many have access to, a user can digitally mirror their sound hand in real-time and communicate with their prosthesis in an intuitive way. This removes the expensive need to be fitted by a prosthetist and enables the user to quickly program an unlimited amount of grips. For now, Mirru stays away from electromyography altogether as reliable muscle sensors are expensive. The programmed grips therefore need to be triggered via the android phone, which is why this first version of our app is more suited for activities like sweeping, holding a book while reading it, or holding a cup or shopping bag. In the future we hope to combine myoelectric sensors with hand tracking to get the benefits of both.

Programming a grip with the Mirru app looks like the following: Let’s say that I want to grab an object with my robotic hand. I bring my prosthesis near the object and I then form the desired grip with my sound hand in front of my android phone and Mirru mirrors it in real-time to the prosthesis. I then lock my prosthesis into this new grip and free up my vsound hand. Finally I might save this grip for later use and add it to my library of grips.

A user tester using hand-tracking on their phone to program their prosthesis’s grip to pick up a measuring tape and measure with the other hand.

Figure 2: A user tester using hand-tracking on their phone to program their prosthesis’s grip to pick up a measuring tape and measure with the other hand.

The Brunel Hand and the Mirru Arduino Sketch

In order to accomplish our goal of allowing as many people as possible to print, assemble, and control their own hand, we designed the Mirru android app to be compatible with any robotic hand that is controlled by a bluetooth-enabled Arduino board and servo motors.

For our project, we printed and assembled an open source robotic hand called the Brunel Hand made by Open Bionics. First, we 3D printed the Brunel Hand’s 3D printable files that are made available under the CC Attribution-Sharealike 4.0 International License. We then bought the necessary servos, springs, and screws to assemble the hand. In combination with printing and buying the servos, the hand costs around €500 to purchase and assemble.

The Brunel Hand comes with myoelectric-based firmware and a PCB board developed by Open Bionics, but since the hand is in essence just 4 servo motors, any microcontroller could be used. We ended up using an Adafruit ESP32 feather board for its bluetooth capabilities and created an Arduino sketch that can be downloaded, customized, and uploaded for anyone who is printing and assembling their own hand. They could then download the Mirru app to use as the control-interface for their printed hand.

Hand-tracking with MediaPipe

There are many computer vision solutions available for hand tracking that could be used for this project, but we needed a fast, open source solution that didn’t require us to train our own model, and that could be used reliably on a portable device such as a phone.

MediaPipe provides great out of the box support for hand tracking, and since we didn’t have the training data or resources available to create a model from scratch, it was perfect for our team. We were able to build the Android example apps easily and were excited to find that the performance was promising. Even better, no tweaking on the ready-made hand tracking model or the graphs was necessary, as the hand landmark model provided all the necessary outputs for our prototype.

When testing the prosthesis on real users, we were happy to hear that many of them were impressed with how fast the app was able to translate their movements, and that nothing else exists on the market that allows you to make custom grips as fast and on-the-fly.

A user tester demonstrates how quickly the MediaPipe hand-tracking can translate her moving fingers to the movement of her prosthesis’s fingers.

Figure 3: A user tester demonstrates how quickly the MediaPipe hand-tracking can translate her moving fingers to the movement of her prosthesis’s fingers.

Translating 3D MediaPipe points into inputs for Robotics

To achieve the goals of the Mirru app, we need to use hand tracking to independently control each finger of the Brunel Hand in real-time. In the Brunel Hand, the index, middle, and ring fingers are actuated using servos that move at an angle from 0 to 180 degrees; 0 means the finger is fully upright and 180 means the finger is fully flexed down. As we lacked adequate training data to create a model from scratch that could calculate these servo angles for us, we opted to use a heuristic to relate the default hand tracking landmark outputs to the inputs required by our hardware for an initial version of our prototype.

In the lab testing the translation of the outputs to inputs with the app and the prototype.

Figure 4: In the lab testing the translation of the outputs to inputs with the app and the prototype.

We were initially unsure whether the estimated depth (Z) coordinate in the 3D landmarks would be accurate enough for the translation of inputs or if we would be limited to working in 2D. As an initial step, we recorded an example dataset and spun up a visualization of the points in a Jupyter Notebook with Plotly. We were immediately impressed by the quality and accuracy of the coordinates, considering that the technology only uses a single camera without any depth sensors. As noted in the MediaPipe documentation, The Z coordinates have a slightly different scale than the X/Y coordinates, but this didn’t seem to pose a significant challenge for our prototype.

A data visualization of the hand made up of 21 3D hand landmarks provided by MediaPipe.

Figure 5: A data visualization of the hand made up of 21 3D hand landmarks provided by MediaPipe.

Given the accuracy of the 3D landmarks, we opted to use a calculation in 3D for relating landmark outputs to the inputs required by the prosthesis. In our approach, we calculate the acute angles of the fingers in relation to the palm by calculating the angle between the finger direction and the normal of the plane defined by the palm. An angle of 0° corresponds to maximum closure of the finger, and an angle of 180° indicates a fully extended finger. We were able to calculate the finger direction by calculating the vector from the landmark at the base of the fingers to the landmark on the tip of the fingers.

Diagram showing the 3D landmarks and which ones we used to calculate the finger direction vector, the palm normal, and the angle that both form.

Figure 6: Diagram showing the 3D landmarks and which ones we used to calculate the finger direction vector, the palm normal, and the angle that both form.

We calculate the palm normal by selecting three points in the plane of the palm. Using Landmark 0 as the reference point, we calculate the vectors for side 1 and side 2, and compute the cross product of those vectors to give us the palm normal. Finally, we compute the angle of the finger direction and the palm normal. This returns an angle in radians that we use to calculate degrees.

We had to do some extra processing to match the degrees of freedom for the thumb on our prosthetic hand. The thumb moves in more complex ways than the rest of the fingers. In order to get our app to work with the thumb, we did similar calculations for thumb direction and the palm normal, but we used different landmarks.

Once we do the calculation of the servo angles on the android phone, we send those values via bluetooth to the Arduino board, and the Arduino board moves the servos to the proper position. Due to some noise in the model outputs, we add a smoothing step to the pipeline, which is important so that the movements of the robotic fingers aren’t too jittery for precise grips.

A user tester makes a pinch grip on her prosthesis with the Mirru app.

Figure 7: A user tester makes a pinch grip on her prosthesis with the Mirru app.

Summary

The Mirru app and Mirru Arduino Sketch are designed to allow anyone to control an open source prosthesis with their sound hand and an Android phone. This is a novel and frugal alternative to muscle sensors, and MediaPipe has proven that it is the right tool for the essential hand tracking component of the full application. The Mirru team was able to get started quickly with MediaPipe’s out of the box solutions without having to gather any training data or having to design a model from scratch. The speed of the real-time translation from hand tracking points to the robotic hand has especially excited all of our users in our testing sessions and opens up many possibilities for the future of prostheses.

We see exciting potential for combining the MediaPipe hand tracking features with existing myoelectric prostheses which could open powerful and advanced ways to create and save custom prosthesis grips in real-time. Also, with the help of MediaPipe, we have been able to provide an open source alternative to proprietary prostheses without the need for myoelectric sensors or a visit to a prosthetist, at a cost that is much lower than what is already on the market, and whose source code can be customized and built-upon by other developers. Our team is excited to see what other ideas the open source community might come up with, and to see what hand tracking can bring to users and manufacturers of prostheses.

As for the current state of the Mirru application, we have yet to implement the possibility of recording and saving moving gestures that are longer sequences compared to the static grip positions. For example, imagine being able to record the movement of the fingers to play a bass line on a piano, like a loopable animated gif. There is a realm of possibilities for prostheses that is waiting to be explored, and we’re really happy that MediaPipe gives us access to it.

We are looking for contributors. If you have ideas or comments about this application, please reach out to [email protected], or visit our GitHub.

This blog post is curated by Igor Kibalchich, ML Research Product Manager at Google AI.

SignAll SDK: Sign language interface using MediaPipe is now available for developers

A guest post by the Engineering team at SignAll | Twitter handle | MediaPipe team

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, SignAll.

SignAll SDK: Sign language interface using MediaPipe is now available for developers

When Google published the first versions of its on-device hand tracking technology in MediaPipe, the work could serve as a basis for developers to build sign language recognition solutions into their own apps. Later updates to this hand tracking solution have further improved its accuracy where other technologies have fallen short (Figure 1).

Illustrating MediaPipe’s improvement over time: hand skeleton tracking output of an older version (2020.02.10) and the latest version (2020.12.16).

Figure 1. Illustrating MediaPipe’s improvement over time: hand skeleton tracking output of an older version (2020.02.10) and the latest version (2020.12.16). This handshape is used in sign language frequently, but often missed because of the lack of representation in training datasets.
Watch the full video

SignAll is a startup working on sign language translation technology. Its mission is to make sign language interpretation universally available, both through communication between Deaf and hearing parties, and between Deaf individuals and computers. The SignAll’s products, which are used nationwide in the US for both communications and education, employ a complex multi-camera setup and gloves with colored markers. While sign languages’ complexity goes far beyond handshapes (facial features, body, grammar, etc.), it is true that the accurate tracking of the hands has been a huge obstacle in the first layer of processing – computer vision. MediaPipe unlocked the possibility to offer SignAll’s solutions not only glove free, but also by using a single camera. SignAll has just announced the availability of the first SDK of its kind, so developers can now enable sign language input in their apps.

The company recently published an interactive educational app in the App Store, that lets the user practice signing with immediate feedback, this app also serves as a demonstration of the possibilities with the SDK.

SignAll with MediaPipe Hands

Our system uses several layers for sign recognition, and each one uses more and more abstract data. The low-level layer extracts crucial hand, body, and face data from 2D and 3D cameras. In our first implementation, this layer detects the colors of the gloves and creates 3D hand data. Replacing this with MediaPipe Hands (supplemented by the MediaPipe Pose and MediaPipe Face Mesh) has been a game changer for using our system without gloves or special lighting.

Demo of our SignAll SDK developed using MediaPipe asking how are you
Demo greeting of our SignAll SDK developed using MediaPipe

Figure 2. Demo of our SignAll SDK developed using MediaPipe

As mentioned earlier, we use multiple cameras with depth sensors which are calibrated in the real world. This allows for a more accurate 3D world space than that of a local camera or tensor spaces, but requires the use of hand landmark detection for each camera. The cameras are placed distinctly to each other in position and orientation so the hands are visible more frequently, as one hand might cover the other from one camera but not necessarily from the others.

Corrected 3D hand shape based on three cameras. Due to the unique orientation, the detection from the front camera is wrong, but the side camera can correct the result.

Figure 3. Corrected 3D hand shape based on three cameras. Due to the unique orientation, the detection from the front camera is wrong, but the side camera can correct the result.

The next step is to filter and smooth the data to replicate the precise measurements offered by our colored glove markers. Although SignAll’s markers are different from the landmarks given by MediaPipe, we used our hand model to generate colored markers from landmarks. Therefore, the new mocap data is fully compatible with the previous one.

Although we are focused mainly on the hands, we also integrated MediaPipe Pose and MediaPipe Face Mesh. The pose landmarks provide accurate hand position information, even when touching or close to each other.

While the two versions of mocap are compatible, the nature of the artifacts is different: direct measure of each marker and simulated markers from a globally detected hand. Due to this discrepancy, we had to refine the parameters on higher levels. On the other hand, we could still use our huge sign database for the gloveless configuration. By replacing the low-level data and refining our higher-level data, we could test our system without gloves. Going gloveless can be a huge step for using our sign recognition technology easily worldwide.

Demonstration of compatible mocap from different low-level trackings.

Figure 4. Demonstration of compatible mocap from different low-level trackings. The right side is without gloves; the left side is with gloves. This compatibility enables the usage of SignAll’s meticulously labeled dataset of 300,000+ sign language videos to be used for the training of recognition models based on different low-level data. Watch the full video

The SignAll system using MediaPipe framework

After integrating MediaPipe Hands into our system, we also wanted to take advantage of the customization and scaling opportunities provided by the MediaPipe framework on multiple platforms. This allowed us to not only prototype our research state methods in Python, but also deliver our end-user solutions for Windows, iOS, Android, and even the Web. Thanks to the similarities between our module graph system and the calculator graph of MediaPipe, our existing processing units can be reused in this new framework with minor modifications. With that said, the extended platform set also comes with other challenges, like using only a single 2D camera in most cases instead of a calibrated multi-camera system.

The models, algorithms and techniques we have used were mostly developed to work on our mocap data interpreted in the 3D global world. The data extracted from a single-camera setup, of course, cannot be as detailed. That is the reason we had to make some adjustments to our implementations, fine-tune the algorithms and add some extra logic (e.g., dynamically adapting to the changes of space resulted by the hand-held camera use-case). Luckily, the MediaPipe framework enables us to implement the core processing units in C++, so we can still benefit from the runtime-optimized core solutions we previously developed.

Some higher-level models trained on 3D data also needed to be re-trained in order to perform better on the data originating from a single 2D source. The MediaPipe landmarks are defined by 3D coordinates, which makes it possible to reuse the existing training methods and concepts. On the other hand, the 2D information is more directly extracted and therefore more stable than the third coordinate, which was taken into consideration while designing the training modifications.

Luckily, it is not necessary to make an entirely new data recording for this purpose. We can still use our huge video database annotated in great detail. The preprocessed mocap data we can extract from our recordings and interpreted in the 3D world can be used to simulate hand, skeleton, or face landmark detections in any virtual camera view.

Among the data on virtual camera views, we also use traditional 2D recordings in sufficient proportion to cover the unique noise characteristics of the landmark detections. Thanks to having most of these types of data in advance, we can focus on the most exciting part – trying the latest techniques and training new models.

Summary

The advancement made possible by MediaPipe enabled SignAll to change its model. In addition to offering all-in-one products for sign language education and translation, SignAll is now starting to offer an SDK for developers. The capabilities of this SDK depend on the type of the camera or cameras being used and the available computational capacity. The possible functions enabled by using the SDK vary from launching video calls by signing the contact’s name (watch a demo here), adding addresses into navigation by signing (as a counterpart to speech input), or ordering food on a fast-food restaurant’s kiosk or drive-thru. With its mission to make sign language an alternative everywhere that voice can be used, SignAll is excited to see more and more apps implementing this feature.

We are eager to try future updates of MediaPipe, which could bring us closer to our ultimate goal of having our solutions available for everyone on any device. The most awaited update is the ability to build custom MediaPipe graphs and add our own calculators for web-based solutions aided by the WebAssembly technology, so websites will be able to use a new level of accessibility features for Deaf visitors.

MediaPipe 3D Face Transform

Posted by Kanstantsin Sokal, Software Engineer, MediaPipe team

Earlier this year, the MediaPipe Team released the Face Mesh solution, which estimates the approximate 3D face shape via 468 landmarks in real-time on mobile devices. In this blog, we introduce a new face transform estimation module that establishes a researcher- and developer-friendly semantic API useful for determining the 3D face pose and attaching virtual objects (like glasses, hats or masks) to a face.

The new module establishes a metric 3D space and uses the landmark screen positions to estimate common 3D face primitives, including a face pose transformation matrix and a triangular face mesh. Under the hood, a lightweight statistical analysis method called Procrustes Analysis is employed to drive a robust, performant and portable logic. The analysis runs on CPU and has a minimal speed/memory footprint on top of the original Face Mesh solution.

MediaPipe image

Figure 1: An example of virtual mask and glasses effects, based on the MediaPipe Face Mesh solution.

Introduction

The MediaPipe Face Landmark Model performs a single-camera face landmark detection in the screen coordinate space: the X- and Y- coordinates are normalized screen coordinates, while the Z coordinate is relative and is scaled as the X coordinate under the weak perspective projection camera model. While this format is well-suited for some applications, it does not directly enable crucial features like aligning a virtual 3D object with a detected face.

The newly introduced module moves away from the screen coordinate space towards a metric 3D space and provides the necessary primitives to handle a detected face as a regular 3D object. By design, you’ll be able to use a perspective camera to project the final 3D scene back into the screen coordinate space with a guarantee that the face landmark positions are not changed.

Metric 3D Space

The Metric 3D space established within the new module is a right-handed orthonormal metric 3D coordinate space. Within the space, there is a virtual perspective camera located at the space origin and pointed in the negative direction of the Z-axis. It is assumed that the input camera frames are observed by exactly this virtual camera and therefore its parameters are later used to convert the screen landmark coordinates back into the Metric 3D space. The virtual camera parameters can be set freely, however for better results it is advised to set them as close to the real physical camera parameters as possible.

MediaPipe image

Figure 2: A visualization of multiple key elements in the metric 3D space. Created in Cinema 4D

Canonical Face Model

The Canonical Face Model is a static 3D model of a human face, which follows the 3D face landmark topology of the MediaPipe Face Landmark Model. The model bears two important functions:

  • Defines metric units: the scale of the canonical face model defines the metric units of the Metric 3D space. A metric unit used by the default canonical face model is a centimeter;
  • Bridges static and runtime spaces: the face pose transformation matrix is – in fact – a linear map from the canonical face model into the runtime face landmark set estimated on each frame. This way, virtual 3D assets modeled around the canonical face model can be aligned with a tracked face by applying the face pose transformation matrix to them.

Face Transform Estimation

The face transform estimation pipeline is a key component, responsible for estimating face transform data within the Metric 3D space. On each frame, the following steps are executed in the given order:

  • Face landmark screen coordinates are converted into the Metric 3D space coordinates;
  • Face pose transformation matrix is estimated as a rigid linear mapping from the canonical face metric landmark set into the runtime face metric landmark set in a way that minimizes a difference between the two;
  • A face mesh is created using the runtime face metric landmarks as the vertex positions (XYZ), while both the vertex texture coordinates (UV) and the triangular topology are inherited from the canonical face model.

Effect Renderer

The Effect Renderer is a component, which serves as a working example of a face effect renderer. It targets the OpenGL ES 2.0 API to enable a real-time performance on mobile devices and supports the following rendering modes:

  • 3D object rendering mode: a virtual object is aligned with a detected face to emulate an object attached to the face (example: glasses);
  • Face mesh rendering mode: a texture is stretched on top of the face mesh surface to emulate a face painting technique.

In both rendering modes, the face mesh is first rendered as an occluder straight into the depth buffer. This step helps to create a more believable effect via hiding invisible elements behind the face surface.

MediaPipe image

Figure 3: An example of face effects rendered by the Face Effect Renderer.

Using Face Transform Module

The face transform estimation module is available as a part of the MediaPipe Face Mesh solution. It comes with face effect application examples, available as graphs and mobile apps on Android or iOS. If you wish to go beyond examples, the module contains generic calculators and subgraphs – those can be flexibly applied to solve specific use cases in any MediaPipe graph. For more information, please visit our documentation.

Follow MediaPipe

We look forward to publishing more blog posts related to new MediaPipe pipeline examples and features. Please follow the MediaPipe label on Google Developers Blog and Google Developers twitter account (@googledevs).

Acknowledgements

We would like to thank Chuo-Ling Chang, Ming Guang Yong, Jiuqiang Tang, Gregory Karpiak, Siarhei Kazakou, Matsvei Zhdanovich and Matthias Grundman for contributing to this blog post.

Instant Motion Tracking with MediaPipe

Posted by Vikram Sharma, Software Engineering Intern; Jianing Wei, Staff Software Engineer; Tyler Mullen, Senior Software Engineer

Augmented Reality (AR) technology creates fun, engaging, and immersive user experiences. The ability to perform AR tracking across devices and platforms, without initialization, remains important for powering AR applications at scale.

Today, we are excited to release the Instant Motion Tracking solution in MediaPipe. It is built upon the MediaPipe Box Tracking solution we released previously. With Instant Motion Tracking, you can easily place fun virtual 2D and 3D content on static or moving surfaces, allowing them to seamlessly interact with the real world. This technology also powered MotionStills AR. Along with the library, we are releasing an open source Android application to showcase its capabilities. In this application, a user simply taps the camera viewfinder in order to place virtual 3D objects and GIF animations, augmenting the real-world environment.

gif of instant motion tracking in MediaPipe gif of instant motion tracking in MediaPipe

Instant Motion Tracking in MediaPipe

Instant Motion Tracking

The Instant Motion Tracking solution provides the capability to seamlessly place virtual content on static or motion surfaces in the real world. To achieve that, we provide the six degrees of freedom tracking with relative scale in the form of rotation and translation matrices. This tracking information is then used in the rendering system to overlay virtual content on camera streams to create immersive AR experiences.

The core concept behind Instant Motion Tracking is to decouple the camera’s translation and rotation estimation, treating them instead as independent optimization problems. This approach enables AR tracking across devices and platforms without initialization or calibration. We do this by first finding the 3D camera translation using only the visual signals from the camera. This involves estimating the target region’s apparent 2D translation and relative scale across frames. The process can be illustrated with a simple pinhole camera model, relating translation and scale of an object in the image plane to the final 3D translation.

image

By finding the change in relative size of our tracked region from view position V1 to V2, we can estimate the relative change in distance from the camera.

Next, we obtain the device’s 3D rotation from its built-in IMU (Inertial Measurement Unit) sensor. By combining this translation and rotation data, we can track a target region with six degrees of freedom at relative scale. This information allows for the placement of virtual content on any system with a camera and IMU functionality, and is calibration free. For more details on Instant Motion Tracking, please refer to our paper.

A MediaPipe Pipeline for Instant Motion Tracking

A diagram of Instant Motion Tracking pipeline is shown below, consisting of four major components: a Sticker Manager module, a Region Tracking module, a Matrices Manager module, and lastly a Rendering System. Each of the components consists of MediaPipe calculators or subgraphs.

Diagram

Diagram of Instant Motion Tracking Pipeline

The Sticker Manager accepts sticker data from the application and produces initial anchors (tracked region information) based on user taps, and user gesture controls for every sticker object. Initial anchors are then sent to our Region Tracking module to generate tracked anchors. The Matrices Manager combines this data with our device’s rotation matrix to produce six degrees-of-freedom poses as model matrices. After integrating any user-specified transforms like asset scaling, our final poses are forwarded to the Rendering System to render all virtual objects overlaid on the camera frame to produce the output AR frame.

Using the Instant Motion Tracking Solution

The Instant Motion Tracking solution is easy to use by leveraging the MediaPipe cross-platform framework. With camera frames, device rotation matrix, and anchor positions (screen coordinates) as input, the MediaPipe graph produces AR renderings for each frame, providing engaging experiences. If you wish to integrate this Instant Motion Tracking library with your system or application, please visit our documentation to build your own AR experiences on any device with IMU functionality and a camera sensor.

Augmenting The World with 3D Stickers and GIFs

Instant Motion Tracking solution allows bringing both 3D stickers and GIF animations into Augmented Reality experiences. GIFs are rendered on flat 3D billboards placed in the world, introducing fun and immersive experiences with animated content blended into the real environment.Try it for yourself!

Demonstration of GIF placement in 3D Demonstration of GIF placement in 3D

Demonstration of GIF placement in 3D

MediaPipe Instant Motion Tracking is already helping PixelShift.AI, a startup applying cutting-edge vision technologies to facilitate video content creation, to track virtual characters seamlessly in the view-finder for a realistic experience. Building upon Instant Motion Tracking’s high-quality pose estimation, PixelShift.AI enables VTubers to create mixed reality experiences with web technologies. The product is going to be released to the broader VTuber community later this year.

Instant

Instant Motion Tracking helps PixelShift.AI create mixed reality experiences

Follow MediaPipe

We look forward to publishing more blog posts related to new MediaPipe pipeline examples and features. Please follow the MediaPipe label on Google Developers Blog and Google Developers twitter account (@googledevs).

Acknowledgement

We would like to thank Vikram Sharma, Jianing Wei, Tyler Mullen, Chuo-Ling Chang, Ming Guang Yong, Jiuqiang Tang, Siarhei Kazakou, Genzhi Ye, Camillo Lugaresi, Buck Bourdon, and Matthias Grundman for their contributions to this release.

Alfred Camera: Smart camera features using MediaPipe

Guest post by the Engineering team at Alfred Camera

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, Alfred Camera.

In this article, we’d like to give you a short overview of Alfred Camera and our experience of using MediaPipe to transform our moving object feature, and how MediaPipe has helped to get things easier to achieve our goals.

What is Alfred Camera?

AlfredCamera logo

Fig.1 Alfred Camera Logo

Alfred Camera is a smart home app for both Android and iOS devices, with over 15 million downloads worldwide. By downloading the app, users are able to turn their spare phones into security cameras and monitors directly, which allows them to watch their homes, shops, pets anytime. The mission of Alfred Camera is to provide affordable home security so that everyone can find peace of mind in this busy world.

The Alfred Camera team is composed of professionals in various fields, including an engineering team with several machine learning and computer vision experts. Our aim is to integrate AI technology into devices that are accessible to everyone.

Machine Learning in Alfred Camera

Alfred Camera currently has a feature called Moving Object Detection, which continuously uses the device’s camera to monitor a target scene. Once it identifies a moving object in the area, the app will begin recording the video and send notifications to the device owner. The machine learning models for detection are hand-crafted and trained by our team using TensorFlow, and run on TensorFlow Lite with good performance even on mid-tier devices. This is important because the app is leveraging old phones and we’d like the feature to reach as many users as possible.

The Challenges

We had started building our AI features at Alfred Camera since 2017. In order to have a solid foundation to support our AI feature requirements for the coming years, we decided to rebuild our real-time video analysis pipeline. At the beginning of the project, the goals were to create a new pipeline which should be 1) modular enough so we could swap core algorithms easily with minimal changes in other parts of the pipeline, 2) having GPU acceleration designed in place, 3) cross-platform as much as possible so there’s no need to create/maintain separate implementations for different platforms. Based on the goals, we had surveyed several open source projects that had the potential but we ended up using none of them as they either fell short on the features or were not providing the readiness/stabilities that we were looking for.

We started a small team to prototype on those goals first for the Android platform. What came later were some tough challenges way above what we originally anticipated. We ran into several major design changes as some key design basics were overlooked. We needed to implement some utilities to do things that sounded trivial but required significant effort to make it right and fast. Dealing with asynchronous processing also led us into a bunch of timing issues, which took the team quite some effort to address. Not to mention debugging on real devices was extremely inefficient and painful.

Things didn’t just stop here. Our product is also on iOS and we had to tackle these challenges once again. Moreover, discrepancies in the behavior between the platform-specific implementations introduced additional issues that we needed to resolve.

Even though we finally managed to get the implementations to the confidence level we wanted, that was not a very pleasant experience and we have never stopped thinking if there is a better option.

MediaPipe – A Game Changer

Google open sourced MediaPipe project in June 2019 and it immediately caught our attention. We were surprised by how it is perfectly aligned with the previous goals we set, and has functionalities that could not have been developed with the amount of engineering resources we had as a small company.

We immediately decided to start an evaluation project by building a new product feature directly using MediaPipe to see if it could live up to all the promises.

Migrating to MediaPipe

To start the evaluation, we decided to migrate our existing moving object feature to see what exactly MediaPipe can do.

Our current Moving Object Detection pipeline consists of the following main components:

  • (Moving) Object Detection Model
    As explained earlier, a TensorFlow Lite model trained by our team, tailored to run on mid-tier devices.
  • Low-light Detection and Low-light Filter
    Calculate the average luminance of the scene, and based on the result conditionally process the incoming frames to intensify the brightness of the pixels to let our users see things in the dark. We are also controlling whether we should run the detection or not as the moving object detection model does not work properly when the frame has been processed by the filter.
  • Motion Detection
    Sending frames through Moving Object Detection still consumes a significant amount of power even with a small model like the one we created. Running inferences continuously does not seem to be a good idea as most of the time there may not be any moving object in front of the camera. We decided to implement a gating mechanism where the frames are only being sent to the Moving Object Detection model based on the movements detected from the scene. The detection is done mainly by calculating the differences between two frames with some additional tricks that take the movements detected in a few frames before into consideration.
  • Area of Interest
    This is a mechanism to let users manually mask out the area where they do not want the camera to see. It can also be done automatically based on regional luminance that can be generated by the aforementioned low-light detection component.

Our current implementation has taken GPU into consideration as much as we can. A series of shaders are created to perform the tasks above and the pipeline is designed to avoid moving pixels between CPU/GPU frequently to eliminate the potential performance hits.

The pipeline involves multiple ML models that are conditionally executed, mixed CPU/GPU processing, etc. All the challenges here make it a perfect showcase for how MediaPipe could help develop a complicated pipeline.

Playing with MediaPipe

MediaPipe provides a lot of code samples for any developer to bootstrap with. We took the Object Detection on Android sample that comes with the project to start with because of the similarity with the back-end part of our pipeline. It did take us sometimes to fully understand the design concepts of MediaPipe and all the tools associated. But with the complete documentation and the great responsiveness from the MediaPipe team, we got up to speed soon to do most of the things we wanted.

That being said, there were a few challenges we needed to overcome on the road to full migration. Our original pipeline of Moving Object Detection takes the input frame asynchronously, but MediaPipe has timestamp bound limitations such that we cannot just show the result in an allochronic way. Meanwhile, we need to gather data through JNI in a specific data format. We came up with a workaround that conquered all the issues under the circumstances, which will be mentioned later.

After wrapping our models and the processing logics into calculators and wired them up, we have successfully transformed our existing implementation and created our first MediaPipe Moving Object Detection pipeline like the figure below, running on Android devices:

Fig.2 Moving Object Detection Graph

Fig.2 Moving Object Detection Graph

We do not block the video frame in the main calculation loop, and set the detection result as an input stream to show the annotation on the screen. The whole graph is designed as a multi-functioned process, the left chunk is the debug annotation and video frame output module, and the rest of the calculation occurs in the rest of the graph, e.g., low light detection, motion triggered detection, cropping of the area of interest and the detection process. In this way, the graph process will naturally separate into real-time display and asynchronous calculation.

As a result, we are able to complete a full processing for detection in under 40ms on a device with Snapdragon 660 chipset. MediaPipe’s tight integration with TensorFlow Lite provides us the flexibility to get even more performance gain by leveraging whatever acceleration techniques available (GPU or DSP) on the device.

The following figure shows the current implementation working in action:

Fig.3 Moving Object Detection running in Alfred Camera

Fig.3 Moving Object Detection running in Alfred Camera

After getting things to run on Android, Desktop GPU (OpenGL-ES) emulation was our next target to evaluate. We are already using OpenGL-ES shaders for some computer vision operations in our pipeline. Having the capability to develop the algorithm on desktop, seeing it work in action before deployment onto mobile platforms is a huge benefit to us. The feature was not ready at the time when the project was first released, but MediaPipe team had soon added Desktop GPU emulation support for Linux in follow-up releases to make this possible. We have used the capability to detect and fix some issues in the graphs we created even before we put things on the mobile devices. Although it currently only works on Linux, it is still a big leap forward for us.

Testing the algorithms and making sure they behave as expected is also a challenge for a camera application. MediaPipe helps us simplify this by using pre-recorded MP4 files as input so we could verify the behavior simply by replaying the files. There is also built-in profiling support that makes it easy for us to locate potential performance bottlenecks.

MediaPipe – Exactly What We Were Looking For

The result of the evaluation and the feedback from our engineering team were very positive and promising:

  1. We are able to design/verify the algorithm and complete core implementations directly on the desktop emulation environment, and then migrate to the target platforms with minimum efforts. As a result, complexities of debugging on real devices are greatly reduced.
  2. MediaPipe’s modular design of graphs/calculators enables us to better split up the development into different engineers/teams, try out new pipeline design easily by rewiring the graph, and test the building blocks independently to ensure quality before we put things together.
  3. MediaPipe’s cross-platform design maximizes the reusability and minimizes fragmentation of the implementations we created. Not only are the efforts required to support a new platform greatly reduced, but we are also less worried about the behavior discrepancies on different platforms due to different interpretations of the spec from platform engineers.
  4. Built-in graphics utilities and profiling support saved us a lot of time creating those common facilities and making them right, and we could be more focused on the key designs.
  5. Tight integration with TensorFlow Lite really saves lots of effort for a company like us that heavily depends on TensorFlow, and it still gives us the flexibility to easily interface with other solutions.

With just a few weeks working with MediaPipe, it has shown strong capabilities to fundamentally transform how we develop our products. Without MediaPipe we could have spent months creating the same features without the same level of performance.

Summary

Alfred Camera is designed to bring home security with AI to everyone, and MediaPipe has significantly made achieving that goal easier for our team. From Moving Object Detection to future AI-powered features, we are focusing on transforming a basic security camera use case into a smart housekeeper that can help provide even more context that our users care about. With the support of MediaPipe, we have been able to accelerate our development process and bring the features to the market at an unprecedented speed. Our team is really excited about how MediaPipe could help us progress and discover new possibilities, and is looking forward to the enhancements that are yet to come to the project.

MediaPipe on the Web

Posted by Michael Hays and Tyler Mullen from the MediaPipe team

MediaPipe is a framework for building cross-platform multimodal applied ML pipelines. We have previously demonstrated building and running ML pipelines as MediaPipe graphs on mobile (Android, iOS) and on edge devices like Google Coral. In this article, we are excited to present MediaPipe graphs running live in the web browser, enabled by WebAssembly and accelerated by XNNPack ML Inference Library. By integrating this preview functionality into our web-based Visualizer tool, we provide a playground for quickly iterating over a graph design. Since everything runs directly in the browser, video never leaves the user’s computer and each iteration can be immediately tested on a live webcam stream (and soon, arbitrary video).

Running the MediaPipe face detection example in the Visualizer

Figure 1 shows the running of the MediaPipe face detection example in the Visualizer

MediaPipe Visualizer

MediaPipe Visualizer (see Figure 2) is hosted at viz.mediapipe.dev. MediaPipe graphs can be inspected by pasting graph code into the Editor tab or by uploading that graph file into the Visualizer. A user can pan and zoom into the graphical representation of the graph using the mouse and scroll wheel. The graph will also react to changes made within the editor in real time.

MediaPipe Visualizer hosted at https://viz.mediapipe.dev

Figure 2 MediaPipe Visualizer hosted at https://viz.mediapipe.dev

Demos on MediaPipe Visualizer

We have created several sample Visualizer demos from existing MediaPipe graph examples. These can be seen within the Visualizer by visiting the following addresses in your Chrome browser:

Edge Detection

Face Detection

Hair Segmentation

Hand Tracking

Edge detection
Face detection
Hair segmentation
Hand tracking

Each of these demos can be executed within the browser by clicking on the little running man icon at the top of the editor (it will be greyed out if a non-demo workspace is loaded):

This will open a new tab which will run the current graph (this requires a web-cam).

Implementation Details

In order to maximize portability, we use Emscripten to directly compile all of the necessary C++ code into WebAssembly, which is a special form of low-level assembly code designed specifically for web browsers. At runtime, the web browser creates a virtual machine in which it can execute these instructions very quickly, much faster than traditional JavaScript code.

We also created a simple API for all necessary communications back and forth between JavaScript and C++, to allow us to change and interact with the MediaPipe graph directly from JavaScript. For readers familiar with Android development, you can think of this as a similar process to authoring a C++/Java bridge using the Android NDK.

Finally, we packaged up all the requisite demo assets (ML models and auxiliary text/data files) as individual binary data packages, to be loaded at runtime. And for graphics and rendering, we allow MediaPipe to automatically tap directly into WebGL so that most OpenGL-based calculators can “just work” on the web.

Performance

While executing WebAssembly is generally much faster than pure JavaScript, it is also usually much slower than native C++, so we made several optimizations in order to provide a better user experience. We utilize the GPU for image operations when possible, and opt for using the lightest-weight possible versions of all our ML models (giving up some quality for speed). However, since compute shaders are not widely available for web, we cannot easily make use of TensorFlow Lite GPU machine learning inference, and the resulting CPU inference often ends up being a significant performance bottleneck. So to help alleviate this, we automatically augment our “TfLiteInferenceCalculator” by having it use the XNNPack ML Inference Library, which gives us a 2-3x speedup in most of our applications.

Currently, support for web-based MediaPipe has some important limitations:

  • Only calculators in the demo graphs above may be used
  • The user must edit one of the template graphs; they cannot provide their own from scratch
  • The user cannot add or alter assets
  • The executor for the graph must be single-threaded (i.e. ApplicationThreadExecutor)
  • TensorFlow Lite inference on GPU is not supported

We plan to continue to build upon this new platform to provide developers with much more control, removing many if not all of these limitations (e.g. by allowing for dynamic management of assets). Please follow the MediaPipe tag on the Google Developer blog and Google Developer twitter account. (@googledevs)

Acknowledgements

We would like to thank Marat Dukhan, Chuo-Ling Chang, Jianing Wei, Ming Guang Yong, and Matthias Grundmann for contributing to this blog post.

Object Detection and Tracking using MediaPipe

Posted by Ming Guang Yong, Product Manager for MediaPipe

MediaPipe in 2019

MediaPipe is a framework for building cross platform multimodal applied ML pipelines that consist of fast ML inference, classic computer vision, and media processing (e.g. video decoding). MediaPipe was open sourced at CVPR in June 2019 as v0.5.0. Since our first open source version, we have released various ML pipeline examples like

In this blog, we will introduce another MediaPipe example: Object Detection and Tracking. We first describe our newly released box tracking solution, then we explain how it can be connected with Object Detection to provide an Object Detection and Tracking system.

Box Tracking in MediaPipe

In MediaPipe v0.6.7.1, we are excited to release a box tracking solution, that has been powering real-time tracking in Motion Stills, YouTube’s privacy blur, and Google Lens for several years and that is leveraging classic computer vision approaches. Pairing tracking with ML inference results in valuable and efficient pipelines. In this blog, we pair box tracking with object detection to create an object detection and tracking pipeline. With tracking, this pipeline offers several advantages over running detection per frame:

  • It provides instance based tracking, i.e. the object ID is maintained across frames.
  • Detection does not have to run every frame. This enables running heavier detection models that are more accurate while keeping the pipeline lightweight and real-time on mobile devices.
  • Object localization is temporally consistent with the help of tracking, meaning less jitter is observable across frames.

Our general box tracking solution consumes image frames from a video or camera stream, and starting box positions with timestamps, indicating 2D regions of interest to track, and computes the tracked box positions for each frame. In this specific use case, the starting box positions come from object detection, but the starting position can also be provided manually by the user or another system. Our solution consists of three main components: a motion analysis component, a flow packager component, and a box tracking component. Each component is encapsulated as a MediaPipe calculator, and the box tracking solution as a whole is represented as a MediaPipe subgraph shown below.

Visualization of Tracking State for Each Box

MediaPipe Box Tracking Subgraph

The MotionAnalysis calculator extracts features (e.g. high-gradient corners) across the image, tracks those features over time, classifies them into foreground and background features, and estimates both local motion vectors and the global motion model. The FlowPackager calculator packs the estimated motion metadata into an efficient format. The BoxTracker calculator takes this motion metadata from the FlowPackager calculator and the position of starting boxes, and tracks the boxes over time. Using solely the motion data (without the need for the RGB frames) produced by the MotionAnalysis calculator, the BoxTracker calculator tracks individual objects or regions while discriminating from others. To track an input region, we first use the motion data corresponding to this region and employ iteratively reweighted least squares (IRLS) fitting a parametric model to the region’s weighted motion vectors. Each region has a tracking state including its prior, mean velocity, set of inlier and outlier feature IDs, and the region centroid. See the figure below for a visualization of the tracking state, with green arrows indicating motion vectors of inliers, and red arrows indicating motion vectors of outliers. Note that by only relying on feature IDs we implicitly capture the region’s appearance, since each feature’s patch intensity stays roughly constant over time. Additionally, by decomposing a region’s motion into that of the camera motion and the individual object motion, we can even track featureless regions.

Visualization of Tracking State for Each Box

An advantage of our architecture is that by separating motion analysis into a dedicated MediaPipe calculator and tracking features over the whole image, we enable great flexibility and constant computation independent of the number of regions tracked! By not having to rely on the RGB frames during tracking, our tracking solution provides the flexibility to cache the metadata across a batch of frame. Caching enables tracking of regions both backwards and forwards in time; or even sync directly to a specified timestamp for tracking with random access.

Object Detection and Tracking

A MediaPipe example graph for object detection and tracking is shown below. It consists of 4 compute nodes: a PacketResampler calculator, an ObjectDetection subgraph released previously in the MediaPipe object detection example, an ObjectTracking subgraph that wraps around the BoxTracking subgraph discussed above, and a Renderer subgraph that draws the visualization.

MediaPipe Example Graph for Object Detection and Tracking. Boxes in purple are subgraphs.

In general, the ObjectDetection subgraph (which performs ML model inference internally) runs only upon request, e.g. at an arbitrary frame rate or triggered by specific signals. More specifically, in this example PacketResampler temporally subsamples the incoming video frames to 0.5 fps before they are passed into ObjectDetection. This frame rate can be configured differently as an option in PacketResampler.

The ObjectTracking subgraph runs in real-time on every incoming frame to track the detected objects. It expands the BoxTracking subgraph described above with additional functionality: when new detections arrive it uses IoU (Intersection over Union) to associate the current tracked objects/boxes with new detections to remove obsolete or duplicated boxes.

A sample result of this object detection and tracking example can be found below. The left image is the result of running object detection per frame. The right image is the result of running object detection and tracking. Note that the result with tracking is much more stable with less temporal jitter. It also maintains object IDs across frames.

Comparison Between Object Detection Per Frame and Object Detection and Tracking

Follow MediaPipe

This is our first Google Developer blog post for MediaPipe. We look forward to publishing new blog posts related to new MediaPipe ML pipeline examples and features. Please follow the MediaPipe tag on the Google Developer blog and Google Developer twitter account (@googledevs)

Acknowledgements

We would like to thank Fan Zhang, Genzhi Ye, Jiuqiang Tang, Jianing Wei, Chuo-Ling Chang, Ming Guang Yong, and Matthias Grundman for building the object detection and tracking solution in MediaPipe and contributing to this blog post.