Here we look at: Setting up BodyPix, detecting face touches, how I wrote my predictImage() function from the starting point template, using the distance formula to check for face region overlap, and how we can use BodyPix to estimate a person’s body poses.
TensorFlow + JavaScript. The most popular, cutting-edge AI framework now supports the most widely used programming language on the planet, so let’s make magic happen through deep learning right in our web browser, GPU-accelerated via WebGL using TensorFlow.js!
In the previous article, we trained an AI with TensorFlow.js to simulate the donottouchyourface.com app, which was designed to help people reduce the risk of getting sick by learning to stop touching their face. In this article, we are going to use BodyPix, a body-part detection and segmentation library, to try to remove the training step from face touch detection.
Starting Point
For this project, we need to:
- Import TensorFlow.js and BodyPix
- Add the video element
- Add a canvas for debugging
- Add a text element for Touch vs No Touch status
- Add the webcam setup functionality
- Run the model prediction every 200 ms instead of on a single picked image, but only after the pre-trained model has loaded and the webcam is ready
Here is our starting point:
<html>
    <head>
        <title>Face Touch Detection with TensorFlow.js Part 2: Using BodyPix</title>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/body-pix@2.0"></script>
        <style>
            img, video {
                object-fit: cover;
            }
        </style>
    </head>
    <body>
        <video autoplay playsinline muted id="webcam" width="224" height="224"></video>
        <canvas id="canvas" width="224" height="224"></canvas>
        <h1 id="status">Loading...</h1>
        <script>
        async function setupWebcam() {
            return new Promise( ( resolve, reject ) => {
                const webcamElement = document.getElementById( "webcam" );
                const navigatorAny = navigator;
                navigator.getUserMedia = navigator.getUserMedia ||
                    navigatorAny.webkitGetUserMedia || navigatorAny.mozGetUserMedia ||
                    navigatorAny.msGetUserMedia;
                if( navigator.getUserMedia ) {
                    navigator.getUserMedia( { video: true },
                        stream => {
                            webcamElement.srcObject = stream;
                            webcamElement.addEventListener( 'loadeddata', resolve, false );
                        },
                        error => reject());
                }
                else {
                    reject();
                }
            });
        }

        (async () => {
            await setupWebcam();
            setInterval( predictImage, 200 );
        })();

        async function predictImage() {
        }
        </script>
    </body>
</html>
Setting Up BodyPix
BodyPix takes several parameters when loading, and you might recognize some of them. It supports two different pre-trained model architectures: MobileNetV1 and ResNet50. The required parameters vary depending on the model you choose. We will use MobileNetV1, declare a model variable at the top of the script, and initialize BodyPix with the following code:
let model = null;

(async () => {
    model = await bodyPix.load({
        architecture: 'MobileNetV1',
        outputStride: 16,
        multiplier: 0.50,
        quantBytes: 2
    });
    await setupWebcam();
    setInterval( predictImage, 200 );
})();
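If you want to experiment with the larger, more accurate (but slower) ResNet50 model instead, the load call would look roughly like the sketch below. The outputStride value follows the BodyPix documentation's ResNet50 examples, and multiplier is omitted because it only applies to MobileNetV1; treat this as a starting point rather than a tuned configuration.

// Alternative: a heavier ResNet50 configuration (illustrative values)
model = await bodyPix.load({
    architecture: 'ResNet50',
    outputStride: 32,
    quantBytes: 2
});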
Detecting Face Touches
With body part segmentation, we get two pieces of data from BodyPix:
- Key points of body parts, such as the nose, ears, wrists, and elbows, represented in 2-D screen pixel coordinates
- The 2-D segmentation pixel data, stored in a 1-D array format (see the sketch just below)
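To make the structure of that data concrete, here is a rough sketch of what segmentPersonParts() returns; the field names follow the BodyPix documentation, and the specific numbers are purely illustrative:

// Illustrative only – run inside an async function once the model has loaded
const segmentation = await model.segmentPersonParts( document.getElementById( "webcam" ) );
console.log( segmentation.width, segmentation.height );   // e.g. 224 224
console.log( segmentation.data.length );                  // width * height part IDs, one per pixel;
                                                           // -1 = background, 10 = left_hand, 11 = right_hand
console.log( segmentation.allPoses[ 0 ].keypoints[ 0 ] );
// e.g. { part: "nose", position: { x: 112, y: 74 }, score: 0.99 }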
After brief testing, I found that the key point coordinates retrieved for the nose and ears were fairly reliable while the points for a person’s wrists were not accurate enough to determine whether a hand is touching the face. Therefore, we will use the segmentation pixels to determine face touch.
Because the nose and ears key points seem reliable, we can use them to estimate a circle region for the person’s face. Using this circle region, we can determine if any left-hand or right-hand segmentation pixels overlap the area – and mark the status as a face touch.
Here’s how I wrote my predictImage() function from the starting point template, using the distance formula to check for face region overlap:
async function predictImage() {
    const img = document.getElementById( "webcam" );
    const segmentation = await model.segmentPersonParts( img );
    if( segmentation.allPoses.length > 0 ) {
        const keypoints = segmentation.allPoses[ 0 ].keypoints;
        const nose = keypoints[ 0 ].position;
        const earL = keypoints[ 3 ].position;
        const earR = keypoints[ 4 ].position;
        const earLtoNose = Math.sqrt( Math.pow( nose.x - earL.x, 2 ) + Math.pow( nose.y - earL.y, 2 ) );
        const earRtoNose = Math.sqrt( Math.pow( nose.x - earR.x, 2 ) + Math.pow( nose.y - earR.y, 2 ) );
        const faceRadius = Math.max( earLtoNose, earRtoNose );
        // Check if any of the left_hand(10) or right_hand(11) pixels fall within faceRadius of the nose
        let isTouchingFace = false;
        for( let y = 0; y < 224; y++ ) {
            for( let x = 0; x < 224; x++ ) {
                if( segmentation.data[ y * 224 + x ] === 10 ||
                    segmentation.data[ y * 224 + x ] === 11 ) {
                    const distToNose = Math.sqrt( Math.pow( nose.x - x, 2 ) + Math.pow( nose.y - y, 2 ) );
                    // console.log( distToNose );
                    if( distToNose < faceRadius ) {
                        isTouchingFace = true;
                        break;
                    }
                }
            }
            if( isTouchingFace ) {
                break;
            }
        }
        if( isTouchingFace ) {
            document.getElementById( "status" ).innerText = "Touch";
        }
        else {
            document.getElementById( "status" ).innerText = "Not Touch";
        }
        // --- Uncomment the following to view the BodyPix mask ---
        // const canvas = document.getElementById( "canvas" );
        // bodyPix.drawMask(
        //     canvas, img,
        //     bodyPix.toColoredPartMask( segmentation ),
        //     0.7,
        //     0,
        //     false
        // );
    }
}
If you would like to see the pixels predicted by BodyPix, you can uncomment the bottom section of the function.
My approach in predictImage() is a very rough estimate based on the proximity of hand pixels to the face region. A fun challenge for you might be to find a more accurate way to detect when a person’s hand has touched the face!
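As one possible starting point for that challenge (this is my own suggestion, not part of the original approach), the sketch below counts the hand pixels inside the face circle and only reports a touch once the count crosses a small threshold, which should reduce flicker caused by single stray pixels. The helper name and the threshold value are hypothetical and would need tuning for your setup:

// Hypothetical helper: count hand pixels inside the face circle instead of stopping at the first one.
// "segmentation", "nose", and "faceRadius" are assumed to be computed exactly as in predictImage() above.
function countHandPixelsNearFace( segmentation, nose, faceRadius ) {
    let count = 0;
    for( let y = 0; y < segmentation.height; y++ ) {
        for( let x = 0; x < segmentation.width; x++ ) {
            const part = segmentation.data[ y * segmentation.width + x ];
            if( part === 10 || part === 11 ) { // left_hand or right_hand
                const distToNose = Math.sqrt( Math.pow( nose.x - x, 2 ) + Math.pow( nose.y - y, 2 ) );
                if( distToNose < faceRadius ) {
                    count++;
                }
            }
        }
    }
    return count;
}

// Possible usage inside predictImage(), with an illustrative threshold:
// const isTouchingFace = countHandPixelsNearFace( segmentation, nose, faceRadius ) >= 20;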
Technical Footnotes
- One advantage of using BodyPix for Face Touch Detection is that the user does not need to train an AI with examples of the undesired behavior
- Another advantage of BodyPix is that it can still segment the face in front even when the person’s hand is hidden behind it.
- This approach is more specific to recognizing a face touch action than the one we used in the previous article; however, the earlier, trained approach may result in more accurate predictions given enough sample data
- Expect performance issues, as BodyPix is computationally expensive; one way to mitigate this is sketched below
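On that last point, the BodyPix documentation describes a configuration object for segmentPersonParts() that lets you trade accuracy for speed. A minimal sketch, with illustrative values rather than tuned ones:

// Sketch: lower the internal resolution to reduce the per-frame cost of segmentation
const segmentation = await model.segmentPersonParts( img, {
    internalResolution: 'low',   // downscale the input internally before running inference
    segmentationThreshold: 0.7,  // minimum confidence for a pixel to count as part of a person
    maxDetections: 1             // we only expect a single person in front of the webcam
});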
Finish Line
For your reference, here is the full code for this project:
<html>
    <head>
        <title>Face Touch Detection with TensorFlow.js Part 2: Using BodyPix</title>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/body-pix@2.0"></script>
        <style>
            img, video {
                object-fit: cover;
            }
        </style>
    </head>
    <body>
        <video autoplay playsinline muted id="webcam" width="224" height="224"></video>
        <canvas id="canvas" width="224" height="224"></canvas>
        <h1 id="status">Loading...</h1>
        <script>
        async function setupWebcam() {
            return new Promise( ( resolve, reject ) => {
                const webcamElement = document.getElementById( "webcam" );
                const navigatorAny = navigator;
                navigator.getUserMedia = navigator.getUserMedia ||
                    navigatorAny.webkitGetUserMedia || navigatorAny.mozGetUserMedia ||
                    navigatorAny.msGetUserMedia;
                if( navigator.getUserMedia ) {
                    navigator.getUserMedia( { video: true },
                        stream => {
                            webcamElement.srcObject = stream;
                            webcamElement.addEventListener( 'loadeddata', resolve, false );
                        },
                        error => reject());
                }
                else {
                    reject();
                }
            });
        }

        let model = null;

        (async () => {
            model = await bodyPix.load({
                architecture: 'MobileNetV1',
                outputStride: 16,
                multiplier: 0.50,
                quantBytes: 2
            });
            await setupWebcam();
            setInterval( predictImage, 200 );
        })();

        async function predictImage() {
            const img = document.getElementById( "webcam" );
            const segmentation = await model.segmentPersonParts( img );
            if( segmentation.allPoses.length > 0 ) {
                const keypoints = segmentation.allPoses[ 0 ].keypoints;
                const nose = keypoints[ 0 ].position;
                const earL = keypoints[ 3 ].position;
                const earR = keypoints[ 4 ].position;
                const earLtoNose = Math.sqrt( Math.pow( nose.x - earL.x, 2 ) + Math.pow( nose.y - earL.y, 2 ) );
                const earRtoNose = Math.sqrt( Math.pow( nose.x - earR.x, 2 ) + Math.pow( nose.y - earR.y, 2 ) );
                const faceRadius = Math.max( earLtoNose, earRtoNose );
                let isTouchingFace = false;
                for( let y = 0; y < 224; y++ ) {
                    for( let x = 0; x < 224; x++ ) {
                        if( segmentation.data[ y * 224 + x ] === 10 ||
                            segmentation.data[ y * 224 + x ] === 11 ) {
                            const distToNose = Math.sqrt( Math.pow( nose.x - x, 2 ) + Math.pow( nose.y - y, 2 ) );
                            if( distToNose < faceRadius ) {
                                isTouchingFace = true;
                                break;
                            }
                        }
                    }
                    if( isTouchingFace ) {
                        break;
                    }
                }
                if( isTouchingFace ) {
                    document.getElementById( "status" ).innerText = "Touch";
                }
                else {
                    document.getElementById( "status" ).innerText = "Not Touch";
                }
            }
        }
        </script>
    </body>
</html>
What’s Next? Can We Do Even More With TensorFlow.js?
In this project, we saw how easily we can use BodyPix to estimate a person’s body poses. For the next project, let’s revisit the webcam transfer learning and have a bit of fun with it.
Follow along with the next article in this series to see if we can train an AI to deep-learn some hand gestures and sign language.
Raphael Mun is a tech entrepreneur and educator who has been developing software professionally for over 20 years. He currently runs Lemmino, Inc and teaches and entertains through his Instafluff livestreams on Twitch building open source projects with his community.