Understanding the Real Time Person Removal project – TensorFlow.js

Most of us interested in TensorFlow.js would be familiar with the popular real-time disappearing-people demo by Jason Mayes. It is a fantastic example of what is possible using TensorFlow.js in the browser. Jason has graciously shared the code of this application so that others can learn from it and build their own versions of such applications.

In this article, I will attempt to document and walk through this code, so that it helps anyone who is trying to understand how this app works.

At a high level, this app uses the TensorFlow.js pre-trained BodyPix model to identify the person in the image (using its pixel-wise classification output) and remove them from the screen in real time.

Let’s dive into the code now.

Index.html file

Let’s discuss the two key things in the index.html file.

1. Load the JavaScript packages required for this app

This app requires the TensorFlow.js library and the BodyPix package, which will be used to identify the pixels that correspond to a person in the image.

The script.js file is the JavaScript file where the code for this app resides.
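For reference, a minimal sketch of how these scripts might be included is shown below; the CDN URLs and the defer attribute are illustrative of a typical setup and are not copied verbatim from the project.

<!-- Load TensorFlow.js and the BodyPix model package -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/body-pix"></script>
<!-- The app's own code -->
<script src="script.js" defer></script>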

2. The video element

There is a div element (id=liveview) defined in index.html that wraps the video element. The video element displays the live video from the webcam, whose frames are then processed by the BodyPix model.
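A rough sketch of this markup is shown below; the id and the attributes on the video element are placeholders and may differ from the project.

<div id="liveview">
  <!-- Live webcam feed; autoplay and muted so the stream plays without user interaction -->
  <video id="webcam" autoplay muted></video>
</div>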

We are now ready to walk through the code in the script.js file where all the action happens.

Script.js file

Let’s go through this code step by step to understand it in detail.

1. Declare the required objects and configs

To start with, declare the variables and the configurations required as input for the BodyPix model. The object “bodyPixProperties” contains parameters that will be provided as input while loading the BodyPix model, and the object “segmentationProperties” holds the parameters that will be provided as input during person segmentation. Please refer to the BodyPix model’s GitHub page to get a better understanding of the configurations in these two objects.
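As an illustration, these two objects typically look something like the following; the exact values below are placeholders, so check the BodyPix GitHub page for the full list of supported options.

// Parameters passed to bodyPix.load() - values are illustrative
const bodyPixProperties = {
  architecture: 'MobileNetV1',  // lighter and faster than 'ResNet50'
  outputStride: 16,             // smaller stride = more accurate but slower
  multiplier: 0.75,             // MobileNet width multiplier
  quantBytes: 4                 // number of bytes used for weight quantization
};

// Parameters passed to segmentPerson() - values are illustrative
const segmentationProperties = {
  flipHorizontal: false,
  internalResolution: 'high',
  segmentationThreshold: 0.9    // confidence required to mark a pixel as "person"
};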

2. Load the BodyPix model

The object bodyPix is available because we loaded the body-pix package using a script tag in the index.html file. Invoke the load method, passing the “bodyPixProperties” object as input. This method returns a promise which resolves to the BodyPix model object. Once the model is loaded, the variable “modelHasLoaded” is set to true, and the “section” element which wraps the div element (liveview) is made visible, which displays the “Enable Webcam” button on the screen.
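A minimal sketch of this step, assuming the wrapping section is hidden with an “invisible” CSS class and looked up by a placeholder id such as demosSection:

var modelHasLoaded = false;
var model = undefined;

// bodyPix is a global provided by the body-pix script tag in index.html
bodyPix.load(bodyPixProperties).then(function (loadedModel) {
  model = loadedModel;
  modelHasLoaded = true;
  // Reveal the section that wraps the liveview div and the "Enable Webcam" button
  document.getElementById('demosSection').classList.remove('invisible');
});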

3. Declare the canvas elements

Two canvas elements are created; a setup sketch follows after this list.

  • The “videoRenderCanvas” is used as an in-memory canvas that temporarily holds the current video frame while it is classified. This canvas is not added to the DOM and will not be displayed on the screen.
  • The “webcamCanvas” element is the one that is used to render the output frames after removing the person. This is the one displayed at the bottom of the screen (below the live video) where the person is invisible. 

Please note that there is a third canvas element named “bodyPixCanvas” that is commented out, as it is not used.
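A sketch of how these two canvases might be set up; appending the visible canvas to the liveview container is an assumption here, and the project may attach it differently.

// In-memory canvas that holds the current video frame during classification
var videoRenderCanvas = document.createElement('canvas');
var videoRenderCanvasCtx = videoRenderCanvas.getContext('2d');

// Visible canvas that renders the output with the person removed
var webcamCanvas = document.createElement('canvas');
var webcamCanvasCtx = webcamCanvas.getContext('2d');
document.getElementById('liveview').appendChild(webcamCanvas);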

4. Add event listener to the “Enable Webcam” button

Check whether webcam access is supported, and then add an event listener to the “Enable Webcam” button’s “click” event, passing the enableCam() callback function.
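In outline, this step could look like the following; hasGetUserMedia() is a small helper sketched here, and webcamButton is a placeholder id for the button.

// Returns true if the browser exposes the getUserMedia API
function hasGetUserMedia() {
  return !!(navigator.mediaDevices && navigator.mediaDevices.getUserMedia);
}

const enableWebcamButton = document.getElementById('webcamButton');

if (hasGetUserMedia()) {
  enableWebcamButton.addEventListener('click', enableCam);
} else {
  console.warn('getUserMedia() is not supported by your browser');
}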

5. enableCam() function

Let’s break down the code in this function and look at it in detail. A combined sketch of the whole function follows at the end of step 5.2.

5.1 Check whether the BodyPix model is loaded

  • If the BodyPix model is not loaded, then exit from the function. 
  • Hide the “Enable Webcam” button after it is clicked once, as the video will start playing and classification will run continuously on the live feed.

5.2 Activate the webcam

Declare the parameters that will be passed to the getUserMedia() function. Since this app only requires the video feed, only the video parameter is set to “true”.

Invoke the getUserMedia() function which returns a Promise that resolves to the live webcam video stream. 

Add an event listener for the “loadedmetadata” event. In the callback function of this event listener, first update the width and height of the two canvas elements that we declared earlier to match the video element’s width and height. The width and height of the video and canvas elements must match for the BodyPix model to process the frames and output the results accurately.

Display the first frame of the video in the “webcamCanvas” which will be displayed below the live video on the screen.

Assign the video stream as the source for the video element, so that the live view from the webcam is displayed on the screen.

Finally, add an event listener to the “loadeddata” event of the video object, passing the predictWebcam() callback function.
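Putting steps 5.1 and 5.2 together, a simplified sketch of enableCam() might look like this; the element ids, class names and the video variable (the video element grabbed from the DOM) are placeholders.

function enableCam(event) {
  // 5.1 - Do nothing until the BodyPix model has finished loading
  if (!modelHasLoaded) {
    return;
  }
  // Hide the button once clicked; classification will now run continuously
  event.target.classList.add('removed');

  // 5.2 - We only need the video feed, so only video is requested
  const constraints = { video: true };

  navigator.mediaDevices.getUserMedia(constraints).then(function (stream) {
    video.addEventListener('loadedmetadata', function () {
      // Canvas sizes must match the video size for BodyPix to work accurately
      videoRenderCanvas.width = video.videoWidth;
      videoRenderCanvas.height = video.videoHeight;
      webcamCanvas.width = video.videoWidth;
      webcamCanvas.height = video.videoHeight;
      // Draw the first frame so the output canvas is not blank
      webcamCanvasCtx.drawImage(video, 0, 0);
    });

    // Show the live webcam stream in the video element
    video.srcObject = stream;

    // Start the prediction loop once video data is available
    video.addEventListener('loadeddata', predictWebcam);
  });
}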

6. predictWebcam() function

This function is invoked repeatedly: each frame is passed to the BodyPix model for classification, and the model’s output is then processed to remove the person from the frame.

Copy the video frame to the “videoRenderCanvas” which is present only in-memory and not displayed on the screen.

Invoke “model.segmentPerson()”, passing the “videoRenderCanvas” which holds the current frame of the video and the “segmentationProperties” object that was declared earlier. This method returns a Promise that resolves to the model’s output object – “segmentation”. Once the output from the model is available, invoke the “processSegmentation” function to process the model’s output and remove the person from the image. The processed image will then be displayed in the “webcamCanvas” element.

Pass this function to the window.requestAnimationFrame() method so that it is invoked again for the next frame of the video.
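A simplified sketch of this loop, assuming processSegmentation() takes the segmentation result as an argument:

function predictWebcam() {
  // Copy the current video frame onto the in-memory canvas
  videoRenderCanvasCtx.drawImage(video, 0, 0);

  // Run BodyPix person segmentation on that frame
  model.segmentPerson(videoRenderCanvas, segmentationProperties).then(function (segmentation) {
    // Remove the person from the frame and draw the result onto webcamCanvas
    processSegmentation(segmentation);

    // Schedule this function to run again for the next frame
    window.requestAnimationFrame(predictWebcam);
  });
}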

7. processSegmentation() function

The BodyPix model’s output object contains the “width” and “height” of the image, an “allPoses” array, and a “data” array with a 0 or 1 for every pixel in the image. Pixels that are part of a person’s body are marked with 1s in the “data” array; all other pixels are 0s.

{
  width: 640,
  height: 480,
  data: Uint8Array(307200) [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, …],
  allPoses: [{"score": 0.4, "keypoints": […]}, …]
}

The “webcamCanvas” element at this point holds the previous frame image that was drawn onto it. The video element holds the live frame from the webcam, and we now have the pixel-wise classification output for that live frame.

If the model’s output detects a person’s body, the idea is to loop through all the pixels in the image and update only those pixels of the “webcamCanvas” element with the live image data where the data array contains a 0 (no person found) for that pixel index. In other words, we update the “webcamCanvas” frame with live data in all pixels other than the ones where a person’s body is identified. The area where a person’s body is found is left untouched, so that it keeps the previous frame’s data.

If the model’s output doesn’t detect a person’s body, then update all the pixels of the “webcamCanvas” element with the live image data from the video element.

Let’s look at the code in detail.

7.1 Get the images data

The “webcamCanvas” element currently holds the previous frame image. Get that image by invoking the ctx.getImageData() function. 

The “videoRenderCanvas” element currently holds the live video frame. Get that image by invoking the videoRenderCanvasCtx.getImageData() function.

Get the data property (array) from the ImageData of both “webcamCanvas” and “videoRenderCanvas” elements.
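In code, this step could look like the following sketch (the variable names dataL and dataR are illustrative, and this is assumed to run inside processSegmentation()):

// Previously rendered frame currently shown on the output canvas
const currentRender = webcamCanvasCtx.getImageData(0, 0, webcamCanvas.width, webcamCanvas.height);
// Live frame that was just copied onto the in-memory canvas
const liveData = videoRenderCanvasCtx.getImageData(0, 0, webcamCanvas.width, webcamCanvas.height);

const dataL = liveData.data;       // live pixel values (RGBA)
const dataR = currentRender.data;  // previously rendered pixel values (RGBA)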

7.2 Identify the bounding box coordinates for the person’s body

As a next step, the rectangular bounding box coordinates for the person’s body in the image are identified. Loop through all pixels of the image and calculate the bounding box coordinates from those pixel indexes where a 1 is present in the “data” array of the model’s output.

In the code, the outer for loop iterates over the entire width of the image, and the inner for loop iterates over the entire height. So, for each x value, every y value is visited. The “n” variable holds the index position of the pixel, in other words the nth pixel being examined in the image.
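A sketch of this nested loop, with illustrative variable names for the bounding box extremes:

let minX = webcamCanvas.width, minY = webcamCanvas.height;
let maxX = 0, maxY = 0;
let foundBody = false;

// Visit every pixel column by column; n is the flat index of pixel (x, y)
for (let x = 0; x < webcamCanvas.width; x++) {
  for (let y = 0; y < webcamCanvas.height; y++) {
    const n = y * webcamCanvas.width + x;
    if (segmentation.data[n] !== 0) {
      // Pixel belongs to a person: grow the bounding box to include it
      if (x < minX) minX = x;
      if (y < minY) minY = y;
      if (x > maxX) maxX = x;
      if (y > maxY) maxY = y;
      foundBody = true;
    }
  }
}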

7.3 Calculate the dimensions of the bounding box

Get the width and height of the bounding box by subtracting the minimum X and Y coordinates from the maximum ones.

Increase the width and height by a factor of 1.3 to account for false negatives around the region where a body is identified. Since the width and height are now scaled up, recalculate the X and Y coordinates so that the bounding box stays centered on the same point.
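A sketch of this calculation, continuing with the variables from the previous sketch:

// Raw bounding box size
const width = maxX - minX;
const height = maxY - minY;

// Grow the box by 30% to absorb false negatives around the body
const scale = 1.3;
const scaledWidth = width * scale;
const scaledHeight = height * scale;

// Shift the top-left corner so the enlarged box stays centered on the same point
const newX = minX - (scaledWidth - width) / 2;
const newY = minY - (scaledHeight - height) / 2;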

7.4 Update the background of the canvas outside of the bounding box

If a person’s body is identified, loop through all the pixels and update only those pixels of the “webcamCanvas” element that lie outside the bounding box coordinates with the live video data.

If no person’s body is identified in the current frame, then update all the pixels of the “webcamCanvas” element with the live video data.

7.4.1 How the image data is updated

The “ImageData” object returned by the getImageData() function contains information about every pixel of the image. The “data” property of the “ImageData” object contains an array of color/alpha information for every pixel in the image.

Please refer to the W3Schools link in the references below to understand more about the getImageData() function and its data property.

To quote from that link,
“For every pixel in an ImageData object there are four pieces of information, the RGBA values:

R - The color red (from 0-255)
G - The color green (from 0-255)
B - The color blue (from 0-255)
A - The alpha channel (from 0-255; 0 is transparent and 255 is fully visible)

The color/alpha information is held in an array, and is stored in the data property of the ImageData object.”

The “data” property of “ImageData” will contain an array of size width * height * 4, where width and height are the dimensions of the image. 

So, for every pixel, 4 items in the “ImageData.data” array are updated, corresponding to the RGBA values of that pixel.

Finally draw the ImageData on the “webcamCanvas” element by invoking the putImageData() function on its context.
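Putting steps 7.4 and 7.4.1 together, a simplified sketch of the update loop, continuing with the illustrative variables from the earlier sketches:

for (let x = 0; x < webcamCanvas.width; x++) {
  for (let y = 0; y < webcamCanvas.height; y++) {
    // Copy this pixel only if no body was found, or if it lies outside
    // the enlarged bounding box around the detected body
    const outsideBox = x < newX || x > newX + scaledWidth ||
                       y < newY || y > newY + scaledHeight;
    if (!foundBody || outsideBox) {
      const n = y * webcamCanvas.width + x;
      // Each pixel occupies 4 consecutive entries in the data array: R, G, B, A
      dataR[n * 4]     = dataL[n * 4];      // red
      dataR[n * 4 + 1] = dataL[n * 4 + 1];  // green
      dataR[n * 4 + 2] = dataL[n * 4 + 2];  // blue
      dataR[n * 4 + 3] = dataL[n * 4 + 3];  // alpha
    }
  }
}

// Draw the updated pixel data back onto the visible output canvas
webcamCanvasCtx.putImageData(currentRender, 0, 0);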

That’s it! I hope this helps you understand how the app works a little quicker. Please leave your comments below if you have any questions.

References

  1. Github repository for the Real-Time-Person-Removal project by Jason Mayes.
  2. Github repository for the TensorFlow.js pre-trained BodyPix model.
  3. HTML canvas getImageData() method from W3Schools.
  4. Real-time Human Pose Estimation in the Browser with TensorFlow.js using PoseNet model.
