In this article, we will employ the MobileNet object detector to find people in a video sequence. As we will see after running the code, the detections are not perfect.
Object detectors are usually applied to video streams from various cameras. Sometimes, however, you perform object detection in post-processing, where you are given the complete video file and must look for specific objects. In this article, we will start by doing just that. Then, we will see how to filter the detection results to show only people. We will achieve the results shown in the figures below (note that in the figure on the right, the bicycle is not detected).
For object detection, we use TensorFlow and MobileNet models. The video sequence comes from this link. All the companion code is here.


Reading a Video File
To read the video file, I created a VideoReader class (see video_reader.py in the Part_05 folder). Internally, this class employs OpenCV's VideoCapture. My use of VideoCapture is quite similar to the previous case, back when we were reading frames from the camera. The major difference is that I need to pass the file path to the VideoCapture initializer:
def __init__(self, file_path):
    try:
        self.video_capture = opencv.VideoCapture(file_path)
    except Exception as e:
        print(e)
Then, I read the consecutive frames from the file by invoking the read method of the VideoCapture class instance:
def read_next_frame(self):
    (capture_status, frame) = self.video_capture.read()
    if(capture_status):
        return frame
    else:
        return None
To use the VideoReader class, first invoke the initializer to provide the input video file, then invoke the read_next_frame method as many times as needed to read the frames. When the method reaches the end of the file, it will return None.
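The read-until-None pattern described above can be sketched in isolation. In this minimal example, a FakeReader class (a stand-in invented for illustration, not part of the article's code) serves frames from a list instead of a video file, so the loop structure is runnable without OpenCV or an input video:

```python
# Self-contained sketch of the read-until-None pattern used by VideoReader.
# FakeReader is a hypothetical stand-in that serves "frames" from a list.
class FakeReader:
    def __init__(self, frames):
        self._frames = list(frames)

    def read_next_frame(self):
        # Mirror VideoReader: return the next frame, or None at end of file
        return self._frames.pop(0) if self._frames else None

reader = FakeReader(['frame0', 'frame1', 'frame2'])
count = 0
while True:
    frame = reader.read_next_frame()
    if frame is None:
        break
    count += 1
print(count)  # → 3
```

The same loop works unchanged with the real VideoReader, since both expose a read_next_frame method that returns None at the end of the sequence.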
People Detection
To detect people, I started with the modules I created previously, including the Inference and ImageHelper classes. We will reference them in main.py. The source code of those modules is included in the Part_03 folder and explained in a previous article.
To reference the modules, I supplemented the main.py file with the following statements, assuming that the main script is executed from the Part_05 folder:
import sys
sys.path.insert(1, '../Part_03/')
from inference import Inference as model
from image_helper import ImageHelper as imgHelper
from video_reader import VideoReader as videoReader
Consequently, we can easily access the object detection on the frames of the video file:
model_file_path = '../Models/01_model.tflite'
labels_file_path = '../Models/02_labels.txt'
ai_model = model(model_file_path, labels_file_path)
video_file_path = '../Videos/01.mp4'
video_reader = videoReader(video_file_path)
frame = video_reader.read_next_frame()
score_threshold = 0.5
results = ai_model.detect_objects(frame, score_threshold)
However, the problem is that we detect all the objects that the model was trained for. To detect only people, we need to filter the results returned by the detect_objects method. For filtering purposes, we use the label of the detected object. The filtering method can be implemented as follows:
def detect_people(self, image, threshold):
    all_objects = self.detect_objects(image, threshold)
    people = filter(lambda r: r['label'] == 'person', all_objects)
    return list(people)
I added the above method, detect_people, to the Inference class (see inference.py in the Part_03 folder). The detect_people method internally invokes detect_objects and then filters the results using filter, a built-in Python function. The first parameter of filter is the filtering function. Here, I use an anonymous lambda function that returns a boolean value: True when the label of the current detection result is "person", and False otherwise.
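The filtering step can be demonstrated on its own with hand-made detection results. The dictionaries below mimic the shape of the results returned by detect_objects (the 'label' and 'score' keys follow the article; the concrete values are invented for illustration):

```python
# Minimal demonstration of filtering detection results by label.
# The list mimics the dictionaries returned by detect_objects.
all_objects = [
    {'label': 'person', 'score': 0.81},
    {'label': 'bicycle', 'score': 0.66},
    {'label': 'person', 'score': 0.52},
]

# filter() keeps only the items for which the lambda returns True
people = list(filter(lambda r: r['label'] == 'person', all_objects))
print(len(people))  # → 2
```

An equivalent list comprehension, `[r for r in all_objects if r['label'] == 'person']`, would produce the same result; filter with a lambda simply matches the style used in the article.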
Displaying Detection Results
To display the detected people, I use the static display_image_with_detected_objects method from the image_helper module. However, the display_image_with_detected_objects method was intended to display the image until the user presses a key. If I use it for a video sequence, the user would need to press the key for each frame. To adapt it for videos, I modified the method by adding another parameter: delay. I pass the value of this parameter to OpenCV's waitKey method to impose the waiting timeout:
@staticmethod
def display_image_with_detected_objects(image, inference_results, delay = 0):
    opencv.namedWindow(common.WINDOW_NAME, opencv.WINDOW_GUI_NORMAL)
    for i in range(len(inference_results)):
        current_result = inference_results[i]
        ImageHelper.draw_rectangle_and_label(image,
            current_result['rectangle'], current_result['label'])
    opencv.imshow(common.WINDOW_NAME, image)
    opencv.waitKey(delay)
By default, the delay is 0, which makes OpenCV's waitKey block indefinitely, so existing calls that expect the method to wait for a keypress continue to work unchanged.
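To make the two behaviors of the delay parameter concrete without opening a GUI window, here is a small sketch where a stub (invented for this illustration) replaces opencv.waitKey and merely reports how the real call would behave:

```python
# Illustration of the default-argument pattern behind the new delay
# parameter. In OpenCV, waitKey(0) blocks until a keypress, while a
# positive value waits at most that many milliseconds. A stub stands in
# for the GUI call so the dispatch logic is runnable here.
def wait_key_stub(delay):
    # Hypothetical stand-in for opencv.waitKey
    return 'block until keypress' if delay == 0 else f'wait up to {delay} ms'

def display_stub(image, inference_results, delay=0):
    # Mirrors the signature of display_image_with_detected_objects
    return wait_key_stub(delay)

print(display_stub(None, []))     # → block until keypress
print(display_stub(None, [], 5))  # → wait up to 5 ms
```

The key point is that callers written before the change, which pass only two arguments, still get the original blocking behavior.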
Putting Things Together
With all the components ready, we can put them together:
import sys
sys.path.insert(1, '../Part_03/')
from inference import Inference as model
from image_helper import ImageHelper as imgHelper
from video_reader import VideoReader as videoReader

if __name__ == "__main__":
    model_file_path = '../Models/01_model.tflite'
    labels_file_path = '../Models/02_labels.txt'
    ai_model = model(model_file_path, labels_file_path)
    video_file_path = '../Videos/01.mp4'
    video_reader = videoReader(video_file_path)
    score_threshold = 0.4
    detect_only_people = False
    delay_between_frames = 5
    while(True):
        frame = video_reader.read_next_frame()
        if(frame is None):
            break
        if(detect_only_people):
            results = ai_model.detect_people(frame, score_threshold)
        else:
            results = ai_model.detect_objects(frame, score_threshold)
        imgHelper.display_image_with_detected_objects(frame, results, delay_between_frames)
There are two switches here to control the script execution. First, there's the detect_only_people variable, which controls whether the script detects all objects (False) or only people (True). Second, there's the delay_between_frames variable, which controls the delay between frames and therefore the playback speed of the results preview. By default, I set it to 5 ms.
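A fixed 5 ms delay plays the video faster than real time. One possible refinement (not part of the article's code) is to derive the delay from the video's native frame rate; with OpenCV, the rate could be read via video_capture.get(opencv.CAP_PROP_FPS). Here, a literal 25 fps stands in so the arithmetic is runnable:

```python
# Derive the inter-frame delay from the frame rate so playback
# approximates real time. The fps value is a hypothetical example;
# in practice it would come from opencv.CAP_PROP_FPS.
fps = 25.0
delay_between_frames = max(1, int(1000 / fps))  # milliseconds per frame
print(delay_between_frames)  # → 40
```

The max(1, ...) guard matters because passing 0 to waitKey would block until a keypress instead of advancing to the next frame.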
Wrapping Up
In this article, we employed the MobileNet object detector to find people in a video sequence. After running the code, we note that the detections are not perfect: some people are not recognized, and this does not improve even if the detection score threshold is reduced. We will deal with this problem later by using a more robust object detector. But first, we will learn how to calculate the distance between people in images to check if they are too close.
Dawid Borycki is a software engineer and biomedical researcher with extensive experience in Microsoft technologies. He has completed a broad range of challenging projects involving the development of software for device prototypes (mostly medical equipment), embedded device interfacing, and desktop and mobile programming. Borycki is an author of two Microsoft Press books: “Programming for Mixed Reality” (2018) and “Programming for the Internet of Things” (2017).