In the ever-evolving realm of computer vision and artificial intelligence, object tracking is a pivotal concept with diverse applications, from autonomous vehicles to surveillance systems. YOLOv8, a state-of-the-art real-time object detection framework, has gained significant attention. In this blog post, we explore the world of YOLOv8 object tracking, showcasing its capabilities and adding intelligence by analyzing tracked object statistics.
Our Python-based project harnesses YOLOv8's power for highly accurate real-time object tracking. But we go a step further by examining tracked objects' movements, measuring distances traveled in pixels, and calculating average speeds. This approach offers a comprehensive understanding of how these objects behave in their environment.
Whether you're a computer vision enthusiast, a developer looking to add object tracking to your applications, or someone intrigued by AI, this post aims to inspire and educate. We dive into YOLOv8's potential, the technical details of object tracking, and how to gain insights into tracked object motion. By the end, you'll have the knowledge and tools to implement your own object tracking solutions and a deeper understanding of the motion within your videos and images.
For our object tracking project, we rely on two main libraries:
- OpenCV: Used for opening video streams, frame drawing, and more. It's a versatile open-source software library for computer vision and image processing, making it valuable for object detection, facial recognition, image stitching, and motion tracking. OpenCV's popularity stems from its efficiency, ease of use, and extensive community support, making it a preferred choice in various fields, including robotics, machine learning, and computer vision.
- Ultralytics: An open-source software framework focused on computer vision and deep learning. It streamlines the development of object detection, image classification, and other machine learning tasks. Ultralytics is popular for its user-friendly nature, comprehensive documentation, and seamless integration with PyTorch, a leading deep learning framework. It simplifies complex tasks like training and deploying neural networks for applications like autonomous vehicles, surveillance systems, and medical image analysis, earning recognition as an essential resource in the deep learning and computer vision community.
We also utilize supporting libraries like numpy, and you can find a complete list of requirements here.
Without further ado, let's begin building our system. Our system will consist of two main building blocks:
- An object detection model which will accept a frame and perform object detection and tracking on the given frame
- A main loop which will fetch frames from a video stream, feed them into our detection system defined above, annotate the frames, and show them to our user
Let's start with our object detection model, which is defined in detector.py. As is customary with any Python project, let's import our required libraries:
# For machine learning
import torch
# For array computations
import numpy as np
# For image decoding / editing
import cv2
# For environment variables
import os
# For detecting which ML Devices we can use
import platform
# For actually using the YOLO models
from ultralytics import YOLO
As discussed above, the two most important libraries we will be using are OpenCV and Ultralytics. NumPy will also be used heavily for array operations (as images are really just arrays of pixel values).
We can then jump into our class definition, __init__, and some supporting functions:
class YoloV8ImageObjectDetection:
    def __init__(self, model_path="yolov8n.pt", conf_threshold=0.50):
        """Initializes a YOLOv8 detector

        Arguments:
            model_path (str): A path to a pretrained model file, or the name
                of a pretrained model for Ultralytics to download
            conf_threshold (float): Confidence threshold for detections

        Default Model Supports The Following:
        {
            0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle',
            4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck',
            8: 'boat', 9: 'traffic light', 10: 'fire hydrant',
            11: 'stop sign', 12: 'parking meter', 13: 'bench',
            14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse',
            18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear',
            22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella',
            26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee',
            30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite',
            34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard',
            37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass',
            41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana',
            47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot',
            52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair',
            57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table',
            61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote',
            66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven',
            70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock',
            75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier',
            79: 'toothbrush'
        }
        """
        self.conf_threshold = conf_threshold
        self.model = self._load_model(model_path)
        self.device = self._get_device()
        self.classes = self.model.names

    def _load_model(self, model_path):
        """Loads a YOLOv8 model from a path on disk or by pretrained model name

        Arguments:
            model_path (str): A path to a pretrained model file, or the name
                of a pretrained model for Ultralytics to download
        Returns:
            model (YOLO): The loaded Ultralytics YOLO model
        """
        model = YOLO(model_path)
        return model

    def _get_device(self):
        """Gets best device for your system

        Returns:
            device (str): The device to use for YOLO for your system
        """
        # Only pick "mps" on macOS when PyTorch reports it as usable
        if platform.system().lower() == "darwin" and torch.backends.mps.is_available():
            return "mps"
        if torch.cuda.is_available():
            return "cuda"
        return "cpu"
Our __init__ function takes two parameters:
- model_path: The path to a pretrained model (checkpoint file). If you use the default, Ultralytics downloads the pretrained yolov8n.pt weights automatically when they are not already on disk.
- conf_threshold: Our confidence threshold for detections. For example, with a threshold of 0.5, our model will only show and annotate detections that have a confidence of 50% or higher. Anything lower will be ignored.
Our __init__ function then loads our model by instantiating a new YOLO object with the model_path parameter. It then uses platform detection to see if either mps or cuda is available on your system. Either of those will be much faster than the default cpu. As we exit our __init__, our model has been loaded, our confidence threshold set, and our class names defined. Now we are ready to move onward.
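To make the confidence threshold concrete, here is a minimal sketch in plain Python. The filter_by_confidence helper and the sample detections are hypothetical and exist only for illustration; in the project itself, YOLO applies the threshold internally through the conf argument.

```python
def filter_by_confidence(detections, conf_threshold=0.50):
    """Keep only detections whose confidence meets the threshold.

    `detections` is a list of (label, confidence) pairs. This is a
    hypothetical stand-in for the filtering YOLO does internally.
    """
    return [(label, conf) for label, conf in detections if conf >= conf_threshold]


# Made-up detections, for illustration only
sample = [("cell phone", 0.91), ("remote", 0.42), ("cell phone", 0.50)]
kept = filter_by_confidence(sample, conf_threshold=0.50)
print(kept)
```

Raising the threshold trades missed detections for fewer false positives, so it is worth tuning against your own footage.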
To perform detections, our detector has three other methods:
- is_detectable: Sees if a requested class is detectable by our model
- classname_to_id: Translates a string classname to its integer ID
- detect: Performs object tracking and detection
is_detectable and classname_to_id are helper functions, and we will omit them from this discussion because they are relatively simple. detect, on the other hand, is shown in full below:
    def detect(self, frame, classname):
        """Analyze a frame using a YOLOv8 model to find any of the classes
        in question

        Arguments:
            frame (numpy.ndarray): The frame to analyze (from cv2)
            classname (str): Class name to search our model for
        Returns:
            plotted (numpy.ndarray): Frame with bounding boxes and labels plotted on it.
            boxes (torch.Tensor): A set of bounding boxes
            tracks (list): A list of box IDs
        """
        looking_for = self.classname_to_id(classname)
        results = self.model.track(frame, persist=True, conf=self.conf_threshold, classes=[looking_for])
        plotted = results[0].plot()
        boxes = results[0].boxes.xywh.cpu()
        # boxes.id is None when nothing is being tracked in this frame.
        # Compare against None explicitly: the truth value of a tensor
        # with more than one element is ambiguous and raises an error.
        ids = results[0].boxes.id
        tracks = ids.int().cpu().tolist() if ids is not None else []
        return plotted, boxes, tracks
The detect function takes two parameters, the frame to analyze and a class to look for. Note that, in our project, we only want to track one class of object at a time, hence this argument.
We first translate the string classname to its integer id and then call our YOLO model using the track() method. We pass the frame, confidence, and single class that we are looking for into the track() method, which returns a list of YOLO Result objects. We then use Result.plot() to plot the bounding boxes onto the frame, get the x, y, width, and height of the bounding boxes, and finally get the ids of the bounding boxes. The coordinates of our bounding boxes will be used to draw our tracks, while the box ids will be used to keep a record of which track belongs to which box. Finally, we return our plotted frame, box coordinates, and box ids (track ids) back to the caller.
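Because the tracks are drawn through box centers, it can help to see how a center-based (x, y, w, h) box relates to its corners. The sketch below is plain Python with a hypothetical helper name, and it assumes the xywh values are the center coordinates plus width and height, which matches how the main loop treats x, y as the center point.

```python
def xywh_to_corners(x, y, w, h):
    """Convert a center-based box to (left, top, right, bottom).

    Assumes (x, y) is the box center, as the tracking loop does.
    """
    half_w, half_h = w / 2.0, h / 2.0
    return (x - half_w, y - half_h, x + half_w, y + half_h)


# A made-up 100x50 box centered at (320, 240)
box = (320.0, 240.0, 100.0, 50.0)
print(xywh_to_corners(*box))
```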
We are now ready to use our detector in a main loop or to run through video frames.
Our main loop will do a few things:
- Register an object to track from our user
- Instantiate our object detector
- Open a video input and output stream
- Read frames from our input stream, feed them to our detector
- Show the analyzed frames to the user and write them to our output stream
Let's start, as always, with our imports:
# Using defaultdict so we don't have to
# do "if key in dict" checks
from collections import defaultdict
# For distance calculations
import math
# For opening, reading, and writing video frames
import cv2
# For array operations
import numpy as np
# Our custom detector
from detector import YoloV8ImageObjectDetection
We can then enter our main function:
def main():
    to_track = input("What would you like to track? ").strip().lower()
    # The YOLOv8 detection wrapper we will use
    # to analyze frames
    detector = YoloV8ImageObjectDetection()
    if not detector.is_detectable(to_track):
        raise ValueError(f"Error: My detector does not know how to detect {to_track}!")
The first thing we do is ask the user for an object they would like to track. In this demo, I'll be using a cell phone, which is in the default YOLO model. We then do a quick sanity check to make sure that our model can actually detect the object from the user.
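The sanity check relies on is_detectable, which was left out of the detector walkthrough. As a rough idea of what such helpers could look like, here is a sketch written as standalone functions over an {id: name} mapping. This is an assumption for illustration, not the repository's actual implementation, where both are methods on the detector.

```python
def classname_to_id(names, classname):
    """Return the integer ID for a class name, or None if it is unknown.

    `names` is an {id: name} mapping like the one listed in the
    detector's docstring.
    """
    for class_id, name in names.items():
        if name == classname:
            return class_id
    return None


def is_detectable(names, classname):
    """True if the class mapping contains the requested name."""
    # Compare against None so that class ID 0 ('person') still counts
    return classname_to_id(names, classname) is not None


# A small slice of the default class mapping
names = {0: "person", 67: "cell phone", 73: "book"}
print(is_detectable(names, "cell phone"))
print(classname_to_id(names, "cell phone"))
print(is_detectable(names, "unicorn"))
```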
We then open our video streams using OpenCV:
    # Create a video capture instance.
    # VideoCapture(0) corresponds to your computer's
    # webcam
    cap = cv2.VideoCapture(0)
    # Let's grab the frames-per-second (FPS) of the
    # webcam so our output has a similar FPS.
    # Let's also grab the height and width so our
    # output is the same size as the webcam
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Now let's create the video writer. We will
    # write our processed frames to this object
    # to create the processed video.
    out = cv2.VideoWriter(
        'outpy.avi',
        cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'),
        fps,
        (frame_width, frame_height)
    )
    cv2.namedWindow('Video')
cv2.VideoCapture(0) will, on most computers, open the onboard webcam, which we will use as our input stream. We then grab a few details from the device so that our input stream and output video have the same frame rate and size.
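One practical caveat: some webcams and capture backends can report an FPS of 0, which would leave the video writer with an unusable frame rate. A minimal guard could look like the sketch below. The safe_fps name and the 30 FPS default are assumptions for illustration and are not part of the project.

```python
def safe_fps(reported_fps, default=30.0):
    """Return a usable frame rate, falling back when the reported one is not.

    `reported_fps` would be the value from cap.get(cv2.CAP_PROP_FPS).
    """
    if reported_fps is None or reported_fps <= 0:
        return default
    return float(reported_fps)


print(safe_fps(0))     # falls back to the default
print(safe_fps(24))    # keeps the reported value
```

In the main function, this would wrap the value read from the capture device before it is handed to cv2.VideoWriter.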
Now it's time to read, analyze, and write our frames:
    # The object tracks we have seen before
    track_history = defaultdict(list)
    # Previous box center, a frame count, and distance/speed
    # variables
    prev = None
    count = 1
    dist = 0
    speed = 0
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()
        if not ret:
            continue
        # Use our detector to plot the bounding boxes on the frame,
        # give us our bounding boxes, and our object tracks
        frame, boxes, track_ids = detector.detect(frame, to_track)
        # For each bounding box and track we found,
        # we can calculate the box center and draw it and
        # the track on the screen. Tracks will be represented
        # as polylines created from our track
        for box, track_id in zip(boxes, track_ids):
            x, y, w, h = box
            track = track_history[track_id]
            track.append((float(x), float(y)))  # x, y center point
            if len(track) > 60:  # Only hold the most recent 60 points
                track.pop(0)
            # Draw the tracking lines
            points = np.hstack(track).astype(np.int32).reshape((-1, 1, 2))
            cv2.polylines(frame, [points], isClosed=False, color=(230, 230, 230), thickness=10)
            # Add the distance between the previous box center and this box center
            # to help us keep track of the total pixel distance
            if prev:
                dist += math.hypot(float(x) - float(prev[0]), float(y) - float(prev[1]))
            # Update our previous pointer
            prev = (float(x), float(y))
        # Calculate speed as total pixel distance / number of frames so that
        # we get average pixels moved / frame
        speed = dist / count
        count += 1
        cv2.putText(frame, f"Distance Covered (pixels): {dist:.5f}", (0, 30), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 0), 4)
        cv2.putText(frame, f"Average Speed (pixels/frame): {speed:.5f} ", (0, 60), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 0), 4)
        # Write to our output file
        out.write(frame)
        # Show the frame
        cv2.imshow('Video', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
First, we read each frame by calling cap.read(). This reads a single frame from our camera device. We then pass that frame and our desired class into our detector, which returns an annotated frame, the bounding box coordinates, and the track ids (box ids).
Now that we have our bounding boxes and ids, we can loop through them together. For each bounding box we find, we look up its id in our track_history dictionary. If the id is already there, this is not a new track, and we keep appending points to it; if it is not, the defaultdict creates an empty track for us. We keep only the 60 most recent points per track, convert them into a numpy array, and draw them on our annotated frame with cv2.polylines.
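As an aside, the "only hold the most recent 60 points" behavior can also be expressed with collections.deque and its maxlen argument, which drops the oldest entry automatically. This is an alternative sketch rather than what the project does, and MAX_POINTS and record_point are names made up for illustration.

```python
from collections import defaultdict, deque

MAX_POINTS = 60  # same cap as the main loop uses

# Each new track ID gets its own bounded deque on first access
track_history = defaultdict(lambda: deque(maxlen=MAX_POINTS))


def record_point(track_id, x, y):
    """Append a center point to a track; old points fall off the front."""
    track = track_history[track_id]
    track.append((float(x), float(y)))
    return track


# Simulate 100 frames for a single track
for i in range(100):
    record_point(1, i, i * 2)

print(len(track_history[1]))   # never grows past MAX_POINTS
print(track_history[1][0])     # oldest point still held
```

If you try this in the main loop, converting the deque with list(track) before handing it to NumPy keeps the drawing code unchanged.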
The last thing we have to do is calculate the distance moved from our last known track point using math.hypot. Finally, we update our prev pointer to the new last-seen coordinates.
Before exiting this loop iteration, we also calculate our average speed, defined as total pixels traveled / number of frames, and then show our statistics on the frame using cv2.putText.
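The distance and speed bookkeeping can be replayed without a camera. The sketch below feeds a made-up list of per-frame box centers through the same arithmetic as the loop, assuming a single tracked object; path_stats is a hypothetical helper, and None marks a frame with no detection.

```python
import math


def path_stats(centers):
    """Replay per-frame box centers and return (distance, average speed).

    `centers` has one entry per frame: an (x, y) tuple when the object
    was detected, or None when it was not.
    """
    prev = None
    dist = 0.0
    count = 1
    speed = 0.0
    for center in centers:
        if center is not None:
            if prev is not None:
                # Straight-line pixel distance from the last known center
                dist += math.hypot(center[0] - prev[0], center[1] - prev[1])
            prev = center
        # Average pixels moved per frame, counting frames with no detection
        speed = dist / count
        count += 1
    return dist, speed


# Three frames of movement, then two frames with the object off screen
frames = [(0, 0), (3, 4), (6, 8), None, None]
dist, speed = path_stats(frames)
print(dist, speed)
```

Because the frame count keeps growing while the distance stays flat, the average drops from about 3.33 to 2.0 pixels per frame once the object leaves the frame.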
So, to summarize:
- While running...
  - We read each frame with cap.read()
  - We pass the frame into our detector with detector.detect()
  - We loop through all of the bounding boxes and draw the track points with cv2.polylines()
  - We calculate our total distances and average speeds
  - We write the statistics on the frame with cv2.putText()
At this point, we have completed the implementation stage of our project and we are ready to run it.
Please make sure to install the necessary requirements first (with pip, poetry, or a similar tool, ideally inside a virtual environment).
After that, it's as simple as running the commands below:
prompt> python yolotracker/main.py
What would you like to track? cell phone
0: 384x640 (no detections), 42.4ms
Speed: 2.0ms preprocess, 42.4ms inference, 0.4ms postprocess per image at shape (1, 3, 384, 640)
0: 384x640 (no detections), 46.4ms
Speed: 1.6ms preprocess, 46.4ms inference, 0.7ms postprocess per image at shape (1, 3, 384, 640)
0: 384x640 (no detections), 45.6ms
Speed: 1.4ms preprocess, 45.6ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)
0: 384x640 (no detections), 43.8ms
Speed: 1.6ms preprocess, 43.8ms inference, 0.6ms postprocess per image at shape (1, 3, 384, 640)
You can see an output video below:
You can see that when the phone is on screen, a few things happen:
- Very clearly, we append track points to the screen and track the center point of the bounding box
- Our distance traveled keeps increasing
- Our speed averages keep changing, depending on how fast we're moving
When the phone exits the screen, we see our speed begin to drop because the phone is no longer "moving" to the system. When the phone re-enters the screen, our tracks pick up right where we left off!
Thanks for reading along! Please visit the GitHub repo for all of the code!