tryolabs / norfair

Lightweight Python library for adding real-time multi-object tracking to any detector.

Home Page: https://tryolabs.github.io/norfair/

MotionEstimator with Reid

utility-aagrawal opened this issue · comments

Hi,

I am using MotionEstimator with ReID for my use case, face tracking. I referred to the example in camera_motion.py, and here is my code:

motion_estimator = MotionEstimator()
for i, cv2_frame in enumerate(video):
    if i % skip_period == 0:
        retinaface_detections = detect_faces(cv2_frame)
        detections = retinaface_detections_to_norfair_detections(
            retinaface_detections, track_points=track_points
        )

        frame = cv2_frame.copy()
        coord_transformation = motion_estimator.update(frame)
        for detection in detections:
            cut = get_cutout(detection.points, frame)
            if cut.shape[0] > 0 and cut.shape[1] > 0:
                detection.embedding = DeepFace.represent(
                    img_path=cut,
                    model_name=embed_model,
                    enforce_detection=False,
                    detector_backend="retinaface",
                )[0]["embedding"]  # set the embedding of the detection here
            else:
                detection.embedding = None

        tracked_objects = tracker.update(detections=detections, period=skip_period, coord_transformations=coord_transformation)
    else:
        tracked_objects = tracker.update()

I haven't really looked at what MotionEstimator does, but I wanted to quickly check whether this looks alright to you. My question is that I am applying a transformation to compensate for camera motion, but when I cut out the detections I am not applying any transformation to them. Is that okay? Let me know if you need further clarification. Thanks!

Hello! That code looks fine! When cutting out the detections you don't need to worry about the motion estimator: you are just extracting, from the current frame, the bounding box associated with a detection obtained by running the model on that same frame, so the camera movement doesn't matter at all. The way you are generating the embeddings is perfectly fine.
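For context, the get_cutout helper from the norfair demos is essentially just a plain crop of the current frame by the detection's bounding box; a rough sketch (the exact demo implementation may differ slightly):

def get_cutout(points, image):
    # Crop the rectangle enclosing the detection points from the current frame.
    # No coordinate transformation is involved: the points and the frame come from the same image.
    min_x, min_y = points.min(axis=0).astype(int)
    max_x, max_y = points.max(axis=0).astype(int)
    return image[min_y:max_y, min_x:max_x]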

Now, what I will say next is optional, but it might improve the coord_transformation variable returned by the motion_estimator. If you want, you can also try masking out the detections (or the tracked_objects) and provide that mask to the MotionEstimator.update method. Basically, create a mask with the same width and height as the frame (and a single channel) that is 1 everywhere except inside the detections (or tracked objects), where it is 0.

mask = np.ones(frame.shape[:2], frame.dtype)
for d in detections:
    bbox = d.points.astype(int)
    mask[bbox[0, 1] : bbox[1, 1], bbox[0, 0] : bbox[1, 0]] = 0

So the whole code would look something like this:

motion_estimator = MotionEstimator()
for i, cv2_frame in enumerate(video):
    if i % skip_period == 0:
        retinaface_detections = detect_faces(cv2_frame)
        detections = retinaface_detections_to_norfair_detections(
            retinaface_detections, track_points=track_points
        )

        frame = cv2_frame.copy()
       
        # here I am generating the mask from the detections (you can also use the tracked_objects if you want)
        mask = np.ones(frame.shape[:2], frame.dtype)
        for d in detections:
            bbox = d.points.astype(int)
            mask[bbox[0, 1] : bbox[1, 1], bbox[0, 0] : bbox[1, 0]] = 0

        # here I am passing that mask to the motion estimator
        coord_transformation = motion_estimator.update(frame, mask)

        for detection in detections:
            cut = get_cutout(detection.points, frame)
            if cut.shape[0] > 0 and cut.shape[1] > 0:
                detection.embedding = DeepFace.represent(
                    img_path=cut,
                    model_name=embed_model,
                    enforce_detection=False,
                    detector_backend="retinaface",
                )[0]["embedding"]  # set the embedding of the detection here
            else:
                detection.embedding = None

        tracked_objects = tracker.update(detections=detections, period=skip_period, coord_transformations=coord_transformation)
    else:
        tracked_objects = tracker.update()

The reason for this is that the MotionEstimator instance tries to estimate the movement of the camera based on the movement of a few sampled pixels. Ideally, those pixels should be picked from the background, since background objects only move due to the movement of the camera (for example, a wall, a table, or the corner of a room), and it is better to avoid picking objects that have intrinsic movement (in this case, the faces, since people can move their heads independently of the motion of the camera). That is what the mask does: it tells the MotionEstimator where to look (i.e., don't use the movement of the pixels inside a detection).
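As mentioned in the comment in the code above, you could also build the mask from the tracked objects instead of the raw detections. A minimal sketch, assuming your tracked objects carry two-point bounding boxes in their estimate attribute (the same layout as Detection.points):

mask = np.ones(frame.shape[:2], frame.dtype)
for obj in tracked_objects:
    # TrackedObject.estimate holds the filtered points of the object
    bbox = obj.estimate.astype(int)
    mask[bbox[0, 1] : bbox[1, 1], bbox[0, 0] : bbox[1, 0]] = 0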

Of course, this is just a suggestion of something you might want to try to see if it works better. Bear in mind that I haven't tried the code I wrote in this example, so tell me if you run into any problems with it.

@aguscas, you are the best! Makes sense. I'll try that, and thanks a lot for your help.