facebookresearch / co-tracker

CoTracker is a model for tracking any point (pixel) on a video.

Home Page: https://co-tracker.github.io/

The result of tracking point from the middle of video is not precise

ernestchu opened this issue

Hi, thanks for your great work. When I tried your notebook demo, I ran into some ambiguities when tracking manually selected points.

import torch

# Each query is (frame number, x, y).
queries = torch.tensor([
    [0., 400., 350.],  # point tracked from the first frame
    [10., 600., 500.], # frame number 10
    [20., 750., 600.], # frame number 20
    [30., 900., 200.]  # frame number 30
])

Let's say we are interested in queries[1], which refers to a point in the 10th frame, so I would expect the model to output a trajectory of all (0, 0) and a visibility of False for timestamps 0 to 9. However, when inspecting pred_visibility, the expected behavior only appears at the first four timestamps. (The same problem also occurs in pred_tracks.)

pred_visibility[:, :, 1]

tensor([[False, False, False, False,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         False, False, False,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True]], device='cuda:0')

Why is that? Thanks!

Hi @ernestchu, thank you for your question!
The model works with sliding windows. As soon as the frame of interest (in this case, the 10th frame) falls within a particular sliding window, the model begins providing visibility predictions for that point throughout the entire window. The sliding window has a size of 8 frames with an overlap of 4 frames, so consecutive windows start 4 frames apart and frame number 10 first falls within the second sliding window, which covers frames 4 to 11. This explains why the visibility is set to "False" only for the first four timestamps in this case (the same is true for trajectories). You can simply discard these predictions if you don't need them.
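The window arithmetic above can be sketched in a few lines of plain Python. This is only an illustration under the numbers given in this reply (window size 8, overlap 4); the helper names `first_window_start` and `valid_from_query` are hypothetical and are not part of the CoTracker API.

```python
import math


def first_window_start(query_frame, window_len=8, overlap=4):
    """Start frame of the first sliding window that contains query_frame.

    Assumes windows start at 0, stride, 2 * stride, ... where
    stride = window_len - overlap (values taken from the reply above).
    """
    stride = window_len - overlap
    # Smallest k such that k * stride + window_len > query_frame.
    k = max(0, math.ceil((query_frame - window_len + 1) / stride))
    return k * stride


def valid_from_query(num_frames, query_frame):
    """Boolean mask that is True only from the query frame onward.

    Hypothetical helper for discarding predictions made before the query.
    """
    return [t >= query_frame for t in range(num_frames)]


if __name__ == "__main__":
    for q in (0, 10, 20, 30):
        print(q, first_window_start(q))
```

For the query on frame 10 this gives a window start of 4, which matches the four leading False values in the output above. To discard the earlier predictions, one option is to apply a mask like `valid_from_query` per query point, for example by setting `pred_visibility[:, :10, 1] = False` for that query.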

Thanks for your detailed response!