pixeli99 / TrackDiffusion

Official PyTorch implementation of TrackDiffusion (https://arxiv.org/abs/2312.00651)

Confusion about the pipeline

Zhaozeyong opened this issue · comments

Hi author, I'm very interested in this work, but after reading the paper I'm still confused about TrackDiffusion's pipeline. I don't understand how it goes from input to output, and Fig. 2 of the paper is also unclear to me. Could you please clarify when you have time?

Hi Zeyong,

Thanks for your interest in TrackDiffusion!
Here is the overall picture: our model generates video controlled by bboxes, meaning the position of each target in every frame is controlled by a set of box coordinates. Taking our pipeline code as an example, in this line, these are the actual parameters passed during inference:


We should focus primarily on these three parameters:

bbox_prompt: List[List[float]] = None,
video_masks: torch.FloatTensor = None,
seg_phrases: Union[str, List[str]] = None,

  1. bbox_prompt: the bbox coordinates, used to control the target positions.
  2. video_masks: identifies whether a target slot is padding. The total number of targets is fixed (20 targets); when the actual number of targets is smaller, we pad up to that count and must mask off the padded slots.
  3. seg_phrases: corresponds one-to-one with the bbox coordinates, giving the category of each bbox.
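To make the fixed-slot padding concrete, here is a minimal, hypothetical helper (the names `pad_targets` and `MAX_TARGETS` are illustrative, not from the repo) that pads a frame's boxes and phrases to the fixed count and builds the corresponding mask:

```python
# Illustrative sketch of the fixed-slot padding described above.
MAX_TARGETS = 20  # total number of target slots is fixed

def pad_targets(boxes, phrases):
    """Pad boxes/phrases to MAX_TARGETS slots and build a validity mask.

    boxes:   list of [x1, y1, x2, y2] coordinates, one per real target
    phrases: list of category strings, one per box
    Returns (padded_boxes, padded_phrases, mask) where mask[i] == 1.0
    marks a real target and 0.0 marks a padding slot to be masked off.
    """
    n = len(boxes)
    assert n <= MAX_TARGETS and n == len(phrases)
    padded_boxes = boxes + [[0.0, 0.0, 0.0, 0.0]] * (MAX_TARGETS - n)
    padded_phrases = phrases + [""] * (MAX_TARGETS - n)
    mask = [1.0] * n + [0.0] * (MAX_TARGETS - n)
    return padded_boxes, padded_phrases, mask
```

With one real target, the remaining 19 slots carry zero boxes, empty phrases, and a zero mask entry, so downstream attention can ignore them.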

The handling of these three parameters centers on this operation, which is consistent with GLIGEN:

if bbox_prompt is not None:
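As a rough sketch of what a GLIGEN-style grounding step looks like (this is my own illustrative module, not the repo's exact code): each (box, phrase-embedding) pair is turned into one grounding token by Fourier-embedding the box coordinates, concatenating them with the phrase embedding, and passing the result through an MLP, with padded slots zeroed by the mask:

```python
import torch
import torch.nn as nn

class GroundingTokenizer(nn.Module):
    """GLIGEN-style grounding tokens (illustrative sketch).

    Each (box, phrase-embedding) pair becomes one token; padding slots
    are zeroed out using the mask from video_masks.
    """
    def __init__(self, text_dim=768, fourier_freqs=8, hidden=512):
        super().__init__()
        # sin/cos Fourier features at `fourier_freqs` frequencies per coordinate
        self.freqs = 2.0 ** torch.arange(fourier_freqs)
        box_dim = 4 * fourier_freqs * 2
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + box_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, text_dim),
        )

    def forward(self, boxes, text_emb, mask):
        # boxes: (B, N, 4), text_emb: (B, N, text_dim), mask: (B, N)
        x = boxes.unsqueeze(-1) * self.freqs              # (B, N, 4, F)
        box_emb = torch.cat([x.sin(), x.cos()], -1).flatten(2)  # (B, N, 4*F*2)
        tokens = self.mlp(torch.cat([text_emb, box_emb], dim=-1))
        return tokens * mask.unsqueeze(-1)  # zero out padded slots
```

The resulting tokens are what the conditioning attention layers consume; masked slots contribute nothing.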

The above is regarding the input part.

During the forward process, we introduce an additional module called the instance enhancer. To summarize this module in one sentence:

It fuses the features of the same instance across the frame/time dimension and then lets the latent attend to their corresponding regions.

So the next questions are:

  1. Where do the features come from?
  2. How to attend?

Q1:
Since we know the specific location of each target in each frame, we can extract the target's features from the latent at that location, apply roi align (to fix the shape), and then fuse them through an attn layer.
Q2:
In our implementation, this is achieved with an additional cross-attn layer, where the query is the unet latent and the key and value are the features mentioned above.


Hope this information can be helpful. If you have any further questions, please feel free to contact us. Many thanks!

@Zhaozeyong Thank you for your interest in TrackDiffusion. If you do not have further questions, I would prefer to close this issue :)!