pixeli99 / TrackDiffusion

Official PyTorch implementation of TrackDiffusion (https://arxiv.org/abs/2312.00651)

Confusion about the pipeline

Zhaozeyong opened this issue · comments

Hi author, I'm very interested in this work, but after reading the paper I'm still confused about TrackDiffusion's pipeline. I don't understand how it goes from input to output, and Fig. 2 of the paper is also unclear to me. Could you please clarify when you have time?

Hi Zeyong,

Thanks for your interest in TrackDiffusion!
Here is the overall picture: our model generates video controlled by bboxes, meaning the position of each target in every frame is controlled by a set of box coordinates. Taking our pipeline code as an example, in this line, these are the actual parameters passed during inference:


We should focus primarily on these three parameters:

bbox_prompt: List[List[float]] = None,
video_masks: torch.FloatTensor = None,
seg_phrases: Union[str, List[str]] = None,

  1. bbox_prompt: the bbox coordinates, used to control the target positions.
  2. video_masks: identifies whether a target slot is padding. The total number of targets is fixed (20 targets); when the actual number of targets is smaller, we pad up to that count and must mask off the padded slots.
  3. seg_phrases: corresponds one-to-one with the bbox coordinates, giving the category of each bbox.
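To make the fixed-slot padding concrete, here is a minimal, hypothetical helper (the names `pad_targets` and `MAX_TARGETS` are illustrative, not from the repo) that pads a frame's boxes and phrases to the fixed count and builds the corresponding mask:

```python
# Illustrative sketch of the fixed-slot padding described above.
MAX_TARGETS = 20  # total number of target slots is fixed

def pad_targets(boxes, phrases):
    """Pad boxes/phrases to MAX_TARGETS slots and build a validity mask.

    boxes:   list of [x1, y1, x2, y2] coordinates, one per real target
    phrases: list of category strings, one per box
    Returns (padded_boxes, padded_phrases, mask) where mask[i] == 1.0
    marks a real target and 0.0 marks a padding slot to be masked off.
    """
    n = len(boxes)
    assert n <= MAX_TARGETS and n == len(phrases)
    padded_boxes = boxes + [[0.0, 0.0, 0.0, 0.0]] * (MAX_TARGETS - n)
    padded_phrases = phrases + [""] * (MAX_TARGETS - n)
    mask = [1.0] * n + [0.0] * (MAX_TARGETS - n)
    return padded_boxes, padded_phrases, mask
```

With one real target, the remaining 19 slots carry zero boxes, empty phrases, and a zero mask entry, so downstream attention can ignore them.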

The handling of these three parameters centers on this operation, which is consistent with GLIGEN:

if bbox_prompt is not None:
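As a rough sketch of what a GLIGEN-style grounding step looks like (this is my own illustrative module, not the repo's exact code): each (box, phrase-embedding) pair is turned into one grounding token by Fourier-embedding the box coordinates, concatenating them with the phrase embedding, and passing the result through an MLP, with padded slots zeroed by the mask:

```python
import torch
import torch.nn as nn

class GroundingTokenizer(nn.Module):
    """GLIGEN-style grounding tokens (illustrative sketch).

    Each (box, phrase-embedding) pair becomes one token; padding slots
    are zeroed out using the mask from video_masks.
    """
    def __init__(self, text_dim=768, fourier_freqs=8, hidden=512):
        super().__init__()
        # sin/cos Fourier features at `fourier_freqs` frequencies per coordinate
        self.freqs = 2.0 ** torch.arange(fourier_freqs)
        box_dim = 4 * fourier_freqs * 2
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + box_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, text_dim),
        )

    def forward(self, boxes, text_emb, mask):
        # boxes: (B, N, 4), text_emb: (B, N, text_dim), mask: (B, N)
        x = boxes.unsqueeze(-1) * self.freqs              # (B, N, 4, F)
        box_emb = torch.cat([x.sin(), x.cos()], -1).flatten(2)  # (B, N, 4*F*2)
        tokens = self.mlp(torch.cat([text_emb, box_emb], dim=-1))
        return tokens * mask.unsqueeze(-1)  # zero out padded slots
```

The resulting tokens are what the conditioning attention layers consume; masked slots contribute nothing.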

The above is regarding the input part.

During the forward process, we introduce an additional module called the instance enhancer. To summarize this module in one sentence:

It fuses the features of the same instance across the frame/time dimension and then lets the latent attend to their corresponding regions.

So the next questions are:

  1. Where do the features come from?
  2. How to attend?

Q1:
Since we know the specific location of each target in each frame, we can extract the target's features from the latent at that location, apply roi align (to fix the shape), and then fuse them through an attn layer.
Q2:
In our implementation, this is achieved with an additional cross-attn layer, where the query is the unet latent and the key and value are the features mentioned above.


Hope this information can be helpful. If you have any further questions, please feel free to contact us. Many thanks!

@Zhaozeyong Thank you for your interest in TrackDiffusion. If you do not have further questions, I would prefer to close this issue :)!