fuankarion / active-speakers-context

Code for the Active Speakers in Context Paper (CVPR2020)

Inference on my own data

JinmingZhao opened this issue

Hi, Thank you for sharing your code and models!

I need to use your code and models on my own video data for other tasks.
My videos are movie data. I would like to know how to prepare the utility csv file ava_activespeaker_val_augmented.csv as the model's input.

Thanks, and looking forward to your reply!

Hi,

The process is a bit time consuming for arbitrary video data.

The first thing to do is extract the face crops from the video data (locate the faces and actually crop them into smaller images), then build tracklets from these crops. A tracklet is a collection of face crops from the same identity that is continuously visible in the video.

Ideally a tracklet contains all the face crops of a visible person in a scene. The first element of the tracklet would be the face crop from the frame where the person first appears on camera; you would keep locating the face until the final element, which would be the face crop at the very last frame the person is visible. You can relax this condition and just make sure that the tracklet contains face crops from the same identity.
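If it helps, here is a minimal sketch of one way to group per-frame face detections into tracklets. The greedy IoU matcher, its 0.5 threshold, and the data layout are assumptions for illustration only, not part of this repo's pipeline; any face tracker that keeps the same identity together works just as well.

```python
# Hypothetical sketch: group per-frame face boxes into tracklets by greedy IoU matching.
# Boxes are (x1, y1, x2, y2); detections is {frame_timestamp: [box, ...]}.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def build_tracklets(detections, iou_thresh=0.5):
    tracklets, active = [], []  # active: (last_box, list of (timestamp, box)) pairs
    for ts in sorted(detections):
        next_active = []
        for box in detections[ts]:
            # Attach the box to the best-overlapping active tracklet, else start a new one.
            best = max(active, key=lambda t: iou(t[0], box), default=None)
            if best is not None and iou(best[0], box) >= iou_thresh:
                active.remove(best)
                best[1].append((ts, box))
                next_active.append((box, best[1]))
            else:
                next_active.append((box, [(ts, box)]))
        tracklets.extend(crops for _, crops in active)  # tracklets with no match end here
        active = next_active
    tracklets.extend(crops for _, crops in active)
    return tracklets
```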

After establishing the tracklets, give each one some sort of unique identifier (it can be anything) and then extract the audio track that exactly matches the temporal span of each tracklet (some audio tracks might overlap; that's not an issue). There is a hedged ffmpeg sketch for this step after the directory layout below. Make a directory structure that looks like this:

ASC
  Image Data
    tracklet_id_1
      face_crop_1
      ...
      face_crop_n
    tracklet_id_2
      face_crop_1
      ...
      face_crop_n
    ...
    tracklet_id_n
      face_crop_1
      ...
      face_crop_n
  Audio data
    audio_tracklet_id_1
    audio_tracklet_id_2
    ...
    audio_tracklet_id_n
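As mentioned above, here is a minimal sketch of cutting out the audio segment that spans one tracklet. The use of ffmpeg, the 16 kHz mono output, and all paths are assumptions; any tool that cuts audio by start/end time will do.

```python
# Hypothetical sketch: extract the audio segment spanning one tracklet with ffmpeg.
import subprocess

def extract_audio_segment(video_path, start_sec, end_sec, out_wav):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,
        "-ss", str(start_sec),       # tracklet start time in seconds
        "-to", str(end_sec),         # tracklet end time in seconds
        "-vn",                       # drop the video stream
        "-ac", "1", "-ar", "16000",  # mono, 16 kHz (an assumption, not a repo requirement)
        out_wav,
    ], check=True)

# e.g. extract_audio_segment("movie.mp4", 12.4, 15.9, "ASC/Audio data/audio_tracklet_id_1.wav")
```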

After this, the csv is fairly straightforward: each line describes an individual face crop.

video_id: can be anything, just be consistent and assign a different id to each video.
frame_timestamp: timestamp of the frame the crop was extracted from.
entity_box_x1: ignored
entity_box_y1: ignored
entity_box_x2: ignored
entity_box_y2: ignored
label: either 'NOT_SPEAKING' or 'SPEAKING_AUDIBLE'
entity_id: use the id of the tracklet here.
label_id: 0 for 'NOT_SPEAKING', 1 for 'SPEAKING_AUDIBLE'
instance_id: see below

The instance_id is required (and quite relevant) for training ASC, but it is ignored at inference time; you can just replicate the label_id value.
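For reference, a minimal sketch of writing such a csv with Python's csv module. The video id, timestamps, output file name, and the -1 placeholders for the ignored box columns are made-up values for illustration; the instance_id column simply replicates label_id, as suggested above.

```python
# Hypothetical sketch: one csv row per face crop, columns in the order described above.
import csv

rows = [
    # video_id, frame_timestamp, x1, y1, x2, y2, label, entity_id, label_id, instance_id
    ("my_movie_01", 12.48, -1, -1, -1, -1, "SPEAKING_AUDIBLE", "tracklet_id_1", 1, 1),
    ("my_movie_01", 12.52, -1, -1, -1, -1, "SPEAKING_AUDIBLE", "tracklet_id_1", 1, 1),
    ("my_movie_01", 12.48, -1, -1, -1, -1, "NOT_SPEAKING",     "tracklet_id_2", 0, 0),
]

with open("ava_activespeaker_val_augmented.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```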