fuankarion / active-speakers-context

Code for the Active Speakers in Context Paper (CVPR2020)

Noticed bug in STE data augmentation (core/dataset.py#L158-L162)

btamm12 opened this issue · comments

In core/dataset.py#L158-L162

# random crop
width, height = video_data[0].size
f = random.uniform(0.5, 1)
i, j, h, w = RandomCrop.get_params(video_data[0], output_size=(int(height*f), int(width*f)))
video_data = [s.crop(box=(j, i, w, h)) for s in video_data]

[Source]

You pass the arguments (left, upper, width, height) into Image.crop(), but it expects (left, upper, right, lower). The result is that the training crops are smaller than intended.
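To make the size mismatch concrete, here is a minimal sketch (the 100×100 image and the crop parameters are made up for illustration) showing what Image.crop() returns for the buggy and the corrected box:

```python
from PIL import Image

# Hypothetical 100x100 frame, standing in for video_data[0].
img = Image.new("RGB", (100, 100))

# Suppose RandomCrop.get_params returned top=10, left=20, height=50, width=60.
i, j, h, w = 10, 20, 50, 60

# Buggy call: (left, upper, width, height) is interpreted as (left, upper, right, lower).
buggy = img.crop(box=(j, i, w, h))          # box = (20, 10, 60, 50)
print(buggy.size)                           # (40, 40) instead of (60, 50)

# Correct call: right = left + width, lower = top + height.
fixed = img.crop(box=(j, i, j + w, i + h))  # box = (20, 10, 80, 60)
print(fixed.size)                           # (60, 50), as intended
```

Note that PIL's `size` is (width, height), so the intended crop comes out as (60, 50).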

PyTorch's (torchvision) implementation is the following:

def crop(img: Image.Image, top: int, left: int, height: int, width: int) -> Image.Image:
    if not _is_pil_image(img):
        raise TypeError('img should be PIL Image. Got {}'.format(type(img)))

    return img.crop((left, top, left + width, top + height))

[Source]

where (top, left, height, width) is exactly the output of RandomCrop.get_params(). I would recommend using the following to avoid argument-conversion mistakes.

# random crop
width, height = video_data[0].size
f = random.uniform(0.5, 1)
crop_module = RandomCrop(size=(int(height*f), int(width*f)))
video_data = [crop_module(img) for img in video_data]

Visualization (first existing code, then fixed code)

Note: the exact position of the crop should be ignored; only the crop size is relevant. Also notice that the old crops are generally not square.

f == 0.98 : negligible
[image: old-f98]
[image: new-f98]

f == 0.52 : significant
[image: old-f52]
[image: new-f52]

I haven't run your code with this fix, so I don't know how much the results would improve (if at all).

Hi,

Thanks, we also discovered this bug recently. The effect on performance is minimal, but after fixing it the model does converge faster (it needs only 70-80 epochs). I'll post the updated code this week.