m-bain / frozen-in-time

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV'21]

Home Page: https://arxiv.org/abs/2104.00650


"img should be PIL Image" when fine-tuning on MSR-VTT

bryant1410 opened this issue · comments

I got the following error when trying to run python train.py --config configs/msrvtt_4f_i21k.json (as in the README):

  File "***/base/base_dataset.py", line 107, in __getitem__
    imgs = self.transforms(imgs)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 60, in __call__
    img = t(img)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 195, in __call__
    return F.resize(img, self.size, self.interpolation)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 229, in resize
    raise TypeError('img should be PIL Image. Got {}'.format(type(img)))
TypeError: img should be PIL Image. Got <class 'torch.Tensor'>

(I set up the env as described in the README)

It seems the frames are returned as torch tensors, but the transforms expect a PIL Image:

frames = torch.stack([frames[idx] for idx in frame_idxs]).float() / 255
frames = frames.permute(0, 3, 1, 2)
return frames, frame_idxs

frames = torch.stack(frames).float() / 255
cap.release()
return frames, success_idxs

If I add a transforms.ToPILImage() before (and a transforms.ToTensor() after) here:

'val': transforms.Compose([
    transforms.Resize(center_crop),
    transforms.CenterCrop(center_crop),
    transforms.Resize(input_res),
    normalize,
]),

it still doesn't work, because ToPILImage expects a single image, not a stack of frames. This also makes me think these transforms wouldn't work on multiple PIL images anyway.

Seems like the transforms are the incorrect ones? Or am I missing something?
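For what it's worth, a per-frame workaround along the lines described above could in principle work under a PIL-only torchvision: split the clip into frames, run the single-image pipeline on each, and restack. This is only a minimal sketch, not code from the repo; `apply_per_frame` and the toy transform are hypothetical stand-ins for the real ToPILImage → Resize → CenterCrop → ToTensor composition.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def apply_per_frame(transform: Callable[[A], B], frames: List[A]) -> List[B]:
    """Apply a single-image transform to each frame of a clip.

    PIL-only transforms (as in older torchvision) accept one image at a
    time, so a clip has to be split, transformed frame by frame, and
    restacked afterwards (e.g. with torch.stack in the real pipeline).
    """
    return [transform(frame) for frame in frames]

# Toy usage standing in for the per-image transform pipeline:
halved = apply_per_frame(lambda x: x / 2, [2.0, 4.0, 6.0])
# halved == [1.0, 2.0, 3.0]
```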

Wait, I think it's because I ended up with an older torchvision version. Let me check that before digging into this issue any further.

Confirmed.
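For anyone hitting the same error: to my understanding, tensor inputs to transforms like Resize/CenterCrop landed around torchvision 0.8, so a quick version gate like the sketch below can turn the confusing TypeError into an actionable message. The helpers are my own, not part of the repo, and the 0.8 boundary is an assumption worth double-checking against the pinned requirements.

```python
def version_tuple(v: str) -> tuple:
    # "0.8.1+cu101" -> (0, 8, 1); local suffixes after "+" are ignored
    core = v.split("+")[0]
    return tuple(int(p) for p in core.split(".") if p.isdigit())

def supports_tensor_transforms(torchvision_version: str) -> bool:
    # torchvision >= 0.8 is (to my knowledge) when common transforms
    # started accepting torch tensors in addition to PIL Images
    return version_tuple(torchvision_version) >= (0, 8)

# Hypothetical startup guard:
# assert supports_tensor_transforms(torchvision.__version__), \
#     "these transforms operate on tensors; upgrade torchvision to >= 0.8"
```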