m-bain / frozen-in-time

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV'21]

Home Page: https://arxiv.org/abs/2104.00650


"img should be PIL Image" when fine-tuning on MSR-VTT

bryant1410 opened this issue · comments

I got the following error when trying to run python train.py --config configs/msrvtt_4f_i21k.json (as in the README):

  File "***/base/base_dataset.py", line 107, in __getitem__
    imgs = self.transforms(imgs)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 60, in __call__
    img = t(img)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 195, in __call__
    return F.resize(img, self.size, self.interpolation)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 229, in resize
    raise TypeError('img should be PIL Image. Got {}'.format(type(img)))
TypeError: img should be PIL Image. Got <class 'torch.Tensor'>

(I set up the env as described in the README)

It seems the frames are returned as torch tensors, but the transforms expect a PIL Image:

frames = torch.stack([frames[idx] for idx in frame_idxs]).float() / 255
frames = frames.permute(0, 3, 1, 2)
return frames, frame_idxs

frames = torch.stack(frames).float() / 255
cap.release()
return frames, success_idxs

If I add a transforms.ToPILImage() before (and a transforms.ToTensor() after) here:

'val': transforms.Compose([
    transforms.Resize(center_crop),
    transforms.CenterCrop(center_crop),
    transforms.Resize(input_res),
    normalize,
]),

it still doesn't work, because ToPILImage expects a single image, not a stack of frames. This also makes me think these transforms wouldn't work on multiple PIL images anyway.

Seems like the transforms are the incorrect ones? Or am I missing something?
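For what it's worth, a per-frame workaround along the lines described above could in principle work under a PIL-only torchvision: split the clip into frames, run the single-image pipeline on each, and restack. This is only a minimal sketch, not code from the repo; `apply_per_frame` and the toy transform are hypothetical stand-ins for the real ToPILImage → Resize → CenterCrop → ToTensor composition.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def apply_per_frame(transform: Callable[[A], B], frames: List[A]) -> List[B]:
    """Apply a single-image transform to each frame of a clip.

    PIL-only transforms (as in older torchvision) accept one image at a
    time, so a clip has to be split, transformed frame by frame, and
    restacked afterwards (e.g. with torch.stack in the real pipeline).
    """
    return [transform(frame) for frame in frames]

# Toy usage standing in for the per-image transform pipeline:
halved = apply_per_frame(lambda x: x / 2, [2.0, 4.0, 6.0])
# halved == [1.0, 2.0, 3.0]
```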

Wait, I think it's because I ended up with an older torchvision version. Let me check that before digging into this issue any further.

Confirmed.
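For anyone hitting the same error: to my understanding, tensor inputs to transforms like Resize/CenterCrop landed around torchvision 0.8, so a quick version gate like the sketch below can turn the confusing TypeError into an actionable message. The helpers are my own, not part of the repo, and the 0.8 boundary is an assumption worth double-checking against the pinned requirements.

```python
def version_tuple(v: str) -> tuple:
    # "0.8.1+cu101" -> (0, 8, 1); local suffixes after "+" are ignored
    core = v.split("+")[0]
    return tuple(int(p) for p in core.split(".") if p.isdigit())

def supports_tensor_transforms(torchvision_version: str) -> bool:
    # torchvision >= 0.8 is (to my knowledge) when common transforms
    # started accepting torch tensors in addition to PIL Images
    return version_tuple(torchvision_version) >= (0, 8)

# Hypothetical startup guard:
# assert supports_tensor_transforms(torchvision.__version__), \
#     "these transforms operate on tensors; upgrade torchvision to >= 0.8"
```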