Zasder3 / train-CLIP

A PyTorch Lightning solution to training OpenAI's CLIP from scratch.

Problem related to encoding text

styler00dollar opened this issue

commented

I am trying to use a resnet50 model that I created with this repo, but I can't encode text.

with torch.no_grad():
    tmp = clip.tokenize("test")
    tmp = tmp.to(device)
    print(tmp)
    print(tmp.shape)
    text_encoded = model.model.encode_text(tmp)
tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
torch.Size([1, 77])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-68003eb3bebb> in <module>()
      9     print(tmp)
     10     print(tmp.shape)
---> 11     text_encoded = model.model.encode_text(tmp)
     12 

2 frames
/content/train-CLIP/models/model.py in encode_text(self, text)
    343         x = x + self.positional_embedding.type(self.dtype)
    344         x = x.permute(1, 0, 2)  # NLD -> LND
--> 345         x = self.transformer(x)
    346         x = x.permute(1, 0, 2)  # LND -> NLD
    347         x = self.ln_final(x).type(self.dtype)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    937         elif input_ids is not None:
    938             input_shape = input_ids.size()
--> 939             batch_size, seq_length = input_shape
    940         elif inputs_embeds is not None:
    941             input_shape = inputs_embeds.size()[:-1]

ValueError: too many values to unpack (expected 2)

Printing x before self.transformer(x) results in torch.Size([77, 1, 512]).

The input shape torch.Size([1, 77]) matches what the original CLIP code produces, and a model loaded with clip itself seems to work without major problems:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

image = preprocess(Image.open("/test.png")).unsqueeze(0).to(device)
text = clip.tokenize(["test"]).to(device)
print(text)
print(text.shape)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
torch.Size([1, 77])

I'm not sure what I am doing wrong, since encoding images works fine with this repo:

with torch.no_grad():
    photos_features = model.model.encode_image(image)
    photos_features /= photos_features.norm(dim=-1, keepdim=True)

print(photos_features.shape)
torch.Size([1, 768])

Sadly, I'm currently unable to reproduce this issue:

[Screenshot from 2021-07-09 14-34-31]

Would you mind sharing which training script you used and how you initialize the model?

After looking further, it seems that your text transformer comes from Hugging Face's transformers library. Here's an example of how to tokenize and predict using that model:

encoded_text = tokenizer(sentence_list, return_tensors='pt')
model.encode_text(encoded_text)

Does this fix your problem?

commented

Here is a Google Colab notebook that replicates the issue. Upload the file to Google Colab, or change the paths to run it locally with Jupyter. I also saved all error messages in that notebook. My assumption is that it may be related to pip package versions.
CLIP_bug.zip

The only major thing I added was a checkpoint.py to save .pth files during training. I use the code from train_finetune.py to create the model and load a state dict into it. Since the checkpoint contains both "model" and "teacher", I access model.model to get the actual model.
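
For reference, checkpoint.py is roughly a small Lightning callback along these lines (a simplified sketch; the class name and output path are illustrative, and the hook signature assumes a recent PyTorch Lightning):

import torch
import pytorch_lightning as pl

class SaveStateDict(pl.Callback):
    """Save the LightningModule's state dict to a .pth file after every epoch."""
    def __init__(self, out_path="/content/test.pth"):
        self.out_path = out_path

    def on_train_epoch_end(self, trainer, pl_module):
        # pl_module is the CustomCLIPWrapper, so the saved state dict contains
        # both the "model" and "teacher" submodules mentioned above.
        torch.save(pl_module.state_dict(), self.out_path)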

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
from models import CustomCLIPWrapper
from transformers import AutoTokenizer, AutoModel

# Text encoder and tokenizer, set up as in train_finetune.py
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
txt_encoder = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Image encoder: ResNet-50 with its head resized to the 768-dim embedding space
from torchvision.models import resnet50
img_encoder = resnet50(pretrained=True)
img_encoder.fc = torch.nn.Linear(2048, 768)

# Wrap both encoders and restore the saved weights
model = CustomCLIPWrapper(img_encoder, txt_encoder, 0, avg_word_embs=True)
model.load_state_dict(torch.load("/content/test.pth"))
model.to(device)

with torch.no_grad():
    tmp = clip.tokenize(["test"])
    tmp = tmp.to(device)
    print(tmp)
    print(tmp.shape)
    text_encoded = model.model.encode_text(tmp)  # fails as in the traceback above

I tried the suggested code, but I got this instead:

with torch.no_grad():
    encoded_text = tokenizer(["test"], return_tensors='pt')
    model.model.encode_text(encoded_text)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ffb3e976c349> in <module>()
      1 with torch.no_grad():
      2     encoded_text = tokenizer("test", return_tensors='pt')
----> 3     model.model.encode_text(encoded_text)

/content/train-CLIP/models/model.py in encode_text(self, text)
    339 
    340     def encode_text(self, text):
--> 341         x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]
    342 
    343         x = x + self.positional_embedding.type(self.dtype)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
--> 160             self.norm_type, self.scale_grad_by_freq, self.sparse)
    161 
    162     def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not BatchEncoding

The script train_finetune.py trains the second class in wrapper.py, called CustomCLIPWrapper. That class already has its own text-embedding function attached, so you don't need to call model.model.encode_text; HF models are called differently. The function you are calling is the original CLIP model's, which fails because of the different tokenization protocol. If you call model.encode_text with the right tokenizer, all should be good!
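
To make the difference concrete, here is a simplified sketch of what the wrapper-level text path does with avg_word_embs=True (an illustration, not the repo's exact code): the Hugging Face encoder is called with the tokenizer's BatchEncoding, and the token embeddings are averaged over the attention mask.

def encode_text_sketch(hf_encoder, tokenized):
    # tokenized is the BatchEncoding returned by the HF tokenizer
    out = hf_encoder(**tokenized)
    mask = tokenized["attention_mask"].unsqueeze(-1)        # [B, L, 1]
    summed = (out.last_hidden_state * mask).sum(dim=1)      # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1)                   # [B, 1]
    return summed / counts                                  # [B, hidden_dim]

# e.g. encode_text_sketch(txt_encoder, tokenizer(["test"], return_tensors='pt'))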

commented

I did not notice that there are two functions called encode_text; I assumed there was only one.

with torch.no_grad():
    encoded_text = tokenizer(["test"], return_tensors='pt').to(device)
    result = model.encode_text(encoded_text)
    print(result)
tensor([[-7.9948e-01,  3.2338e-01,  1.7573e-01, -4.5223e-01, -2.1422e-01,
          3.6682e-02, -8.9392e-02, -1.0695e+00, -3.5576e-01,  1.2232e+00,
...

It seems to work, thank you.
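
For anyone else hitting this, putting both encoders together then looks roughly like the following (a sketch; it assumes the image and text features live in the same 768-dim space and that image is a preprocessed tensor as in the snippets above):

with torch.no_grad():
    img_feat = model.model.encode_image(image)
    txt_feat = model.encode_text(tokenizer(["test"], return_tensors='pt').to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    similarity = (img_feat @ txt_feat.T).item()  # cosine similarity in [-1, 1]
    print(similarity)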

Happy to help!