lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

How to get the features and positional embedding information

marcomameli1992 opened this issue · comments

Dear maintainer,
I would like to use your package, but not for classification: I need to extract features from images and get those as output, and in addition I need the positional embedding information so I can reconstruct the feature maps.

Thank you so much.

@marcomameli1992 hi Marco, you just need to modify ViT to have a return statement here https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py#L123 for the embeddings
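An alternative to editing vit.py is a forward pre-hook that captures whatever is fed into the classification head. This is a minimal sketch, assuming only that the model exposes its head as an `mlp_head` submodule (as the ViT in the linked file does); the `Toy` model below is a hypothetical stand-in, not part of vit-pytorch.

```python
import torch
import torch.nn as nn

def capture_pre_head(model, head_attr="mlp_head"):
    """Register a forward pre-hook on the classification head so that each
    forward pass stashes the features entering it on `model._features`."""
    head = getattr(model, head_attr)

    def hook(module, inputs):
        # inputs is a tuple of positional args to the head's forward
        model._features = inputs[0].detach()

    head.register_forward_pre_hook(hook)
    return model

# Hypothetical stand-in for a ViT: any model whose last step is `mlp_head`.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(8, 16)
        self.mlp_head = nn.Linear(16, 4)

    def forward(self, x):
        return self.mlp_head(self.body(x))

model = capture_pre_head(Toy())
logits = model(torch.randn(2, 8))
print(model._features.shape)  # → torch.Size([2, 16])
```

The upside of the hook approach is that the library code stays untouched, which fits the maintainer's point below about keeping vit.py simple.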

i guess i could add this, but i don't want to cloud how simple and clear the code is atm

@marcomameli1992 what do you mean by the positional embedding? the absolute positional embeddings are added at the beginning before it is fed through the attention layers, and can be accessed as v.pos_embedding
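For reconstruction purposes, the layout of that parameter matters. A minimal sketch of the layout described above (an assumption based on vit.py: one learned vector per patch plus one for the CLS token, CLS first), without depending on the package:

```python
import torch
import torch.nn as nn

# Mirrors the shape of ViT's absolute positional embedding: (1, n + 1, dim),
# where n is the number of patches; the hypothetical sizes are for illustration.
num_patches, dim = 64, 128
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))

cls_pos = pos_embedding[:, 0]     # position vector added to the CLS token
patch_pos = pos_embedding[:, 1:]  # one position vector per image patch

# Lay the patch positions back out on the image grid (useful when
# reconstructing spatial feature maps from the token sequence):
side = int(num_patches ** 0.5)    # 8 patches per side here
grid = patch_pos.reshape(1, side, side, dim)
print(grid.shape)  # → torch.Size([1, 8, 8, 128])
```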

@marcomameli1992 actually, let me just write up a layer extractor that can wrap the ViT and return all these intermediates, similar to https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/recorder.py
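Until such an extractor lands in the repo, a generic hook-based wrapper can collect the output of every layer of interest on each forward pass. The class and names below are assumptions for illustration, not the package's API; the toy `nn.Linear` stack stands in for a transformer's blocks.

```python
import torch
import torch.nn as nn

class Intermediates(nn.Module):
    """Wrap any model and record the outputs of selected submodules
    (e.g. each transformer block) on every forward pass."""
    def __init__(self, model, layers):
        super().__init__()
        self.model = model
        self.outputs = []
        for layer in layers:
            layer.register_forward_hook(self._save)

    def _save(self, module, inputs, output):
        self.outputs.append(output.detach())

    def forward(self, x):
        self.outputs = []          # reset per call
        out = self.model(x)
        return out, self.outputs

# Toy stack standing in for transformer blocks.
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)])
model = nn.Sequential(*blocks)
wrapped = Intermediates(model, blocks)
out, feats = wrapped(torch.randn(2, 16))
print(len(feats))  # → 3, one tensor per wrapped block
```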