lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch


Large time gap between GPU and CPU for the Rearrange op

Bobo-y opened this issue · comments


In vit.py, the `to_patch_embedding` code is:

```python
self.to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
    nn.Linear(patch_dim, dim),
)
```
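For reference, the einops pattern above can be expressed as plain reshape/transpose calls. This is only an illustrative sketch of what the pattern does (shown with NumPy for self-containedness; `torch.Tensor` supports the same `reshape`/`permute` operations):

```python
import numpy as np

def to_patches(x, p1, p2):
    # Equivalent of einops 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)'
    b, c, H, W = x.shape
    h, w = H // p1, W // p2
    # split each spatial dim into (patch grid, patch size)
    x = x.reshape(b, c, h, p1, w, p2)
    # bring patch-grid dims together, channels last
    x = x.transpose(0, 2, 4, 3, 5, 1)        # (b, h, w, p1, p2, c)
    return x.reshape(b, h * w, p1 * p2 * c)  # (b, num_patches, patch_dim)

x = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)
patches = to_patches(x, 2, 2)
print(patches.shape)  # (2, 4, 12)
```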

I tested the inference time on CPU and GPU:

```python
def forward(self, img):
    begin = time.time()
    x = self.to_patch_embedding(img)
    print('time of to_patch_embedding {}'.format(time.time() - begin))
    b, n, _ = x.shape
    cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
    x = torch.cat((cls_tokens, x), dim=1)
    begin_pos = time.time()
    x += self.pos_embedding[:, :(n + 1)]
    print('time of pos_embedding {}'.format(time.time() - begin_pos))
    x = self.dropout(x)
    begin_trans = time.time()
    x = self.transformer(x)
    print('time of transformer {}'.format(time.time() - begin_trans))
```
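One caveat with timings like the above: CUDA kernels launch asynchronously, so `time.time()` around a GPU op can measure only the kernel launch, and the cost of earlier ops can be attributed to whichever call forces a sync. A sketch of a timing helper that synchronizes before and after (the helper name `timed` is illustrative, not from the repo):

```python
import time
import torch

def timed(fn, *args, sync=True):
    # CUDA ops are asynchronous: without a synchronize(), the wall-clock
    # interval may exclude the kernel's actual execution time.
    if sync and torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    if sync and torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

out, dt = timed(torch.relu, torch.randn(8))
print(dt >= 0)  # True
```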

For CPU inference: *(screenshot of timing output)*

For GPU inference: *(screenshot of timing output)*

When using the GPU, `to_patch_embedding` is much slower. How can this be improved?
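One direction sometimes used elsewhere (not what this repo does): a `Conv2d` with `kernel_size == stride == patch size` fuses the patch split and the linear projection into a single, typically well-optimized GPU op. A minimal sketch, with illustrative sizes:

```python
import torch
import torch.nn as nn

# Illustrative values; patch_size/dim are assumptions, not from vit.py.
c, dim, patch = 3, 128, 16
conv_embed = nn.Conv2d(c, dim, kernel_size=patch, stride=patch)

img = torch.randn(1, c, 224, 224)
x = conv_embed(img)               # (1, dim, 14, 14)
x = x.flatten(2).transpose(1, 2)  # (1, 196, dim): one token per patch
print(x.shape)
```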