chaoyi-wu / RadFM

The official code for "Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data".


question about perceiver in code and paper

imdoublecats opened this issue · comments

Thank you for your work, but I have a small question about the paper and the code.

In the paper, the perceiver is described as:

"For the visual encoder, we adopt a 12-layer 3D ViT with 768 feature dimensions and the perceiver is chosen as 6-layer transformer decoder with the learnable latent array in 32 × 5120 dimension, so that all images will be embeded as a 32 × 5120 feature embedding after passing visual encoding and perceiver aggregation."

However, in the code the latent dimension is 32 × 768, and it is only extended to 32 × 5120 by an fc layer afterwards. I don't quite understand the purpose of this fc layer.

My guess is that not using 5120 inside the perceiver is due to GPU memory limitations, and that the fc layer expands the latents to 5120 so the subsequent models can process them.

So I would like to ask what the fc layer is for.

Yes, the final 5120 dimension is projected through an fc layer.

Your assumption is right: using 5120 inside the perceiver makes the model far too large, so we instead project the token embeddings to 5120 through a shallow fc layer rather than a multi-layer perceiver at that width. The fc layer is only there to keep the dimension consistent with the downstream LLM; the perceiver is mainly used to compress the token length.
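To make the shape flow concrete, here is a minimal sketch of the design being discussed: a 32 × 768 learnable latent array cross-attends to the ViT patch tokens through a 6-layer transformer decoder (compressing the token length to 32), and a single fc layer then projects 768 → 5120 to match the LLM width. This is a hypothetical re-implementation for illustration, not the authors' actual code; all class and parameter names (`PerceiverResampler`, `vit_dim`, `llm_dim`, etc.) are my own.

```python
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    """Sketch: perceiver compresses token length, fc matches LLM dim."""

    def __init__(self, vit_dim=768, latent_len=32, llm_dim=5120,
                 depth=6, heads=8):
        super().__init__()
        # Learnable latent array kept at the ViT width (32 x 768),
        # not at 5120, to keep the attention layers small.
        self.latents = nn.Parameter(torch.randn(latent_len, vit_dim))
        layer = nn.TransformerDecoderLayer(
            d_model=vit_dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        # Shallow projection to the LLM embedding width.
        self.fc = nn.Linear(vit_dim, llm_dim)

    def forward(self, vis_tokens):
        # vis_tokens: (batch, n_patches, 768) from the 3D ViT
        b = vis_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Cross-attention compresses n_patches tokens down to 32.
        x = self.decoder(tgt=latents, memory=vis_tokens)  # (b, 32, 768)
        # fc only widens the feature dim; token count is unchanged.
        return self.fc(x)                                 # (b, 32, 5120)


if __name__ == "__main__":
    model = PerceiverResampler()
    out = model(torch.randn(2, 512, 768))
    print(out.shape)  # torch.Size([2, 32, 5120])
```

Under this sketch the memory saving is clear: the decoder's weights scale with 768², whereas a perceiver operating directly at 5120 would scale with 5120², roughly a 44× increase per attention layer.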

Sincere thanks for pointing out the mistake; we will revise the paper.

@chaoyi-wu thank you