Precision on ImageNet experiment
Karami-m opened this issue
Hi,
For ImageNet, you mention in the paper that the Hyena code was used for the experiments, replacing the MLP blocks in Hyena ViT-b with block-diagonal matrices, similarly to M2-BERT. The Hyena config file sets `trainer: precision: 16`, so I wonder whether you trained the ImageNet model with bf16 mixed precision (as for M2-BERT) on A100 GPUs, or with plain fp16.
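For context, the quoted `trainer` block follows the PyTorch Lightning convention, where `precision: 16` selects fp16 mixed precision and bf16 must be requested explicitly; a hypothetical version of the change I am asking about would look like:

```yaml
trainer:
  # precision: 16    # fp16 mixed precision (value in the released Hyena config)
  precision: bf16    # bf16 mixed precision, e.g. for A100 GPUs
```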
Also, in the sequence mixer of M2-BERT you replaced attention with bidirectional gated convolutions plus a residual long convolution (Figure 3, left). Did you do the same for ImageNet and include the residual long convolution there? I ask because the Monarch matrices sit inside a residual sequence mixing layer that has an ordinary residual connection, but that connection is not a residual long convolution.
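To make the distinction I am drawing concrete, here is a minimal sketch (my own illustration, not the released M2 code) contrasting a plain skip connection around a sequence mixer with a residual branch that is itself a depthwise long convolution; the module and kernel names are hypothetical:

```python
import torch
import torch.nn as nn


def fft_long_conv(x, k):
    """Depthwise long convolution via FFT; x: (B, D, L), k: (D, L)."""
    L = x.shape[-1]
    x_f = torch.fft.rfft(x, n=2 * L)
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(x_f * k_f, n=2 * L)[..., :L]


class PlainResidualMixer(nn.Module):
    """y = x + mixer(x): an ordinary residual connection around the mixer."""

    def __init__(self, mixer):
        super().__init__()
        self.mixer = mixer

    def forward(self, x):
        return x + self.mixer(x)


class LongConvResidualMixer(nn.Module):
    """y = mixer(x) + long_conv(x): the residual branch is a learned long conv."""

    def __init__(self, mixer, d_model, seq_len):
        super().__init__()
        self.mixer = mixer
        # One length-L kernel per channel (hypothetical parameterization)
        self.k = nn.Parameter(0.02 * torch.randn(d_model, seq_len))

    def forward(self, x):
        return self.mixer(x) + fft_long_conv(x, self.k)
```

Both wrappers preserve the `(B, D, L)` shape of the input, so they are drop-in alternatives around the same mixer; my question is which of the two the ImageNet model actually uses.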