First, follow the LLaVA README to create the base environment.
Then install the packages for Mamba:

```shell
pip install causal-conv1d
pip install mamba-ssm
```
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.
Pretraining takes around 11 hours for Mamba-2.8B-LLaVA-v1.5 on 4x 3090 (24G) GPUs.
Training script without DeepSpeed and without bf16: `pretrain_fp32.sh`.
- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
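For intuition, `mlp2x_gelu` denotes a two-layer MLP with a GELU activation in between, projecting CLIP vision features into the language model's embedding space. A minimal PyTorch sketch follows; the dimensions are illustrative assumptions (CLIP ViT-L/14 emits 1024-dim patch features, and 2560 matches Mamba-2.8B's hidden size), so check the repo's config for the actual values.

```python
import torch
import torch.nn as nn

def build_mlp2x_gelu(vision_dim=1024, hidden_dim=2560):
    # Two linear layers with a GELU in between -- the "mlp2x_gelu" connector.
    # Dimensions are illustrative, not read from the repo's config.
    return nn.Sequential(
        nn.Linear(vision_dim, hidden_dim),
        nn.GELU(),
        nn.Linear(hidden_dim, hidden_dim),
    )

projector = build_mlp2x_gelu()
# A 336px image at patch size 14 yields (336/14)^2 = 576 patch tokens.
features = torch.randn(1, 576, 1024)
out = projector(features)
print(out.shape)  # torch.Size([1, 576, 2560])
```

The projector output is then concatenated with the text embeddings as the language model's input, as in LLaVA-v1.5.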
coming soon ...