[FEATURE] Add image backbones from `MobileCLIP` paper
rsomani95 opened this issue · comments
MobileCLIP is a really fast CLIP architecture built for mobile inference - about 3x faster on iOS / macOS devices than `convnext_base_w`, the fastest previously available CLIP backbone.
They introduce 3 novel image backbones: `mci{0|1|2}`. It would be amazing if these models were available directly via timm. I believe this would be an essential first step towards getting them into open_clip for fine-tuning.
The arch, defined here, uses MobileOne and FastViT components, which are already available in timm. I'm not sure how compatible their re-implementation is with the existing one in timm out of the box, but it looks like integration is definitely possible.
@rsomani95 the components themselves are equivalent at a functional level, but the naming was changed, so the weights would have to be remapped for this model as well...
@rsomani95 I took a closer look at this. S1/S2 (mci1/mci2) are the easiest; those could probably map to OpenCLIP w/ a timm FastViT encoder (after a few additions and a key remapping for the weights). I think the text encoder for those is compatible.
S0 uses a RepMixer-based text encoder, so it would need new code in OpenCLIP as well. The image encoder would map to a tweaked version of FastViT.
The B model uses a ViT w/ a different stem, doable. I really like that ViT does NOT have BatchNorm though, so it's a shame that this one is a ViT-Base w/ BN in the stem.
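The key remapping step could look roughly like this - a minimal sketch where the prefix pairs and module names (`image_encoder.model.`, `reparam_conv`, `conv_kxk`) are hypothetical placeholders, not the actual MobileCLIP-to-timm mapping:

```python
import re

# Ordered (pattern, replacement) rules; the names below are illustrative
# placeholders, NOT the real MobileCLIP -> timm key mapping.
KEY_MAP = [
    (r'^image_encoder\.model\.', 'stages.'),
    (r'\.reparam_conv\.', '.conv_kxk.'),
]

def remap_state_dict(state_dict):
    """Return a new dict with checkpoint keys rewritten via the rules above."""
    out = {}
    for k, v in state_dict.items():
        for pat, repl in KEY_MAP:
            k = re.sub(pat, repl, k)
        out[k] = v
    return out

ckpt = {'image_encoder.model.0.reparam_conv.weight': 'tensor'}
print(remap_state_dict(ckpt))
# -> {'stages.0.conv_kxk.weight': 'tensor'}
```

The real mapping would be derived by diffing the state dicts of the ml-mobileclip reference model and the corresponding timm FastViT model.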
@rwightman thanks for looking into that. That's really great to hear re. S1/S2, as those, in my eyes, sit in the perfect sweet spot of speed + accuracy. Given your observations, maybe it makes sense to port just those two first? Is there something in particular I could help with?