huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Home Page: https://huggingface.co/docs/timm


[FEATURE] Add image backbones from `MobileCLIP` paper

rsomani95 opened this issue

MobileCLIP is a very fast CLIP architecture for mobile inference - roughly 3x faster for inference on iOS / macOS devices than the fastest publicly available CLIP backbone, `convnext_base_w`.
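For context on relative speed, a rough desktop-side comparison can already be made with timm alone. A minimal latency sketch, using timm's existing `fastvit_sa12` as a stand-in for the (not-yet-ported) mci towers and `convnext_base` as a stand-in for the `convnext_base_w` tower; the paper's 3x figure comes from on-device (CoreML) measurements, so numbers here will differ:

```python
import time

import timm
import torch

def latency_ms(name: str, size: int = 256, iters: int = 50) -> float:
    """Average single-image CPU forward latency in milliseconds."""
    model = timm.create_model(name, pretrained=False).eval()
    x = torch.randn(1, 3, size, size)
    with torch.inference_mode():
        for _ in range(5):  # warmup
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters * 1e3

for name in ("fastvit_sa12", "convnext_base"):
    print(f"{name}: {latency_ms(name):.1f} ms / image")
```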

They introduce three novel image backbones: `mci{0|1|2}`. It would be amazing if these models were available directly via timm. I believe that's an essential first step towards getting them into open_clip for fine-tuning.

The architecture, defined here, uses MobileOne and FastViT components, which are already available in timm. I'm not sure how compatible that re-implementation is with timm's existing one out of the box, but integration certainly looks feasible.
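For anyone wanting to poke at the existing pieces, the FastViT and MobileOne families are already registered in timm (the mci variants this issue requests are not; the names below are current timm ones):

```python
import timm
import torch

# The component families MobileCLIP builds on, as currently named in timm.
print(timm.list_models("fastvit*"))    # e.g. fastvit_t8, fastvit_s12, fastvit_sa12, ...
print(timm.list_models("mobileone*"))  # e.g. mobileone_s0 ... mobileone_s4

model = timm.create_model("fastvit_sa12", pretrained=False).eval()
with torch.inference_mode():
    out = model(torch.randn(1, 3, 256, 256))
print(out.shape)  # (1, 1000) classification logits by default
```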

@rsomani95 the components themselves are equivalent at a functional level, but the naming was remapped, so we'd have to remap the weights for this model as well...
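To make the remapping point concrete: since the modules are functionally equivalent, conversion is mostly a key-rename pass over the released state dict. A minimal sketch, where the rename rules and checkpoint file name are illustrative placeholders, not the actual mapping:

```python
import torch

def remap_state_dict(sd: dict, rules: list[tuple[str, str]]) -> dict:
    """Apply simple substring rename rules to every checkpoint key."""
    out = {}
    for k, v in sd.items():
        for old, new in rules:
            k = k.replace(old, new)
        out[k] = v
    return out

rules = [
    ("patch_embed.proj.", "stem.conv."),  # hypothetical rename
    ("network.", "stages."),              # hypothetical rename
]
sd = torch.load("mobileclip_s1_image.pt", map_location="cpu")  # hypothetical file
sd = remap_state_dict(sd, rules)
```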

@rsomani95 I took a closer look at this. s1/s2 (mci1/mci2) are the easiest; those could probably be mapped to OpenCLIP w/ a timm FastViT encoder (after a few additions and a key remapping for weights). I think the text encoder for those is compatible.
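For reference, OpenCLIP can already wrap any timm model as the image tower, so once an mci backbone lands in timm the glue is mostly a model config. A sketch using open_clip's config dataclasses directly, with a hypothetical timm model name (`fastvit_mci1`) and guessed text-tower dimensions:

```python
from open_clip.model import CLIP, CLIPTextCfg, CLIPVisionCfg

model = CLIP(
    embed_dim=512,
    vision_cfg=CLIPVisionCfg(
        timm_model_name="fastvit_mci1",  # hypothetical timm name for the S1 tower
        timm_model_pretrained=False,
        timm_pool="avg",
        timm_proj="linear",
        image_size=256,
    ),
    text_cfg=CLIPTextCfg(
        context_length=77,   # standard CLIP text settings; actual MobileCLIP
        vocab_size=49408,    # values would need checking against the release
        width=512,
        heads=8,
        layers=12,
    ),
)
```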

S0 uses a RepMixer-based text encoder, so it would need new code in OpenCLIP as well. The image encoder would map to a tweaked version of FastViT.
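For anyone unfamiliar with why the RepMixer bit matters: such blocks are built around train-time overparameterization (a conv-BN branch plus an identity skip) that gets folded into a single depthwise conv at inference. A rough, self-contained sketch of that reparameterization idea, not the actual MobileCLIP/FastViT module:

```python
import torch
import torch.nn as nn

class RepDWBlock(nn.Module):
    """Train-time: x + BN(DWConv(x)). Inference: a single fused depthwise conv."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim = dim
        self.kernel_size = kernel_size
        self.conv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                              groups=dim, bias=False)
        self.bn = nn.BatchNorm2d(dim)
        self.reparam = None  # populated by fuse()

    def forward(self, x):
        if self.reparam is not None:
            return self.reparam(x)
        return x + self.bn(self.conv(x))

    @torch.no_grad()
    def fuse(self):
        # Fold BN into conv weights: w' = w * gamma / sqrt(var + eps).
        std = (self.bn.running_var + self.bn.eps).sqrt()
        w = self.conv.weight * (self.bn.weight / std).reshape(-1, 1, 1, 1)
        b = self.bn.bias - self.bn.running_mean * self.bn.weight / std
        # Fold the identity skip in as a 1-valued center tap per channel.
        ident = torch.zeros_like(w)
        ident[:, 0, self.kernel_size // 2, self.kernel_size // 2] = 1.0
        fused = nn.Conv2d(self.dim, self.dim, self.kernel_size,
                          padding=self.kernel_size // 2, groups=self.dim, bias=True)
        fused.weight.copy_(w + ident)
        fused.bias.copy_(b)
        self.reparam = fused

blk = RepDWBlock(64).eval()
x = torch.randn(1, 64, 32, 32)
y0 = blk(x)
blk.fuse()
assert torch.allclose(y0, blk(x), atol=1e-5)  # fused conv matches the branched form
```

The practical upshot is that OpenCLIP's text tower currently assumes a transformer, so a convolutional-mixing text encoder like S0's needs new model code there, not just a weight remap.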

The B model uses a ViT w/ a different stem, so that's doable. I really like that ViT has NO BatchNorm though, so it's a shame this one is a ViT-Base w/ BN in the stem.

@rwightman thanks for looking into that. That's really great to hear re. s1/s2, as those, in my eyes, sit in the perfect sweet spot of speed + accuracy. Given your observations, maybe it makes sense to port those two alone first? Is there anything in particular I could help with?