google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond


[OWL-ViT v2] Does the model support the LoRA training paradigm?

BIGBALLON opened this issue · comments

Thank you for your great work [OWL-ViT].

  • I have a question: can this model be fine-tuned with LoRA to adapt to other downstream tasks while keeping the original model unchanged?
  • Are there any fine-tuning techniques that can efficiently adapt the model to downstream tasks, e.g. 10 categories with 60 images per category?

Thank you again.

We have not tried LoRA with this model but I see no fundamental reason why it shouldn't work. Please let us know how it goes if you try it!
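Since LoRA has not been tried with OWL-ViT, here is a minimal framework-level sketch of the idea in JAX (not Scenic's API; the `lora_dense` name, rank, and scaling are illustrative assumptions): the pretrained weight stays frozen and only a low-rank update `B @ A` is trained.

```python
# Minimal LoRA sketch in plain JAX. All names (lora_dense, lora_a, lora_b,
# alpha) are illustrative assumptions, not part of Scenic or OWL-ViT.
import jax
import jax.numpy as jnp

def lora_dense(x, frozen_w, lora_a, lora_b, alpha=8.0):
    """frozen_w: (out, in) pretrained weight, kept fixed.
    lora_a: (rank, in) and lora_b: (out, rank) are the only trained params."""
    rank = lora_a.shape[0]
    delta = (alpha / rank) * (lora_b @ lora_a)  # low-rank weight update
    return x @ (frozen_w + delta).T

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
in_dim, out_dim, rank = 16, 8, 4
frozen_w = jax.random.normal(k1, (out_dim, in_dim))
lora_a = jax.random.normal(k2, (rank, in_dim)) * 0.01
lora_b = jnp.zeros((out_dim, rank))  # zero init: training starts at the frozen model

x = jnp.ones((2, in_dim))
y = lora_dense(x, frozen_w, lora_a, lora_b)
```

With `lora_b` initialised to zero, the adapted layer reproduces the frozen layer exactly at the start of training; only `lora_a` and `lora_b` would receive gradients.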

We have had good results fine-tuning the models on small datasets. Small datasets may require only very few training steps and will over-fit otherwise, so I would do a sweep of training durations, e.g. 100, 200, 400, 800, ... steps, and pick the best one.
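The duration sweep above can be sketched as a simple loop; `train_and_eval` here is a hypothetical stand-in for your own fine-tuning run that returns a validation score (e.g. AP):

```python
# Hypothetical sweep over training durations; train_and_eval is a stand-in
# for an actual fine-tuning run returning a validation metric.
def pick_best_duration(train_and_eval, durations=(100, 200, 400, 800)):
    results = {n: train_and_eval(num_steps=n) for n in durations}
    best = max(results, key=results.get)  # duration with the best val score
    return best, results

# Toy stand-in: the validation score peaks at 400 steps, then over-fits.
fake_scores = {100: 0.42, 200: 0.51, 400: 0.55, 800: 0.48}
best, results = pick_best_duration(lambda num_steps: fake_scores[num_steps])
# best == 400
```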

You can also try the image-conditioned detection we describe in https://arxiv.org/pdf/2205.06230.pdf, where you get semantic embeddings for the target objects in your training images and use them instead of text queries. You can use multiple examples per category by averaging the embeddings of all the examples of that category. This colab section has an example of how to do image-conditioned detection: https://colab.research.google.com/github/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/OWL_ViT_minimal_example.ipynb#scrollTo=8-hhGqbZzVfX
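The averaging step suggested above can be sketched as follows; the dictionary layout and normalisation are assumptions, and in practice the per-example embeddings would come from the image-conditioned detection code in the colab:

```python
# Sketch: collapse several image-conditioned query embeddings per category
# into one averaged, L2-normalised query. Shapes and names are assumptions.
import jax.numpy as jnp

def average_queries(embeddings_by_category):
    """embeddings_by_category: {name: (num_examples, dim) array}.
    Returns {name: (dim,) unit-norm mean embedding} to use as a query."""
    queries = {}
    for name, embs in embeddings_by_category.items():
        mean = embs.mean(axis=0)
        queries[name] = mean / jnp.linalg.norm(mean)  # normalise the mean
    return queries

embs = {
    "cat": jnp.array([[1.0, 0.0], [0.0, 1.0]]),  # two example embeddings
    "dog": jnp.array([[2.0, 0.0]]),              # single example
}
queries = average_queries(embs)
```

Each resulting query vector can then be used in place of a text-query embedding when scoring detections.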

Thank you for such a quick reply. My doubts have been resolved. Thank you. @mjlm