google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond


[OWL-ViT v2] Does the model support the LoRA training paradigm?

BIGBALLON opened this issue · comments

Thank you for your great work [OWL-ViT].

  • I have a question: can this model be fine-tuned with LoRA to adapt to other downstream tasks while keeping the original model unchanged?
  • Are there any fine-tuning techniques that can efficiently adapt the model to downstream tasks, e.g. 10 categories with 60 images per category?

Thank you again.

We have not tried LoRA with this model but I see no fundamental reason why it shouldn't work. Please let us know how it goes if you try it!
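Since LoRA has not been tried with OWL-ViT, here is a minimal framework-level sketch of the idea in JAX (not Scenic's API; the `lora_dense` name, rank, and scaling are illustrative assumptions): the pretrained weight stays frozen and only a low-rank update `B @ A` is trained.

```python
# Minimal LoRA sketch in plain JAX. All names (lora_dense, lora_a, lora_b,
# alpha) are illustrative assumptions, not part of Scenic or OWL-ViT.
import jax
import jax.numpy as jnp

def lora_dense(x, frozen_w, lora_a, lora_b, alpha=8.0):
    """frozen_w: (out, in) pretrained weight, kept fixed.
    lora_a: (rank, in) and lora_b: (out, rank) are the only trained params."""
    rank = lora_a.shape[0]
    delta = (alpha / rank) * (lora_b @ lora_a)  # low-rank weight update
    return x @ (frozen_w + delta).T

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
in_dim, out_dim, rank = 16, 8, 4
frozen_w = jax.random.normal(k1, (out_dim, in_dim))
lora_a = jax.random.normal(k2, (rank, in_dim)) * 0.01
lora_b = jnp.zeros((out_dim, rank))  # zero init: training starts at the frozen model

x = jnp.ones((2, in_dim))
y = lora_dense(x, frozen_w, lora_a, lora_b)
```

With `lora_b` initialised to zero, the adapted layer reproduces the frozen layer exactly at the start of training; only `lora_a` and `lora_b` would receive gradients.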

We have had good results fine-tuning the models on small datasets. Small datasets may require only very few training steps and will over-fit otherwise, so I would do a sweep of training durations, e.g. 100, 200, 400, 800, ... steps, and pick the best one.
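The duration sweep above can be sketched as a simple loop; `train_and_eval` here is a hypothetical stand-in for your own fine-tuning run that returns a validation score (e.g. AP):

```python
# Hypothetical sweep over training durations; train_and_eval is a stand-in
# for an actual fine-tuning run returning a validation metric.
def pick_best_duration(train_and_eval, durations=(100, 200, 400, 800)):
    results = {n: train_and_eval(num_steps=n) for n in durations}
    best = max(results, key=results.get)  # duration with the best val score
    return best, results

# Toy stand-in: the validation score peaks at 400 steps, then over-fits.
fake_scores = {100: 0.42, 200: 0.51, 400: 0.55, 800: 0.48}
best, results = pick_best_duration(lambda num_steps: fake_scores[num_steps])
# best == 400
```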

You can also try the image-conditioned detection we describe in https://arxiv.org/pdf/2205.06230.pdf, where you get semantic embeddings for the target objects in your training images and use them instead of text queries. You can use multiple examples per category by averaging the embeddings of all the examples of that category. This colab section has an example of how to do image-conditioned detection: https://colab.research.google.com/github/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/OWL_ViT_minimal_example.ipynb#scrollTo=8-hhGqbZzVfX
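The averaging step suggested above can be sketched as follows; the dictionary layout and normalisation are assumptions, and in practice the per-example embeddings would come from the image-conditioned detection code in the colab:

```python
# Sketch: collapse several image-conditioned query embeddings per category
# into one averaged, L2-normalised query. Shapes and names are assumptions.
import jax.numpy as jnp

def average_queries(embeddings_by_category):
    """embeddings_by_category: {name: (num_examples, dim) array}.
    Returns {name: (dim,) unit-norm mean embedding} to use as a query."""
    queries = {}
    for name, embs in embeddings_by_category.items():
        mean = embs.mean(axis=0)
        queries[name] = mean / jnp.linalg.norm(mean)  # normalise the mean
    return queries

embs = {
    "cat": jnp.array([[1.0, 0.0], [0.0, 1.0]]),  # two example embeddings
    "dog": jnp.array([[2.0, 0.0]]),              # single example
}
queries = average_queries(embs)
```

Each resulting query vector can then be used in place of a text-query embedding when scoring detections.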

Thank you for such a quick reply. My doubts have been resolved. Thank you. @mjlm