Search All: Cross Modal Retrieval 📜🎵📷
You can search a collection of images using text, images, or audio.
YouTube Video: https://www.youtube.com/watch?v=gGeUbwiljSE
Warning: the model is really bad at mixing modalities; text is weighted way more heavily than the others. I've tried a couple of things, but they didn't work. Assuming I was not being dumb, I wouldn't recommend using this model for anything real.
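To make the warning concrete: for the same concept, text queries tend to score far higher against the image index than image or audio queries do. Below is a hypothetical diagnostic sketch, not part of this repo, for eyeballing that gap on embeddings you have already computed:

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embs, queries):
    """Print how strongly each query modality scores against the image index.

    Hypothetical helper: `image_embs` is an (N, D) tensor of image embeddings
    and `queries` maps a modality name ("text", "audio", ...) to an (M, D)
    tensor of query embeddings from the same shared ImageBind space.
    """
    image_embs = F.normalize(image_embs, dim=-1)
    for name, q in queries.items():
        sims = F.normalize(q, dim=-1) @ image_embs.T
        print(f"{name:>6}: mean={sims.mean().item():.3f} max={sims.max().item():.3f}")
```

If the text row dominates across the board, mixed-modality ranking would need per-modality calibration, which is exactly the kind of fix that didn't pan out here.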
For this app, we have embedded Stable Diffusion-generated images from the lexica eval set; you can search them using text, images, or audio. The embeddings are generated using Meta ImageBind, a powerful new model capable of handling a lot of modalities.
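As an illustration, here is a minimal sketch of how embedding and retrieval work with ImageBind, following the usage example in the ImageBind README. The file paths are placeholders, and the import layout assumes a packaged checkout of the ImageBind repo (older checkouts expose top-level `data` and `models` modules instead):

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained model (downloads the checkpoint on first use).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Placeholder inputs: the images to index and a text query.
image_paths = ["images/cat.jpg", "images/dog.jpg"]
text_query = ["a photo of a cat"]

inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.TEXT: data.load_and_transform_text(text_query, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Every modality lands in the same embedding space, so retrieval is just a
# cosine-similarity ranking between the query vector and the image vectors.
image_emb = torch.nn.functional.normalize(embeddings[ModalityType.VISION], dim=-1)
query_emb = torch.nn.functional.normalize(embeddings[ModalityType.TEXT], dim=-1)
scores = query_emb @ image_emb.T
print(image_paths[scores.argmax().item()])  # best match for the query
```

An audio query works the same way, via `ModalityType.AUDIO` and `data.load_and_transform_audio_data`.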
Installation
Python
You have the same installation requirements as the original ImageBind, plus a couple more packages. You can install them all with:
```bash
conda create --name imagebind python=3.8 -y
conda activate imagebind
pip install -r requirements.txt
```
Then you can run the app with:
```bash
gradio app.py
```
It should download the correct model; this might take a while.
Docker
Be sure to have both Docker and nvidia-docker installed. Build the container:
```bash
docker build -t search-all .
```
Run it:
```bash
docker run --gpus all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v $PWD/.checkpoints:/workspace/.checkpoints -p 7860:7860 search-all
```
It should download the correct model; this might take a while.
Then, open http://localhost:7860 in your browser.
Seed the database
We use images from lexica, gently borrowed from this Hugging Face dataset. You need to download them and edit `scripts/create_embeddings.py`.
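The script internals are specific to this repo, but the general shape of the seeding step is: fetch the images, embed them with ImageBind's vision encoder, and persist the vectors. A hedged sketch of what that might look like (the dataset id, the `image` column name, and the output path are placeholders, not the real values from the script):

```python
import os
import torch
from datasets import load_dataset
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

# Placeholder dataset id: use the Hugging Face dataset linked above instead.
ds = load_dataset("user/lexica-images", split="train")

# Write each image to disk, since the ImageBind transforms read file paths.
os.makedirs("images", exist_ok=True)
paths = []
for i, row in enumerate(ds):
    path = f"images/{i}.jpg"
    row["image"].convert("RGB").save(path)  # assumes a PIL `image` column
    paths.append(path)

# Embed in batches and stash paths + vectors for the app to load at startup.
embeddings = []
batch_size = 32
with torch.no_grad():
    for start in range(0, len(paths), batch_size):
        batch = paths[start : start + batch_size]
        inputs = {ModalityType.VISION: data.load_and_transform_vision_data(batch, device)}
        embeddings.append(model(inputs)[ModalityType.VISION].cpu())

torch.save({"paths": paths, "embeddings": torch.cat(embeddings)}, "embeddings.pt")
```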
Contributing
Thanks a lot for considering contributing. The only requirement is that you run `make style` so all the code is nice and formatted :)