Vocabulary-free Image Classification

Code implementation of our paper: Vocabulary-free Image Classification

Home Page: https://alessandroconti.me/papers/2306.00917


Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task, as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories.

Figure: Vision-Language Model (VLM)-based classification vs. Vocabulary-free Image Classification.

In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.

Overview of CaSED. Given an input image, CaSED retrieves the most relevant captions from an external database and filters them to extract candidate categories. Candidates are then scored with both image-to-text and text-to-text similarity, using the centroid of the retrieved captions as the textual counterpart of the input image.
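
The scoring step described above can be sketched with a plain CLIP model from the transformers library. The snippet below is only an illustrative approximation of the idea, not the repository implementation: the retrieved captions and candidate categories are hard-coded placeholders (CaSED obtains them from the external database), and the candidate filtering step is omitted. The alpha weight mirrors the alpha argument of the released model.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# placeholder inputs: in CaSED, captions are retrieved from the external database
# and candidate categories are extracted from those captions
captions = ["two cats sleeping on a couch", "a pair of cats resting on a sofa"]
candidates = ["cat", "sofa", "remote control"]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=[image], return_tensors="pt"))
    caption_emb = model.get_text_features(**processor(text=captions, return_tensors="pt", padding=True))
    candidate_emb = model.get_text_features(**processor(text=candidates, return_tensors="pt", padding=True))

# l2-normalize everything and use the caption centroid as the textual counterpart of the image
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
caption_emb = caption_emb / caption_emb.norm(dim=-1, keepdim=True)
candidate_emb = candidate_emb / candidate_emb.norm(dim=-1, keepdim=True)
centroid = caption_emb.mean(dim=0, keepdim=True)
centroid = centroid / centroid.norm(dim=-1, keepdim=True)

# rank candidates by a weighted sum of image-to-text and text-to-text similarity
alpha = 0.5
scores = alpha * (image_emb @ candidate_emb.T) + (1 - alpha) * (centroid @ candidate_emb.T)
print(candidates[scores.argmax().item()])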

Inference

Our model CaSED is available on HuggingFace Hub. You can try it directly from the demo or import it from the transformers library.

To use the model from the HuggingFace Hub, you can use the following snippet:

import requests
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# download an image from the internet
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# load the model and the processor
model = AutoModel.from_pretrained("altndrr/cased", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# get the model outputs
images = processor(images=[image], return_tensors="pt", padding=True)
outputs = model(images, alpha=0.5)
labels, scores = outputs["vocabularies"][0], outputs["scores"][0]

# print the top 5 most likely labels for the image
values, indices = scores.topk(5)
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{labels[index]:>16s}: {100 * value.item():.2f}%")

Note that our model depends on some libraries you have to install manually. Please refer to the model card for further details.

Setup

Install dependencies

# clone project
git clone https://github.com/altndrr/vic
cd vic

# install requirements
# it will create a .venv folder in the project root
# and install all the dependencies using flit
make install

# activate virtual environment
source .venv/bin/activate

Setup environment variables

# copy .env.example to .env
cp .env.example .env

# edit .env file
vim .env

Usage

The two entry points are train.py and eval.py. Calling them without any argument will use the default configuration.

# train model
python src/train.py

# test model
python src/eval.py

Configuration

The full list of parameters can be found under configs, but the most important ones are:

  • data: dataset to use, defaults to caltech101.
  • experiment: experiment to run, defaults to baseline/clip.
  • logger: logger to use, defaults to null.

Parameters can be overridden by passing them as command-line arguments. You can additionally override any parameter from the config file by using the ++ prefix.

# train model on ucf101 dataset
python src/train.py data=ucf101 experiment=baseline/clip

# train model on ucf101 dataset with RN50 backbone
python src/train.py data=ucf101 experiment=baseline/clip model=clip ++model.model_name=RN50

Note that since all our approaches are training-free, there is virtually no difference between train.py and eval.py. However, we still keep them separate for clarity.
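
Since both entry points share the same configuration system, the overrides above should also apply to evaluation; for instance (illustrative, assuming the same config groups are accepted by eval.py):

# evaluate on ucf101 with the CLIP baseline, using the same overrides as for training
python src/eval.py data=ucf101 experiment=baseline/clip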

Docker containers

We provide Docker images for the deployment of containerized services. Currently, the only available container is the one for the retrieval server. To start the server, run the following commands:

# build the Docker images
docker compose build

# start the server
docker compose --profile retrieval-server up

Development

Install pre-commit hooks

# install pre-commit hooks
pre-commit install

Run tests

# run fast tests
make test

# run all tests
make test-full

Format code

# run linters
make format

Clean repository

# remove autogenerated files
make clean

# remove logs
make clean-logs

Citation

@misc{conti2023vocabularyfree,
      title={Vocabulary-free Image Classification},
      author={Alessandro Conti and Enrico Fini and Massimiliano Mancini and Paolo Rota and Yiming Wang and Elisa Ricci},
      year={2023},
      eprint={2306.00917},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

MIT License

