
Code for the Recognize Anything Model (RAM) and Tag2Text Model

Home Page: https://recognize-anything.github.io/


🏷️ Recognize Anything & Tag2Text

Web Demo Open in Colab

Official PyTorch Implementation of Recognize Anything: A Strong Image Tagging Model and Tag2Text: Guiding Vision-Language Model via Image Tagging.

  • The Recognize Anything Model (RAM) is an image tagging model that can recognize any common category with high accuracy.
  • Tag2Text is a vision-language model guided by tagging, which supports captioning, retrieval, and tagging.

Both Tag2Text and RAM exhibit strong recognition ability. We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) to build a strong visual semantic analysis pipeline in the Grounded-SAM project.

💡 Highlight of RAM

RAM is a strong image tagging model that can recognize any common category with high accuracy.

  • Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization:
    • RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
    • RAM even surpasses fully supervised methods such as ML-Decoder.
    • RAM is competitive with the Google tagging API.
  • Reproducible and affordable. RAM has a low reproduction cost, thanks to an open-source and annotation-free dataset.
  • Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.

(In the comparison figures, green denotes fully supervised learning and blue denotes zero-shot performance.)

RAM significantly improves the tagging ability of the Tag2Text framework.

  • Accuracy. RAM utilizes a data engine to generate additional annotations and clean incorrect ones, yielding higher accuracy than Tag2Text.
  • Scope. RAM increases the number of fixed tags from 3,400+ to 6,400+ (reduced to 4,500+ distinct semantic tags after merging synonyms), covering more valuable categories. Moreover, RAM has open-set capability and can recognize tags not seen during training.

🌅 Highlight of Tag2Text

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

  • Tagging. Tag2Text recognizes 3,400+ commonly human-used categories without manual annotations.
  • Captioning. Tag2Text integrates tag information into text generation as guiding elements, resulting in more controllable and comprehensive descriptions.
  • Retrieval. Tag2Text provides tags as additional visible alignment indicators for image-text retrieval.

✍️ TODO

  • Release Tag2Text demo.
  • Release checkpoints.
  • Release inference code.
  • Release RAM demo and checkpoints.
  • Release training code (by July 8 at the latest).
  • Release training datasets (by July 15 at the latest).

🧰 Checkpoints

| # | Name | Backbone | Data | Illustration | Checkpoint |
|---|------|----------|------|--------------|------------|
| 1 | RAM-14M | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Provides strong image tagging ability. | Download link |
| 2 | Tag2Text-14M | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Supports comprehensive captioning and tagging. | Download link |

🏃 Model Inference

Setting Up

  1. Install the dependencies:

pip install -r requirements.txt

  2. Download the RAM pretrained checkpoints.

  3. (Optional) To use RAM and Tag2Text in other projects, it is better to install recognize-anything as a package:

pip install -e .

Then the RAM and Tag2Text models can be imported in other projects:

from ram.models import ram, tag2text_caption
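
For example, the following is a minimal sketch of loading RAM from the package and tagging a single image. The `ram(...)` constructor arguments and the `generate_tag` call mirror the repo's inference scripts, but treat them as assumptions if your version differs:

```python
# Minimal sketch of using RAM as a library. The constructor keywords
# (pretrained, image_size, vit) and generate_tag() are assumed from the
# repo's inference scripts; verify against your checkout.
import torch
from PIL import Image
from torchvision import transforms

from ram.models import ram

device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard ImageNet-style preprocessing at the model's 384x384 input size.
transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l").eval().to(device)

image = transform(Image.open("images/demo/demo1.jpg").convert("RGB"))
image = image.unsqueeze(0).to(device)

with torch.no_grad():
    # Assumed to return per-image English and Chinese tag strings.
    english_tags, chinese_tags = model.generate_tag(image)
print(english_tags[0])
```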

RAM Inference

Get the English and Chinese tagging outputs for an image:

python inference_ram.py --image images/demo/demo1.jpg --pretrained pretrained/ram_swin_large_14m.pth

RAM Inference on Unseen Categories (Open-Set)

First, customize the recognition categories in build_openset_label_embedding, then get the tags of the images:

python inference_ram_openset.py --image images/openset_example.jpg --pretrained pretrained/ram_swin_large_14m.pth
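
As a rough illustration, the sketch below shows what such a customization might look like, mirroring the open-set inference script. The helper's import path and signature, and the model attributes being overwritten, are assumptions to verify against your checkout:

```python
# Hypothetical sketch of swapping in custom open-set categories. The import
# path, whether the helper accepts a category list, and the attribute names
# set on the model are all assumptions; check build_openset_label_embedding
# in your version of the repo.
import torch
from ram.utils import build_openset_label_embedding

custom_categories = ["corgi", "skateboard ramp", "traffic cone"]

# Build text embeddings for the new vocabulary and install them on a RAM
# model loaded as in the earlier sketch.
label_embed, categories = build_openset_label_embedding(custom_categories)
model.tag_list = categories
model.label_embed = torch.nn.Parameter(label_embed.float())
model.num_class = len(categories)
model.class_threshold = torch.ones(model.num_class) * 0.5  # assumed default threshold
```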

Tag2Text Inference

Get the tagging and captioning results:

python inference_tag2text.py --image images/demo/demo1.jpg --pretrained pretrained/tag2text_swin_14m.pth

Or get the tagging and specified captioning results (optional):

python inference_tag2text.py --image images/demo/demo1.jpg --pretrained pretrained/tag2text_swin_14m.pth --specified-tags "cloud,sky"
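
If you installed the package, a hedged sketch of the same Tag2Text workflow in Python looks like the following; it reuses the preprocessing from the RAM sketch in Setting Up, and the constructor arguments and generate() keywords are assumptions rather than a confirmed API:

```python
# Hedged sketch of Tag2Text via the package, reusing the `transform`, `device`,
# and preprocessed `image` tensor from the RAM sketch above. Constructor and
# generate() keywords follow inference_tag2text.py but are assumptions.
import torch
from ram.models import tag2text_caption

tag2text = tag2text_caption(pretrained="pretrained/tag2text_swin_14m.pth",
                            image_size=384, vit="swin_b").eval().to(device)

with torch.no_grad():
    # tag_input=None lets the model predict tags itself; a nested list such as
    # [["cloud", "sky"]] would steer the caption toward specified tags.
    caption, tags = tag2text.generate(image, tag_input=None,
                                      return_tag_predict=True)
print(tags[0], "->", caption[0])
```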

Batch Inference

Get the tagging and captioning results on multiple images:

python batch_inference.py --image-dir images/demo/ --pretrained pretrained/tag2text_swin_14m.pth --model-type tag2text
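
Equivalently, when using the package directly, batching is just a matter of stacking preprocessed images. A minimal sketch, assuming the model and transform from the RAM sketch above (the repo's batch_inference.py is the reference implementation):

```python
# Minimal batching sketch: preprocess every image in a folder, stack them into
# one tensor, and tag the whole batch in a single forward pass. Assumes the
# `model`, `transform`, and `device` from the RAM sketch above.
from pathlib import Path

import torch
from PIL import Image

paths = sorted(p for p in Path("images/demo").iterdir()
               if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in paths])

with torch.no_grad():
    english_tags, _ = model.generate_tag(batch.to(device))

for path, tags in zip(paths, english_tags):
    print(path.name, "->", tags)
```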

✒️ Citation

If you find our work useful for your research, please consider citing:

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}

♥️ Acknowledgements

This work is done with the help of the amazing codebase of BLIP, thank you very much!

We want to thank @Cheng Rui @Shilong Liu @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.

We also want to thank Ask-Anything and Prompt-can-anything for integrating RAM/Tag2Text, which greatly expands the application boundaries of RAM/Tag2Text.

License: Apache License 2.0

