
Code for paper: Tag2Text: Guiding Vision-Language Model via Image Tagging

Home page: https://tag2text.github.io


🏷️ Tag2Text: Guiding Vision-Language Model via Image Tagging

Official PyTorch implementation of Tag2Text, an efficient and controllable vision-language model with tagging guidance. Code is available now!

Welcome to try out the Tag2Text web demo 🤗! Both tagging and captioning are included.

Tag2Text is now combined with Grounded-SAM, which can automatically recognize, detect, and segment anything in an image! Tag2Text showcases powerful image recognition capabilities.
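The combined pipeline chains three models: Tag2Text produces tags, Grounding DINO grounds the tags as boxes, and SAM segments each box. Below is a minimal sketch of that flow; the `groundingdino` and `segment_anything` calls are real public APIs, but the checkpoint/config paths and the hard-coded tag string are placeholder assumptions, and the maintained wiring lives in the Grounded-SAM repo.

```python
# Sketch of the Tag2Text -> Grounding DINO -> SAM chain behind Grounded-SAM.
# Checkpoint/config paths and the tag string are placeholders.
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# 1) Tags from Tag2Text (hard-coded here; in practice, generated by the model).
tags = "cloud, sky"

# 2) Grounding DINO: tags -> boxes (cxcywh, normalized to [0, 1]).
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("images/1641173_2291260800.jpg")
boxes, logits, phrases = predict(model=dino, image=image, caption=tags,
                                 box_threshold=0.35, text_threshold=0.25)

# 3) SAM: boxes -> masks.
sam = sam_model_registry["vit_h"]("sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

h, w = image_source.shape[:2]
for box in boxes:
    # Convert normalized cxcywh to absolute xyxy, as SAM expects.
    cx, cy, bw, bh = box.numpy() * np.array([w, h, w, h])
    xyxy = np.array([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
    masks, _, _ = predictor.predict(box=xyxy, multimask_output=False)
```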

🔥 News

  • 2023/06/05: Prompt-can-anything, a Gradio web library that integrates SOTA multimodal large models and aims to become an "agent" that helps you do anything, includes Tag2Text as its core model for image understanding. It covers automatic visual tasks, voice, graphics, GPT, and more. Version 1.15 released!
  • 2023/05/20: Tag2Text is combined with VideoChat, where it provides powerful tagging and captioning capabilities as a fundamental component!
  • 2023/04/20: We marry Tag2Text with Grounded-SAM to provide powerful image recognition capabilities!
  • 2023/04/10: Code and checkpoints are available now!
  • 2023/03/14: Tag2Text web demo 🤗 is available on Hugging Face Space!

💡 Highlight

  • Tagging. Without manual annotations, Tag2Text achieves superior image tag recognition across 3,429 commonly used categories.
  • Efficient. Tagging guidance effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks.
  • Controllable. Tag2Text permits users to input desired tags, providing flexibility in composing corresponding texts based on the input tags (see the sketch after this list).
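The controllability is easiest to see in code: the same image yields different captions depending on which tags guide the text decoder. The sketch below follows the structure of `inference.py`, but the `tag2text_caption` constructor arguments, the `generate` signature, and the tag-string format are assumptions taken from that script and may differ between versions; treat `inference.py` as authoritative.

```python
# Illustrative sketch of tag-guided captioning; verify names against inference.py.
import torch
from PIL import Image
from torchvision import transforms
from models.tag2text import tag2text_caption  # assumed repo module path

device = "cuda" if torch.cuda.is_available() else "cpu"
transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

model = tag2text_caption(pretrained="pretrained/tag2text_swin_14m.pth",
                         image_size=384, vit="swin_b").eval().to(device)
image = transform(Image.open("images/1641173_2291260800.jpg")
                  .convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    # Default mode: the model tags the image itself, then captions from its own tags.
    caption, tags = model.generate(image, tag_input=None, return_tag_predict=True)
    # Controllable mode: the caption is composed from user-specified tags instead
    # (assumed format: one " | "-separated string per image).
    guided, _ = model.generate(image, tag_input=["cloud | sky"],
                               return_tag_predict=True)
```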

✍️ TODO

  • Release demo.
  • Release checkpoints.
  • Release inference code.
  • Release training codes.
  • Release training datasets.

🧰 Checkpoints

| # | Name | Backbone | Data | Illustration | Checkpoint |
|:-:|:----:|:--------:|:----:|:------------:|:----------:|
| 1 | Tag2Text-Swin | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Demo version with comprehensive captions | Download link |
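After downloading, a quick sanity check that the file loads and looks like a model state dict can save a failed run later. This uses only plain PyTorch; the `"model"` wrapper key is a common convention and an assumption here.

```python
# Sanity-check the downloaded checkpoint with plain PyTorch.
import torch

ckpt = torch.load("pretrained/tag2text_swin_14m.pth", map_location="cpu")
# Checkpoints are often wrapped as {"model": state_dict, ...}; fall back if not.
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} tensors; sample keys: {list(state)[:3]}")
```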

🏃 Model Inference

  1. Install the dependencies:

```bash
pip install -r requirements.txt
```

  2. Download the Tag2Text pretrained checkpoint (see the table above).

  3. Get the tagging and captioning results:

```bash
python inference.py --image images/1641173_2291260800.jpg \
                    --pretrained pretrained/tag2text_swin_14m.pth
```

Or get the tagging and specified captioning results (optional):

```bash
python inference.py --image images/1641173_2291260800.jpg \
                    --pretrained pretrained/tag2text_swin_14m.pth \
                    --specified-tags "cloud,sky"
```
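To process a folder of images, the same CLI can be driven from a short script; this sketch reuses exactly the flags shown above:

```python
# Run the inference CLI over every JPEG in the images/ folder.
import pathlib
import subprocess

for img in sorted(pathlib.Path("images").glob("*.jpg")):
    subprocess.run(
        ["python", "inference.py",
         "--image", str(img),
         "--pretrained", "pretrained/tag2text_swin_14m.pth"],
        check=True,
    )
```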

✒️ Citation

If you find our work useful for your research, please consider citing:

```bibtex
@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}
```

♥️ Acknowledgements

This work is built with the help of the amazing code base of BLIP; thank you very much!

We also want to thank @Cheng Rui, @Shilong Liu, and @Ren Tianhe for their help in marrying Tag2Text with Grounded-SAM.
