There are 25 repositories listed under the vision-language topic.
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
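Grounding DINO detects objects from a free-form text prompt rather than a fixed label set. Beyond the official repo's own API, a port exists in Hugging Face Transformers; the sketch below assumes that port and the `IDEA-Research/grounding-dino-tiny` checkpoint, not the official codebase.

```python
# Minimal zero-shot detection sketch using the Transformers port of Grounding DINO
# (assumed available; the official repo exposes its own inference API).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

ckpt = "IDEA-Research/grounding-dino-tiny"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(ckpt)
model = GroundingDinoForObjectDetection.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "a cat. a remote control."  # lower-cased phrases separated by periods

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw logits/boxes into thresholded, image-sized detections
# (argument names may differ slightly across library versions).
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```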
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
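The BLIP repo ships its own training and inference code; for a quick captioning check, one option is the Transformers port. A minimal sketch, assuming that port and the `Salesforce/blip-image-captioning-base` checkpoint:

```python
# Image captioning with the Transformers port of BLIP (assumed available;
# the official repo provides its own loaders and demo notebooks).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-base"  # assumed checkpoint id
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)  # unconditional captioning
print(processor.decode(out[0], skip_special_tokens=True))
```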
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
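Chinese-CLIP is used like CLIP, but with Chinese text prompts. A minimal zero-shot classification sketch, assuming the Transformers port and the `OFA-Sys/chinese-clip-vit-base-patch16` checkpoint rather than the repo's own package:

```python
# Zero-shot image classification with Chinese prompts via the Transformers
# port of Chinese-CLIP (assumed; the official repo also ships a cn_clip package).
import requests
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

ckpt = "OFA-Sys/chinese-clip-vit-base-patch16"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(ckpt)
processor = ChineseCLIPProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["一只猫", "一只狗", "一辆汽车"]  # "a cat", "a dog", "a car"

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-text similarity
print(dict(zip(texts, probs[0].tolist())))
```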
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Overview of Japanese LLMs (日本語LLMまとめ)
DriveLM: Driving with Graph Visual Question Answering
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
[ICCV 2021 & TPAMI 2023] Vision-Language Transformer and Query Generation for Referring Segmentation
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain. (CVPR 2021)
Tools for movie and video research
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
A Framework of Small-scale Large Multimodal Models
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"
[AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario.
PyTorch code for BagFormer: Better Cross-Modal Retrieval via bag-wise interaction
PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"
MixGen: A New Multi-Modal Data Augmentation
[ICCV 2023] Official implementation of "PØDA: Prompt-driven Zero-shot Domain Adaptation"
[CVPR 2023] Official repository of paper titled "CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search".