There are 25 repositories listed under the vision-language topic.
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
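Grounding DINO detects objects from a free-form text prompt rather than a fixed label set. Beyond the official repo's own API, a port exists in Hugging Face Transformers; the sketch below assumes that port and the `IDEA-Research/grounding-dino-tiny` checkpoint, not the official codebase.

```python
# Minimal zero-shot detection sketch using the Transformers port of Grounding DINO
# (assumed available; the official repo exposes its own inference API).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

ckpt = "IDEA-Research/grounding-dino-tiny"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(ckpt)
model = GroundingDinoForObjectDetection.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "a cat. a remote control."  # lower-cased phrases separated by periods

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw logits/boxes into thresholded, image-sized detections
# (argument names may differ slightly across library versions).
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```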
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
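The BLIP repo ships its own training and inference code; for a quick captioning check, one option is the Transformers port. A minimal sketch, assuming that port and the `Salesforce/blip-image-captioning-base` checkpoint:

```python
# Image captioning with the Transformers port of BLIP (assumed available;
# the official repo provides its own loaders and demo notebooks).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-base"  # assumed checkpoint id
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)  # unconditional captioning
print(processor.decode(out[0], skip_special_tokens=True))
```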
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
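Chinese-CLIP is used like CLIP, but with Chinese text prompts. A minimal zero-shot classification sketch, assuming the Transformers port and the `OFA-Sys/chinese-clip-vit-base-patch16` checkpoint rather than the repo's own package:

```python
# Zero-shot image classification with Chinese prompts via the Transformers
# port of Chinese-CLIP (assumed; the official repo also ships a cn_clip package).
import requests
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

ckpt = "OFA-Sys/chinese-clip-vit-base-patch16"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(ckpt)
processor = ChineseCLIPProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["一只猫", "一只狗", "一辆汽车"]  # "a cat", "a dog", "a car"

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-text similarity
print(dict(zip(texts, probs[0].tolist())))
```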
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Overview of Japanese LLMs (日本語LLMまとめ)
DriveLM: Driving with Graph Visual Question Answering
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
[ICCV 2021 & TPAMI 2023] Vision-Language Transformer and Query Generation for Referring Segmentation
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain. (CVPR 2021)
Tools for movie and video research
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
A Framework of Small-scale Large Multimodal Models
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"
[AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario.
PyTorch code for BagFormer: Better Cross-Modal Retrieval via bag-wise interaction
PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"
MixGen: A New Multi-Modal Data Augmentation
[ICCV 2023] Official implementation of "PØDA: Prompt-driven Zero-shot Domain Adaptation"
[CVPR 2023] Official repository of paper titled "CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search".