There are 33 repositories under the multi-modal-learning topic.
An open source implementation of CLIP.
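For readers new to the topic, a minimal zero-shot classification sketch using the open_clip package is shown below; the model name, pretrained tag, image path, and candidate captions are placeholder choices, not recommendations.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP model and its preprocessing pipeline (model/tag are example choices)
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # hypothetical image path
text = tokenizer(['a photo of a cat', 'a photo of a dog'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between L2-normalized embeddings, softmaxed over captions
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # high probability on the caption matching the image
```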
A Chinese version of CLIP that enables Chinese cross-modal retrieval and representation generation.
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
A concise but complete implementation of CLIP with various experimental improvements from recent papers
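Common to these CLIP implementations is the symmetric contrastive (InfoNCE) objective over paired image-text embeddings. The sketch below is a generic PyTorch rendition; the function name and default temperature are chosen for illustration, not taken from any one repository.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) embeddings for N matched image-text pairs
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    # Matching pairs sit on the diagonal, so the target for row i is class i
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```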
A curated list of Visual Question Answering (VQA, including image/video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey
CVPR 2023-2024 Papers: Dive into advanced research presented at the leading computer vision conference. Keep up to date with the latest developments in computer vision and deep learning. Code included. ⭐ Support visual intelligence development!
[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"
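PromptKD's exact unsupervised objective is specific to the paper, but the underlying distillation idea can be sketched generically: match the student's softened predictions to the teacher's. The helper below is an illustrative classic-KD sketch, not the paper's implementation.

```python
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, T=4.0):
    # Soften both distributions with temperature T, then match them via KL divergence;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (T * T)
```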
[ICCV 2023] Implicit Neural Representation for Cooperative Low-light Image Enhancement
[CVPR 2020] Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation
The official repository of Achelous and Achelous++
Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
[ICML 2023] Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
Official PyTorch repository for CG-DETR "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding"
A detection/segmentation dataset with labels characterized by intricate and flexible expressions. "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023).
[ICCV 2023] The official code of Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
[CVPR 2024] Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
[CVPR 2025] FLAIR: VLM with Fine-grained Language-informed Image Representations
[ICCV 2025] Official PyTorch Code for "Advancing Textual Prompt Learning with Anchored Attributes"
A Python tool to perform deep learning experiments on multimodal remote sensing data.
[NeurIPS 2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
[NeurIPS 2023] A faithful benchmark for vision-language compositionality
Welcome to the Awesome Multi-Modal Object Re-Identification repository! A curated collection of the latest methods, datasets, and resources for multi-modal object re-identification, bringing together cutting-edge research, tools, and papers to advance both the study and application of the field.
An official implementation of Advancing Radiograph Representation Learning with Masked Record Modeling (ICLR'23)
This repository contains code to download data for the preprint "MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning"
PyTorch version of the HyperDenseNet deep neural network for multi-modal image segmentation
Japanese CLIP by rinna Co., Ltd.
[ICLR 2025] Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
[NeurIPS 2024 Spotlight] Code for the paper "Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts"
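Flex-MoE's routing over arbitrary modality combinations is described in the paper; as a point of reference, a plain top-k mixture-of-experts layer looks roughly like the sketch below (a generic illustration with linear experts, not the Flex-MoE architecture).

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (generic sketch, not Flex-MoE)."""
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # scores each expert per input
        self.k = k

    def forward(self, x):                       # x: (batch, dim)
        scores = self.gate(x)                   # (batch, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)       # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # inputs routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```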
A curated list of vision-and-language pre-training (VLP). :-)