There are 25 repositories under the multi-modal-learning topic.
An open source implementation of CLIP (a minimal usage sketch appears after this list).
Chinese version of CLIP, enabling Chinese cross-modal retrieval and representation generation.
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
A concise but complete implementation of CLIP with various experimental improvements from recent papers
A curated list of Visual Question Answering (VQA, covering image and video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
CVPR 2023-2024 Papers: Dive into advanced research presented at the leading computer vision conference. Keep up to date with the latest developments in computer vision and deep learning. Code included. ⭐ Support visual intelligence development!
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey
Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
The official repository of Achelous and Achelous++
[NeurIPS 2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
A Python tool to perform deep learning experiments on multimodal remote sensing data.
PyTorch version of the HyperDenseNet deep neural network for multi-modal image segmentation
An official implementation of Advancing Radiograph Representation Learning with Masked Record Modeling (ICLR'23)
[NeurIPS 2023] A faithful benchmark for vision-language compositionality
Japanese CLIP by rinna Co., Ltd.
A curated list of vision-and-language pre-training (VLP). :-)
Code for the IEEE Signal Processing Letters 2022 paper "UAVM: Towards Unifying Audio and Visual Models".
This repository contains code to download data for the preprint "MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning"
[ICLR 2024 Spotlight] This is the official code for the paper "SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training"
MMEA: Entity Alignment for Multi-Modal Knowledge Graphs, KSEM 2020
[arXiv'23] HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
Multi-modal Object Re-identification
SAM-SLR-v2 is an improved version of SAM-SLR for sign language recognition.
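Several of the entries above are CLIP-style contrastive vision-language models. As a general illustration only (not the API of any specific repository listed), a minimal zero-shot image-text matching sketch using the Hugging Face `transformers` CLIP interface might look like the following; the checkpoint name and image URL are assumptions chosen for the example.

```python
# Minimal sketch of CLIP-style zero-shot image-text matching via the
# Hugging Face `transformers` CLIP interface. The checkpoint name and
# image URL below are illustrative assumptions, not taken from any
# repository in the list above.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image (assumed URL) and candidate captions.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, softmaxed into pseudo-probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```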