There are 40 repositories under the multimodal-deep-learning topic.
LAVIS - A One-stop Library for Language-Vision Intelligence
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
A flexible package for combining tabular data with text and images using Wide and Deep models in PyTorch
Collects the latest CVPR (Conference on Computer Vision and Pattern Recognition) results, including papers, code, and demo videos; recommendations welcome!
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
awesome grounding: A curated list of research papers in visual grounding
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
This repository contains various models targeting multimodal representation learning and multimodal fusion for downstream tasks such as multimodal sentiment analysis.
Official implementation for "Blended Latent Diffusion" [SIGGRAPH 2023]
A collection of parameter-efficient transfer learning papers focusing on computer vision and multimodal domains.
A collection of resources on applications of multi-modal learning in medical imaging.
Collects the latest ECCV (European Conference on Computer Vision) results, including papers, code, and demo videos; recommendations welcome!
Recent Advances in Vision and Language Pre-training (VLP)
Deep learning based content moderation from text, audio, video & image input modalities.
Multimodal Sarcasm Detection Dataset
List of academic resources on Multimodal ML for Music
A survey of multimodal learning research.
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
TensorFlow implementation of "Multimodal Speech Emotion Recognition using Audio and Text," IEEE SLT-18
A comprehensive reading list for Emotion Recognition in Conversations
Paper List of Pre-trained Foundation Recommender Models
[CVPR'22 Best Paper Finalist] Official PyTorch implementation of the method presented in "Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation"
Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".
This repository contains the code for a video captioning system inspired by "Sequence to Sequence -- Video to Text". The system takes a video as input and generates an English caption describing it.
PyTorch implementation of Multimodal Fusion Transformer for Remote Sensing Image Classification.
A Python package housing a collection of deep-learning multi-modal data fusion method pipelines! From data loading, to training, to evaluation - fusilli's got you covered 🌸
A curated list of awesome vision and language resources for earth observation.
Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!