There are 21 repositories under the multi-modal topic.
ModelScope: bring the notion of Model-as-a-Service to life.
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in PyTorch
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. A commercially usable open-source multimodal dialogue model approaching GPT-4o performance.
Chinese and English multimodal conversational language model
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Start building LLM-empowered multi-agent applications in an easier way.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
Mixture-of-Experts for Large Vision-Language Models
A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI-moderation, custom indexes/knowledgebase, YouTube summarizer, and more!
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
🥂 Gracefully face hCaptcha challenge with MoE(ONNX) embedded solution.
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
Recent Transformer-based CV and related works.
A collection of industry classics and cutting-edge papers in the field of recommendation/advertising/search.
The TypeScript library for building AI applications.
[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 30+ benchmarks
FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability
A curated list of Visual Question Answering (VQA) (Image/Video Question Answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV 2024
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.
Source code for "Taming Visually Guided Sound Generation" (Oral at BMVC 2021)