There are 31 repositories under the multi-modal topic.
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
AgentScope: Agent-Oriented Programming for Building LLM Applications
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Open-source framework for conversational voice AI agents
ModelScope: bring the notion of Model-as-a-Service to life.
AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. Includes AI personas, AGI functions, world-class Beam multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, much more. Deploy on-prem or in the cloud.
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in PyTorch
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
Chinese and English multimodal conversational language model
A C#/.NET library to run LLMs (🦙LLaMA/LLaVA) on your local device efficiently.
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models
A collection of industry classics and cutting-edge papers in the field of recommendation/advertising/search.
A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI moderation, custom indexes/knowledge base, YouTube summarizer, and more!
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
Recent Transformer-based CV and related works.
The TypeScript library for building AI applications.
[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & Vertical Distillation of LLMs.
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Use late-interaction multi-modal models such as ColPali in just a few lines of code.