The following repositories fall under the multi-modal topic.
ModelScope: bring the notion of Model-as-a-Service to life.
Implementation/replication of DALL-E, OpenAI's text-to-image transformer, in PyTorch.
Chinese and English multimodal conversational language model.
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
A C#/.NET library to run LLM models (🦙LLaMA/LLaVA) on your local device efficiently.
A robust, all-in-one GPT interface for Discord: ChatGPT-style conversations, image generation, AI moderation, custom indexes/knowledge bases, YouTube summarization, and more!
Mixture-of-Experts for Large Vision-Language Models
A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ 🍸 🍹 🍷
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
🥂 Gracefully face hCaptcha challenges with an embedded MoE (ONNX) solution.
A collection of recent Transformer-based computer vision (CV) and related works.
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
A collection of industry-classic and cutting-edge papers in recommendation, advertising, and search.
The TypeScript library for building AI applications.
A curated list of Visual Question Answering (VQA, covering image/video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
An easier way to start building LLM-empowered multi-agent applications.
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 30+ Hugging Face models, and 15+ benchmarks.
Official implementation of the paper "You Only Need Adversarial Supervision for Semantic Image Synthesis" (ICLR 2021)
Code for the IJCAI 2021 paper "Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion".
[ECCV 2020 Spotlight] A Simple and Versatile Framework for Image-to-Image Translation
FashionCLIP is a CLIP-like model fine-tuned for the fashion domain.
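Several of the entries above (Chinese-CLIP, FashionCLIP, VLMEvalKit) revolve around CLIP-style models, which embed text and images into a shared vector space so that retrieval reduces to nearest-neighbour search. The following is a minimal sketch of that retrieval step only, using made-up toy vectors in place of real model outputs; the `retrieve` helper and the embeddings are illustrative assumptions, not any repository's actual API.

```python
import numpy as np

def retrieve(text_emb, image_embs):
    """Rank images by cosine similarity to a text query embedding.

    CLIP-style models map text and images into a shared space; after
    L2-normalization, the dot product is the cosine similarity, and
    cross-modal retrieval is a nearest-neighbour search over it.
    """
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ t                # cosine similarity per image
    order = np.argsort(-scores)     # indices of images, best match first
    return order, scores

# Toy 4-dimensional embeddings standing in for real model outputs.
text = np.array([1.0, 0.0, 0.0, 0.0])
images = np.array([
    [0.9, 0.1, 0.0, 0.0],   # nearly aligned with the query
    [0.0, 1.0, 0.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0, 0.0],   # partially aligned
])

order, scores = retrieve(text, images)
print(order[0])  # → 0 (the nearly aligned image ranks first)
```

In a real pipeline, `text` and `images` would come from the model's text and image encoders (e.g. a CLIP checkpoint loaded via an inference library); the ranking logic itself stays exactly this simple.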