There are 31 repositories under the multi-modality topic.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
✨✨ Latest Advances on Multimodal Large Language Models
🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
Simple command line tool for text to image generation using OpenAI's CLIP and Siren (an implicit neural representation network). The technique was originally created by https://twitter.com/advadnoun
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Algorithms and Publications on 3D Object Tracking
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
The open-source implementation of Gemini, the Google model that will "eclipse ChatGPT"
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
[CVPR 2023] Collaborative Diffusion
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
An open-source implementation for training LLaVA-NeXT.
Effortless plug-and-play optimizer to cut model training costs by 50%. A new optimizer that is 2x faster than Adam on LLMs.
Official repository for VisionZip (CVPR 2025)
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
An official PyTorch implementation of the CRIS paper
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)
This repo contains the official code of our work SAM-SLR, which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.
Official code for NeurIPS2023 paper: CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
[ESSD 2025] BRIGHT: A globally distributed multimodal VHR dataset for all-weather disaster response
[CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs
(NeurIPS 2022 CellSeg Challenge - 1st Winner) Open source code for "MEDIAR: Harmony of Data-Centric and Model-Centric for Multi-Modality Microscopy"
An all-new language model that processes ultra-long sequences of 100,000+ tokens, ultra-fast.
Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!
This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.
Implementation of MambaByte from the paper "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta
Implementation of MoE-Mamba from the paper "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts" in PyTorch and Zeta