There are 31 repositories under the multi-modality topic.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
✨✨ Latest Advances on Multimodal Large Language Models
🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
Simple command line tool for text to image generation using OpenAI's CLIP and Siren (an implicit neural representation network). The technique was originally created by https://twitter.com/advadnoun
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Algorithms and Publications on 3D Object Tracking
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
The open-source implementation of Gemini, the Google model that will "eclipse ChatGPT"
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
[CVPR 2023] Collaborative Diffusion
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
An open-source implementation for training LLaVA-NeXT.
Effortless plug-and-play optimizer to cut model training costs by 50%. A new optimizer that is 2x faster than Adam on LLMs.
Official repository for VisionZip (CVPR 2025)
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
An official PyTorch implementation of the CRIS paper
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)
This repo contains the official code of our work SAM-SLR, which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.
Official code for NeurIPS2023 paper: CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
[ESSD 2025] BRIGHT: A globally distributed multimodal VHR dataset for all-weather disaster response
[CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs
(NeurIPS 2022 CellSeg Challenge - 1st Winner) Open source code for "MEDIAR: Harmony of Data-Centric and Model-Centric for Multi-Modality Microscopy"
An all-new language model that processes ultra-long sequences of 100,000+ tokens, ultra-fast.
Seed, Code, Harvest: Grow Your Own App with Tree of Thoughts!
This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.
Implementation of MambaByte from the paper "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta
Implementation of MoE-Mamba from the paper "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts" in PyTorch and Zeta