lntzm's starred repositories

Awesome-Multimodal-Large-Language-Models

✨✨ Latest Advances on Multimodal Large Language Models

Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Language: Python | License: MIT | Stargazers: 3558 | Issues: 100 | Issues: 160

Vim

[ICML 2024] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Language: Python | License: Apache-2.0 | Stargazers: 2868 | Issues: 30 | Issues: 111

Transformer-Explainability

[CVPR 2021] Official PyTorch implementation for Transformer Interpretability Beyond Attention Visualization, a novel method to visualize classifications by Transformer based networks.

Language: Jupyter Notebook | License: MIT | Stargazers: 1771 | Issues: 21 | Issues: 62

VideoX

VideoX: a collection of video cross-modal models

Language: Python | License: NOASSERTION | Stargazers: 968 | Issues: 21 | Issues: 112

VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Language: Python | License: Apache-2.0 | Stargazers: 756 | Issues: 9 | Issues: 82

PLLaVA

Official repository for the paper PLLaVA

MovieChat

[CVPR 2024] 🎬💭 chat with over 10K frames of video!

Language: Python | License: BSD-3-Clause | Stargazers: 499 | Issues: 10 | Issues: 75

SAM-6D

[CVPR2024] Code for "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation".

UniVTG

[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding

Language: Python | License: MIT | Stargazers: 315 | Issues: 6 | Issues: 46

GroundingGPT

[ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model

Language: Python | License: Apache-2.0 | Stargazers: 286 | Issues: 14 | Issues: 10

DEADiff

[CVPR 2024] Official implementation of "DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations"

Language: Python | License: Apache-2.0 | Stargazers: 214 | Issues: 11 | Issues: 17

VTimeLLM

[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".

Language: Python | License: NOASSERTION | Stargazers: 208 | Issues: 2 | Issues: 35

QD-DETR

Official PyTorch repository for "QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 Paper)

Language: Python | License: NOASSERTION | Stargazers: 196 | Issues: 4 | Issues: 44

Awesome_Long_Form_Video_Understanding

Awesome papers & datasets specifically focused on long-term videos.

TriDet

[CVPR2023] Code for the paper, TriDet: Temporal Action Detection with Relative Boundary Modeling

Language: Python | License: MIT | Stargazers: 161 | Issues: 3 | Issues: 37

HBI

[CVPR 2023 Highlight] Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Language: Python | License: Apache-2.0 | Stargazers: 103 | Issues: 4 | Issues: 7

BACL

Balanced Classification: A Unified Framework for Long-Tailed Object Detection (TMM 2023)

Language: Python | License: Apache-2.0 | Stargazers: 95 | Issues: 4 | Issues: 8

ProST

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval (ICCV 2023 Oral)

Language: Python | License: Apache-2.0 | Stargazers: 89 | Issues: 3 | Issues: 7

HiREST

Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)

Language: Python | License: MIT | Stargazers: 89 | Issues: 5 | Issues: 12

MomentDiff

MomentDiff: Generative Video Moment Retrieval from Random to Real (NeurIPS 2023)

Language: Python | License: NOASSERTION | Stargazers: 73 | Issues: 4 | Issues: 9

VideoTree

Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"

Language: Python | License: MIT | Stargazers: 70 | Issues: 2 | Issues: 4

ViGA

"Video Moment Retrieval from Text Queries via Single Frame Annotation" in SIGIR 2022.

Language: Python | License: MIT | Stargazers: 64 | Issues: 2 | Issues: 5

DHVT

This is an official implementation of our NeurIPS 2022 paper "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets".

Language: Python | License: Apache-2.0 | Stargazers: 51 | Issues: 3 | Issues: 2

LocVTP

[ECCV 22] LocVTP: Video-Text Pre-training for Temporal Localization

Language: Python | License: Apache-2.0 | Stargazers: 38 | Issues: 2 | Issues: 8

LPV

The official code of Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition (IJCAI2023)

Language: Python | License: MIT | Stargazers: 26 | Issues: 3 | Issues: 3

MS-DETR

An official implementation for MS-DETR in ACL'23

SSM

[IJCAI-2024] The official code of Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Language-Enhanced-CLIP-For-Multi-label-Image-Recognition

3rd Place, Visual Prompt Tuning Challenge @ CVPR 2023 HIT Workshop (2023)

Language: Python | License: MIT | Stargazers: 6 | Issues: 2 | Issues: 0