wh0330's starred repositories

MA-LMM

(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Language: Python | License: MIT | Stargazers: 136 | Issues: 0

VTimeLLM

[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".

Language: Python | License: NOASSERTION | Stargazers: 136 | Issues: 0

Awesome-Multimodal-Large-Language-Models

Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.

Stargazers: 9573 | Issues: 0

AppAgent

AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.

Language: Python | License: MIT | Stargazers: 4415 | Issues: 0

VGT

Video Graph Transformer for Video Question Answering (ECCV'22)

Language: Python | License: Apache-2.0 | Stargazers: 43 | Issues: 0

PDVC

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

Language: Python | License: MIT | Stargazers: 190 | Issues: 0

hgr_v2t

Code accompanying the paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning".

Language: Python | License: MIT | Stargazers: 206 | Issues: 0

ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

Language: Python | License: MIT | Stargazers: 689 | Issues: 0

bottom-up-attention.pytorch

A PyTorch reimplementation of bottom-up-attention models

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 289 | Issues: 0

SceneGraphParser

A Python toolkit for parsing natural-language captions into scene graphs (symbolic representations).

Language: Python | License: MIT | Stargazers: 511 | Issues: 0

HRNAT

Hierarchical Representation Network with Auxiliary Tasks for Video Captioning and Video Question Answering

Language: Python | Stargazers: 9 | Issues: 0

Graph-Optimal-Transport

Code for ICML 2020 "Graph Optimal Transport for Cross-Domain Alignment"

Language: Python | License: MIT | Stargazers: 149 | Issues: 0

SUTD-TrafficQA

[CVPR2021] SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

Language: JavaScript | Stargazers: 43 | Issues: 0

vRGV

Visual Relation Grounding in Videos (ECCV'20, Spotlight)

Language: Python | Stargazers: 57 | Issues: 0

VidHOI

Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021)

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 46 | Issues: 0

NExT-QA

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)

Language: Python | License: MIT | Stargazers: 99 | Issues: 0

cheatsheets

Official Matplotlib cheat sheets

Language: Python | License: BSD-2-Clause | Stargazers: 7270 | Issues: 0

FINCH-Clustering

Source Code for FINCH Clustering Algorithm

Language: Jupyter Notebook | License: NOASSERTION | Stargazers: 313 | Issues: 0

mmt

Multi-Modal Transformer for Video Retrieval

Language: Python | License: Apache-2.0 | Stargazers: 249 | Issues: 0

Mixture-of-Embedding-Experts

Mixture-of-Embeddings-Experts

Language: Python | License: Apache-2.0 | Stargazers: 119 | Issues: 0

NeXtVLAD.pytorch

PyTorch implementation of NetVlad for classification on UCF101

Language: Python | Stargazers: 26 | Issues: 0

DRN

Dense Regression Network for Video Grounding (CVPR2020)

Language: Python | Stargazers: 50 | Issues: 0

CCL

PyTorch implementation of the paper "Distilling Audio-Visual Knowledge by Compositional Contrastive Learning" (CVPR 2021)

Language: Python | License: Apache-2.0 | Stargazers: 84 | Issues: 0

CrossViT-pytorch

Implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Language: Python | License: MIT | Stargazers: 174 | Issues: 0

SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

Language: Python | License: MIT | Stargazers: 3272 | Issues: 0

PoseFormer

Official implementation of the paper "3D Human Pose Estimation with Spatial and Temporal Transformers".

Language: Python | Stargazers: 471 | Issues: 0

BiST

Code for the paper "BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues" (EMNLP'20)

Language: Python | Stargazers: 10 | Issues: 0

visdial_conv

This repository contains code used in our ACL'20 paper "History for Visual Dialog: Do we really need it?"

Language: Jupyter Notebook | License: BSD-3-Clause | Stargazers: 33 | Issues: 0