wh0330's starred repositories

MA-LMM

(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Language: Python | MIT | 136 | 0 | 0

VTimeLLM

[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".

Language: Python | NOASSERTION | 136 | 0 | 0

Awesome-Multimodal-Large-Language-Models

Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.

9573 | 0 | 0

AppAgent

AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.

Language: Python | MIT | 4415 | 0 | 0

VGT

Video Graph Transformer for Video Question Answering (ECCV'22)

Language: Python | Apache-2.0 | 43 | 0 | 0

PDVC

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

Language: Python | MIT | 190 | 0 | 0

hgr_v2t

Code accompanying the paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning".

Language: Python | MIT | 206 | 0 | 0

ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

Language: Python | MIT | 689 | 0 | 0

bottom-up-attention.pytorch

A PyTorch reimplementation of bottom-up-attention models

Language: Jupyter Notebook | Apache-2.0 | 289 | 0 | 0

SceneGraphParser

A Python toolkit for parsing natural-language captions into scene graphs (symbolic representations).

Language: Python | MIT | 511 | 0 | 0

HRNAT

Hierarchical Representation Network with Auxiliary Tasks for Video Captioning and Video Question Answering

Language: Python | 9 | 0 | 0

Graph-Optimal-Transport

Code for ICML 2020 "Graph Optimal Transport for Cross-Domain Alignment"

Language: Python | MIT | 149 | 0 | 0

SUTD-TrafficQA

[CVPR2021] SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

Language: JavaScript | 43 | 0 | 0

vRGV

Visual Relation Grounding in Videos (ECCV'20, Spotlight)

Language: Python | 57 | 0 | 0

VidHOI

Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021)

Language: Jupyter Notebook | Apache-2.0 | 46 | 0 | 0

NExT-QA

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)

Language: Python | MIT | 99 | 0 | 0

cheatsheets

Official Matplotlib cheat sheets

Language: Python | BSD-2-Clause | 7270 | 0 | 0

FINCH-Clustering

Source Code for FINCH Clustering Algorithm

Language: Jupyter Notebook | NOASSERTION | 313 | 0 | 0

mmt

Multi-Modal Transformer for Video Retrieval

Language: Python | Apache-2.0 | 249 | 0 | 0

Mixture-of-Embedding-Experts

Mixture-of-Embeddings-Experts

Language: Python | Apache-2.0 | 119 | 0 | 0

NeXtVLAD.pytorch

PyTorch implementation of NetVLAD for classification on UCF101

Language: Python | 26 | 0 | 0

DRN

Dense Regression Network for Video Grounding (CVPR2020)

Language: Python | 50 | 0 | 0

CCL

PyTorch implementation of the paper "Distilling Audio-Visual Knowledge by Compositional Contrastive Learning" (CVPR 2021)

Language: Python | Apache-2.0 | 84 | 0 | 0

CrossViT-pytorch

Implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Language: Python | MIT | 174 | 0 | 0

CVPR-2021-Papers

2541 | 0 | 0

SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

Language: Python | MIT | 3272 | 0 | 0

Multi-HT100M

52 | 0 | 0

PoseFormer

Official implementation of the paper "3D Human Pose Estimation with Spatial and Temporal Transformers".

Language: Python | 471 | 0 | 0

BiST

Code for the paper "BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues" (EMNLP'20)

Language: Python | 10 | 0 | 0

visdial_conv

This repository contains code used in our ACL'20 paper "History for Visual Dialog: Do we really need it?"

Language: Jupyter Notebook | BSD-3-Clause | 33 | 0 | 0