Aaron Han's repositories
ACoLP
Open Set Video HOI Detection from Action-centric Chain-of-Look Prompting (ICCV 2023)
Chat-UniVi
[CVPR 2024🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
ChatGLM-6B
ChatGLM-6B: An Open Bilingual Dialogue Language Model
EVA
EVA Series: Visual Representation Fantasies from BAAI
explore-eqa
Public release for "Explore until Confident: Efficient Exploration for Embodied Question Answering"
flash-attention
Fast and memory-efficient exact attention
Glance-Focus
Source code for "Glance and Focus: Memory Prompting for Multi-Event Video Question Answering" (NeurIPS 2023)
InvReg
Invariant Feature Regularization for Fair Face Recognition (ICCV'23)
LangRepo
Language Repository for Long Video Understanding
LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning: LLaVA (Large Language-and-Vision Assistant) built towards GPT-4V level capabilities.
LLM-Adapters
Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
LLoVi
Official implementation for "A Simple LLM Framework for Long-Range Video Question-Answering"
LSTP-Chat
A Video Chat Agent with Temporal Prior
MA-LMM
[CVPR 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
memorizing-transformers-pytorch
Implementation of Memorizing Transformers (ICLR 2022), an attention network augmented with indexing and retrieval of memories using approximate nearest neighbors, in PyTorch
mm-cot
Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned and more will be updated)
MovieChat
[CVPR 2024] 🎬💭 chat with over 10K frames of video!
NExT-GQA
Can I Trust Your Answer? Visually Grounded VideoQA (Accepted to CVPR'24)
self-rag
This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
SeViLA
Self-Chained Image-Language Model for Video Localization and Question Answering
StatisticalLearning_USTC
Review materials for the Statistical Learning course (taught by Dong Liu) at USTC
Video-ChatGPT
"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
VideoTree
Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"
VidToMe
Official PyTorch implementation of "VidToMe: Video Token Merging for Zero-Shot Video Editing" (CVPR 2024)
VTimeLLM
Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".