ZhangYuanhan-AI

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering

Language: Python · NOASSERTION · 1172 12 27

ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

Language: Python · NOASSERTION · 936 111 38

VideoMamba

[ECCV2024] VideoMamba: State Space Model for Efficient Video Understanding

Language: Python · Apache-2.0 · 786 12 87

ring-flash-attention

Ring attention implementation with flash attention

Language: Python · MIT · 536 10 32

ring-attention-pytorch

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch

Language: Python · MIT · 453 11 14

prismatic-vlms

A flexible and efficient codebase for training visually-conditioned language models (VLMs)

Language: Python · MIT · 415 12 38

Video-MME

✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

370 5 29

ttt-lm-jax

Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Language: Python · 345 9 10

scaling_on_scales

When do we not need larger vision models?

Language: Python · MIT · 316 7 14

LongVA

Long Context Transfer from Language to Vision

Language: Python · Apache-2.0 · 295 8 21

Dataset

News: the 10k dataset is ready for download.

Language: HTML · NOASSERTION · 274 13 31

TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Language: Python · BSD-3-Clause · 267 5 45

VideoRecap

Language: Python · MIT · 155 4 14

Vript

Language: Python · NOASSERTION · 113 1 8

LLaVA-Hound-DPO

Language: Python · 111 5 13

video_captioning_datasets

Summary about Video-to-Text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review*

Language: Jupyter Notebook · 109 3 1

HD-VG-130M

The HD-VG-130M Dataset

106 6 5

PSG4D

4D Panoptic Scene Graph Generation (NeurIPS'23 Spotlight)

Language: Python · 84 4 6

Genixer

(ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator

Language: Python · 77 30

MATH-V

MATH-Vision dataset and code to measure Multimodal Mathematical Reasoning capabilities.

Language: Python · MIT · 54 1 2

LongVideoBench

Official Dataloader and Evaluation Scripts for LongVideoBench.

Language: Python · 51 0 0

MMLongBench-Doc

Official Repository of MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations

Language: Python · Apache-2.0 · 48 0 0

CVRR-Evaluation-Suite

Official repository of paper titled "How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs".

Language: Python · CC-BY-4.0 · 39 0 0