linzhiqiu's starred repositories

Cosmos

New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos

License: Apache-2.0 | Stargazers: 8055 | Issues: 85 | Issues: 0

mochi

The best OSS video generation models, created by Genmo

Language: Python | License: Apache-2.0 | Stargazers: 3424 | Issues: 44 | Issues: 117

lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Language: Python | License: NOASSERTION | Stargazers: 3101 | Issues: 6 | Issues: 377

Awesome-LLM-Post-training

Awesome Reasoning LLM Tutorial/Survey/Guide

ml-aim

This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects.

Language: Python | License: NOASSERTION | Stargazers: 1366 | Issues: 27 | Issues: 32

pasa

PaSa -- an advanced paper search agent powered by large language models. It can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries.

Language: Python | License: Apache-2.0 | Stargazers: 1338 | Issues: 9 | Issues: 13

Sa2VA

🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Language: Python | License: Apache-2.0 | Stargazers: 1259 | Issues: 23 | Issues: 52

mega-sam

Code for the project "MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos"

Language: Python | License: Apache-2.0 | Stargazers: 1048 | Issues: 51 | Issues: 35

VideoLLaMA3

Frontier Multimodal Foundation Models for Image and Video Understanding

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 985 | Issues: 12 | Issues: 85

LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Language: Python | License: MIT | Stargazers: 828 | Issues: 14 | Issues: 67

tarsier

Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, with strong general video understanding capabilities.

Language: Python | License: Apache-2.0 | Stargazers: 470 | Issues: 8 | Issues: 32

LongVU

[ICML 2025] Official PyTorch implementation of LongVU

Language: Python | License: Apache-2.0 | Stargazers: 398 | Issues: 4 | Issues: 44

PerspectiveFields

[CVPR 2023 Highlight] Perspective Fields for Single Image Camera Calibration

Language: Jupyter Notebook | License: NOASSERTION | Stargazers: 272 | Issues: 7 | Issues: 19

superclass

[NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training

Language: Python | License: Apache-2.0 | Stargazers: 216 | Issues: 7 | Issues: 13

NaturalBench

🚀 [NeurIPS24] Make Vision Matter in Visual-Question-Answering (VQA)! Introducing NaturalBench, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.

Language: Python | Stargazers: 85 | Issues: 10 | Issues: 0

MAmmoTH-VL

(ACL 2025) MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Language: Python | Stargazers: 48 | Issues: 0 | Issues: 2

viddiff

[ICLR 2025] Video Action Differencing

Language: Python | Stargazers: 44 | Issues: 1 | Issues: 0

MotionBench

Official code for MotionBench (CVPR 2025)

Language: Python | License: Apache-2.0 | Stargazers: 32 | Issues: 8 | Issues: 4

VidComposition

[CVPR 2025] VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

License: Apache-2.0 | Stargazers: 27 | Issues: 0 | Issues: 0

GPS2Pix

[CVPR 2025] GPS as a Control Signal for Image Generation

Stargazers: 15 | Issues: 0 | Issues: 0

SAVs

Official Codebase for "Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers"

Language: Python | Stargazers: 13 | Issues: 0 | Issues: 0

T2I-Probology

Experimental results + resources for probing compositional structure in generative text-to-image (T2I) models

License: GPL-3.0 | Stargazers: 3 | Issues: 1 | Issues: 0

cece-vlm

Code for Natural Language Inference Improves Compositionality in Vision-Language Models

License: MIT | Stargazers: 3 | Issues: 3 | Issues: 0

t2v_metrics

Evaluating text-to-image/video/3D models with VQAScore

Language: Jupyter Notebook | License: Apache-2.0 | Stargazers: 2 | Issues: 0 | Issues: 0