Zechen Bai (JosephPai)

Company: NUS

Location: Singapore

Home Page: www.baizechen.site

Twitter: @ZechenBai


Zechen Bai's starred repositories

VidProM

[NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Stargazers: 103 · Issues: 0

VideoLISA

[NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Stargazers: 13 · Issues: 0

Awesome-Unified-Multimodal-Models

📖 A repository for organizing papers, code, and other resources related to unified multimodal models.

Stargazers: 152 · Issues: 0

AdaSlot

Official implementation of the CVPR'24 paper "Adaptive Slot Attention: Object Discovery with Dynamic Slot Number"

Language: Python · License: Apache-2.0 · Stargazers: 20 · Issues: 0

Show-o

Repository for Show-o, one single transformer to unify multimodal understanding and generation.

Language: Python · License: Apache-2.0 · Stargazers: 882 · Issues: 0

AI-Scientist

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery 🧑‍🔬

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 7690 · Issues: 0

MINT-1T

MINT-1T: A one trillion token multimodal interleaved dataset.

Stargazers: 740 · Issues: 0

sam2

This repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks showing how to use the model.

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 11296 · Issues: 0

SOLO

Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 105 · Issues: 0

fucking-algorithm

Cracking algorithm problems is all about patterns; labuladong is all you need! English version supported! Crack LeetCode, not only how, but also why.

Language: Markdown · Stargazers: 125261 · Issues: 0

SliME

✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Language: Python · License: Apache-2.0 · Stargazers: 132 · Issues: 0

enhancing-transformers

An unofficial implementation of both ViT-VQGAN and RQ-VAE in PyTorch

Language: Python · License: MIT · Stargazers: 280 · Issues: 0

cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

Language: Python · License: Apache-2.0 · Stargazers: 1704 · Issues: 0

chameleon

Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.

Language: Python · License: NOASSERTION · Stargazers: 1778 · Issues: 0

Open-MAGVIT2

Open-MAGVIT2: Democratizing Autoregressive Visual Generation

Language: Python · License: Apache-2.0 · Stargazers: 631 · Issues: 0

lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval

Language: Python · License: NOASSERTION · Stargazers: 1405 · Issues: 0

Awesome-World-Model

A collection of world-model papers for autonomous driving.

Stargazers: 426 · Issues: 0

gpt-computer-assistant

An intelligence development framework in Python for building Apple Intelligence-like features into your product

Language: Python · License: MIT · Stargazers: 5208 · Issues: 0

schedule_free

Schedule-Free Optimization in PyTorch

Language: Python · License: Apache-2.0 · Stargazers: 1835 · Issues: 0

World-Models-Autonomous-Driving-Latest-Survey

A curated list of world models for autonomous driving, kept up to date.

Stargazers: 153 · Issues: 0

supervision

We write your reusable computer vision tools. 💜

Language: Python · License: MIT · Stargazers: 22825 · Issues: 0

Lumina-T2X

Lumina-T2X is a unified framework for text-to-any-modality generation.

Language: Python · License: MIT · Stargazers: 2037 · Issues: 0

HALC

[ICML 2024] Official implementation for "HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding"

Language: Python · License: MIT · Stargazers: 66 · Issues: 0

Awesome-MLLM-Hallucination

📖 A curated list of resources dedicated to hallucination in multimodal large language models (MLLMs).

Stargazers: 389 · Issues: 0

LLaVA-pp

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

Language: Python · Stargazers: 798 · Issues: 0

XMem2

A tool for efficient semi-supervised video object segmentation (strong results with minimal manual labor) and a dataset for benchmarking

Language: Python · License: GPL-3.0 · Stargazers: 174 · Issues: 0

Tracking-Anything-with-DEVA

[ICCV 2023] Tracking Anything with Decoupled Video Segmentation

Language: Python · License: NOASSERTION · Stargazers: 1234 · Issues: 0

mergekit

Tools for merging pretrained large language models.

Language: Python · License: LGPL-3.0 · Stargazers: 4583 · Issues: 0