yeungchenwa

Zhenhua Yang's starred repositories

Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Language: Python · Apache-2.0 · 21444 179 454

Omost

Your image is almost there!

Language: Python · Apache-2.0 · 7147 44 75

Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Language: Python · MIT · 3545 100 160

LLaMA2-Accessory

An Open-source Toolkit for LLM Development

Language: Python · NOASSERTION · 2670 37 134

LLaVA-NeXT

Language: Python · Apache-2.0 · 2173 32 171

VLM_survey

Collection of AWESOME vision-language models for vision tasks

2139 120 7

Awesome-Text-to-Image

(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.

MIT · 2059 72 7

MambaOut

MambaOut: Do We Really Need Mamba for Vision?

Language: Python · Apache-2.0 · 1944 6 243

ShareGPT4Video

An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Language: Python · 1209 31 35

LlamaGen

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation

Language: Python · MIT · 1167 21 52

RAG-Survey

Collecting awesome papers of RAG for AIGC. We propose a taxonomy of RAG foundations, enhancements, and applications in paper "Retrieval-Augmented Generation for AI-Generated Content: A Survey".

1040 24 3

Awesome-LLM4AD

A curated list of awesome LLM for Autonomous Driving resources (continually updated)

Apache-2.0 · 856 38 5

VisionLLM

VisionLLM Series

Language: Python · Apache-2.0 · 820 42 13

alphafold3-pytorch

Implementation of Alphafold 3 in Pytorch

Language: Python · MIT · 819 40 28

Grounding-DINO-1.5-API

API for Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series

Language: Python · Apache-2.0 · 690 11 35

Campus2025

Summary of internet-industry campus recruitment information for the class of 2025

687 25 1

VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Language: Python · Apache-2.0 · 685 10 66

GenerativeImage2Text

GIT: A Generative Image-to-text Transformer for Vision and Language

Language: Python · MIT · 540 9 58

DriveAGI

[CVPR 2024 Highlight] GenAD: Generalized Predictive Model for Autonomous Driving & Foundation Models in Autonomous System

Language: Python · Apache-2.0 · 521 27 7

Awesome-World-Model

Collect some World Models for Autonomous Driving papers.

363 180

Inf-DiT

Official implementation of Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Language: Python · Apache-2.0 · 355 22 24

LayerDiffuse_DiffusersCLI

LayerDiffuse in pure diffusers without any GUI

Language: Python · Apache-2.0 · 288 7 9

TexTeller

TexTeller can convert image to latex formulas (image2latex, latex OCR) with higher accuracy and exhibits superior generalization ability, enabling it to cover most usage scenarios.

Language: Python · Apache-2.0 · 278 3 9

DocRes

[CVPR 2024] DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks

Language: Python · MIT · 263 6 9

mmdit

Implementation of a single layer of the MMDiT, proposed in Stable Diffusion 3, in Pytorch

Language: Python · MIT · 219 3 1

MotionLLM

[Arxiv-2024] MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Language: Python · NOASSERTION · 204 2 8

nxtp

Object Recognition as Next Token Prediction (CVPR 2024)

Language: Python · NOASSERTION · 147 2 5

VimTS

VimTS: A Unified Video and Image Text Spotter

Language: Python · GPL-3.0 · 69 2 5

UPOCR

Official implementation of UPOCR: Towards unified pixel-level OCR interface (ICML 2024)

Language: Python · 35 0 0

MegaHan97K

MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

Language: Python · 4 0 0