litwellchi

Xiaowei Chi's starred repositories

3D-VLA

[ICML 2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model

Language: Python · Stars: 324 · 0 · 0

phenaki-pytorch

Implementation of Phenaki Video, which uses MaskGIT to produce text-guided videos of up to 2 minutes in length, in PyTorch

Language: Python · License: MIT · Stars: 747 · 0 · 0

Awesome-Video-Robotic-Papers

This repository compiles a list of papers related to the application of video technology in the field of robotics! Star⭐ the repo and follow me if you like what you see🤩.

Stars: 114 · 0 · 0

videocrafter-training-pytorch

Training code for VideoCrafter.

Language: Python · License: NOASSERTION · Stars: 4 · 0 · 0

MMTrail-Pytorch

[Arxiv 2024] Official code for MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

Stars: 2 · 0 · 0

awesome-diffusion-model-in-rl

A curated list of Diffusion Model in RL resources (continually updated)

License: Apache-2.0 · Stars: 767 · 0 · 0

1xgpt

World modeling challenge for humanoid robots

Language: Python · License: Apache-2.0 · Stars: 323 · 0 · 0

Pointcept

Pointcept: a codebase for point cloud perception research. Latest works: PTv3 (CVPR'24 Oral), PPT (CVPR'24), OA-CNNs (CVPR'24), MSC (CVPR'23)

Language: Python · License: MIT · Stars: 1542 · 0 · 0

Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Language: Python · License: Apache-2.0 · Stars: 21826 · 0 · 0

Awesome-Embodied-AI

A curated list of awesome papers on Embodied AI and related research/industry-driven resources.

License: MIT · Stars: 263 · 0 · 0

AlignProp uses direct reward backpropagation to align large-scale text-to-image diffusion models. Our method is 25x more sample- and compute-efficient than reinforcement learning methods (PPO) for finetuning Stable Diffusion.

Language: Python · License: MIT · Stars: 233 · 0 · 0

MMTrail

[Arxiv 2024] Official code for MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

Stars: 22 · 0 · 0

1d-tokenizer

This repo contains the code for our paper "An Image is Worth 32 Tokens for Reconstruction and Generation"

Language: Jupyter Notebook · License: Apache-2.0 · Stars: 415 · 0 · 0

MMWorld

Official repo of the paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"

Language: Python · License: MIT · Stars: 20 · 0 · 0

calvin

CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

Language: Python · License: MIT · Stars: 373 · 0 · 0

open_flamingo

An open-source framework for training large multimodal models.

Language: Python · License: MIT · Stars: 3690 · 0 · 0

Pandora

Pandora: Towards General World Model with Natural Language Actions and Video States

Language: Python · Stars: 469 · 0 · 0

Era3D

Language: Python · License: AGPL-3.0 · Stars: 513 · 0 · 0

Awesome-Video-Datasets

Video datasets

Stars: 1157 · 0 · 0

lvm_datapipe

Data pipeline code for a large video generation model

Language: Python · Stars: 7 · 0 · 0

Seeing-and-Hearing

[CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Language: Python · License: NOASSERTION · Stars: 119 · 0 · 0

SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

Language: Python · Stars: 578 · 0 · 0

Latte

Latte: Latent Diffusion Transformer for Video Generation.

Language: Python · License: Apache-2.0 · Stars: 1658 · 0 · 0

LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

Language: Jupyter Notebook · License: NOASSERTION · Stars: 509 · 0 · 0

MMDialog

The official site of the paper "MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation"

Language: Python · Stars: 189 · 0 · 0

magvit

Official JAX implementation of MAGVIT: Masked Generative Video Transformer

Language: Python · License: Apache-2.0 · Stars: 946 · 0 · 0

magvit2-pytorch

Implementation of the MagViT2 tokenizer in PyTorch

Language: Python · License: MIT · Stars: 551 · 0 · 0

AnimateDiff

Official implementation of AnimateDiff.

Language: Python · License: Apache-2.0 · Stars: 10386 · 0 · 0

M2Chat

Language: Python · Stars: 32 · 0 · 0

Awesome-LLMs-meet-Multimodal-Generation

🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D, and audio).

Language: HTML · Stars: 318 · 0 · 0