ChaimZhu (ZCMax)

Company: HKU IDS | HKU-MMLab

Location: Hong Kong SAR

ChaimZhu's starred repositories

GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)

Language: Python | License: MIT | Stargazers: 25359 | Issues: 170 | Issues: 806

CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Language: Jupyter Notebook | License: MIT | Stargazers: 22691 | Issues: 312 | Issues: 382

llama3

The official Meta Llama 3 GitHub site

Language: Python | License: NOASSERTION | Stargazers: 20824 | Issues: 164 | Issues: 149

Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Language: Python | License: Apache-2.0 | Stargazers: 17075 | Issues: 155 | Issues: 262

Language: Python | License: NOASSERTION | Stargazers: 7928 | Issues: 149 | Issues: 0

rerun

Visualize streams of multimodal data. Fast, easy to use, and simple to integrate. Built in Rust using egui.

Language: Rust | License: Apache-2.0 | Stargazers: 5345 | Issues: 57 | Issues: 2522

MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Language: Python | License: Apache-2.0 | Stargazers: 3018 | Issues: 25 | Issues: 114

Awesome-LLMs-for-Video-Understanding

🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.

Awesome-LLM-3D

Awesome-LLM-3D: a curated list of Multi-modal Large Language Model in 3D world Resources

LLaVA-pp

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

Chat-UniVi

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Language: Python | License: Apache-2.0 | Stargazers: 654 | Issues: 7 | Issues: 32

LLaVA-Plus-Codebase

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Language: Python | License: Apache-2.0 | Stargazers: 639 | Issues: 10 | Issues: 23

VLMEvalKit

Open-source evaluation toolkit of large vision-language models (LVLMs), support GPT-4v, Gemini, QwenVLPlus, 40+ HF models, 20+ benchmarks

Language: Python | License: Apache-2.0 | Stargazers: 479 | Issues: 8 | Issues: 67

Mask3D

Mask3D predicts accurate 3D semantic instances achieving state-of-the-art on ScanNet, ScanNet200, S3DIS and STPLS3D.

Language: Python | License: MIT | Stargazers: 476 | Issues: 9 | Issues: 158

vstar

PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"

Language: Python | License: MIT | Stargazers: 443 | Issues: 10 | Issues: 13

qingwu-zimu

Qingwu Subtitles (青梧字幕) is a Whisper-based AI subtitle extraction tool.

Language: C++ | License: MIT | Stargazers: 371 | Issues: 4 | Issues: 2

omnidata

A Scalable Pipeline for Making Steerable Multi-Task Mid-Level Vision Datasets from 3D Scans [ICCV 2021]

Language: Jupyter Notebook | License: NOASSERTION | Stargazers: 364 | Issues: 10 | Issues: 58

PLLaVA

Official repository for the paper PLLaVA

Stratified-Transformer

Stratified Transformer for 3D Point Cloud Segmentation (CVPR 2022)

Language: Python | License: MIT | Stargazers: 346 | Issues: 6 | Issues: 97

Awesome-Open-AI-Sora

Sora AI Awesome List – Your go-to resource hub for all things Sora AI, OpenAI's groundbreaking model for crafting realistic scenes from text. Explore a curated collection of articles, videos, podcasts, and news about Sora's capabilities, advancements, and more.

License: Apache-2.0 | Stargazers: 196 | Issues: 5 | Issues: 0

probe3d

[CVPR 2024] Probing the 3D Awareness of Visual Foundation Models

Language: Python | License: MIT | Stargazers: 193 | Issues: 5 | Issues: 4

NExT-Chat

The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".

Language: Python | License: Apache-2.0 | Stargazers: 169 | Issues: 2 | Issues: 18

3D-VLA

Source codes for "3D-VLA: A 3D Vision-Language-Action Generative World Model"

Stargazers: 169 | Issues: 0 | Issues: 0

open-eqa

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Language: Jupyter Notebook | License: MIT | Stargazers: 157 | Issues: 11 | Issues: 5

multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.

Language: Python | License: Apache-2.0 | Stargazers: 143 | Issues: 3 | Issues: 14

VQASynth

Compose multimodal datasets 🎹

Language: Python | Stargazers: 82 | Issues: 0 | Issues: 0

act3d-chained-diffuser

A unified architecture for multimodal multi-task robotic policy learning.

Online3D

[CVPR 2024] Memory-based Adapters for Online 3D Scene Perception

Language: Python | License: MIT | Stargazers: 48 | Issues: 3 | Issues: 0

Language: JavaScript | Stargazers: 11 | Issues: 2 | Issues: 0