ZCMax

ChaimZhu's starred repositories

openvla

OpenVLA: An open-source vision-language-action model for robotic manipulation.

Language: Python | MIT | 369 | 0 | 0

acad-homepage.github.io

AcadHomepage: A Modern and Responsive Academic Personal Homepage

Language: SCSS | MIT | 995 | 0 | 0

conv-llava

Language: Python | Apache-2.0 | 70 | 0 | 0

gpu_poor

Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization

Language: JavaScript | 686 | 0 | 0

llama3-from-scratch

llama3 implementation one matrix multiplication at a time

Language: Jupyter Notebook | MIT | 10520 | 0 | 0

NS3D

Language: Python | 38 | 0 | 0

LLaVA-NeXT

Language: Python | 788 | 0 | 0

behavior-vision-suite.github.io

Language: CSS | MIT | 107 | 0 | 0

ml-visuals

🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.

MIT | 12641 | 0 | 0

Stratified-Transformer

Stratified Transformer for 3D Point Cloud Segmentation (CVPR 2022)

Language: Python | MIT | 350 | 0 | 0

rerun

Visualize streams of multimodal data. Fast, easy to use, and simple to integrate. Built in Rust using egui.

Language: Rust | Apache-2.0 | 5550 | 0 | 0

PLLaVA

Official repository for the paper PLLaVA

Language: Python | 432 | 0 | 0

LLaVA-pp

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

Language: Python | 718 | 0 | 0

omnidata

A Scalable Pipeline for Making Steerable Multi-Task Mid-Level Vision Datasets from 3D Scans [ICCV 2021]

Language: Jupyter Notebook | NOASSERTION | 372 | 0 | 0

GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)

Language: Python | MIT | 27629 | 0 | 0

3D-VLA

[ICML 2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model

Language: Python | 199 | 0 | 0

llama3

The official Meta Llama 3 GitHub site

Language: Python | NOASSERTION | 22229 | 0 | 0

probe3d

[CVPR 2024] Probing the 3D Awareness of Visual Foundation Models

Language: Python | MIT | 211 | 0 | 0

ml-ferret

Language: Python | NOASSERTION | 8160 | 0 | 0

open-eqa

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Language: Jupyter Notebook | MIT | 173 | 0 | 0

SceneVerse

Language: Python | MIT | 118 | 0 | 0

Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Latest Papers, Codes and Datasets on Vid-LLMs.

849 | 0 | 0

MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Language: Python | Apache-2.0 | 3053 | 0 | 0

LLaVA-Plus-Codebase

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Language: Python | Apache-2.0 | 649 | 0 | 0

VQASynth

Compose multimodal datasets 🎹

Language: Python | 109 | 0 | 0

multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.

Language: Python | Apache-2.0 | 154 | 0 | 0

Chat-UniVi

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Language: Python | Apache-2.0 | 688 | 0 | 0

VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs); supports GPT-4V, Gemini, QwenVLPlus, 50+ HF models, and 20+ benchmarks

Language: Python | Apache-2.0 | 565 | 0 | 0

CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

Language: Jupyter Notebook | MIT | 23197 | 0 | 0

act3d-chained-diffuser

A unified architecture for multimodal multi-task robotic policy learning.

Language: Python | 91 | 0 | 0