JiwenZhang (IMNearth)


Company: Fudan University

Location: Shanghai

Home Page: https://imnearth.github.io/


JiwenZhang's starred repositories

Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Language: Python · License: Apache-2.0 · Stargazers: 22108 · Issues: 186 · Issues: 490

MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

Language: Python · License: Apache-2.0 · Stargazers: 12432 · Issues: 102 · Issues: 565

UFO

A UI-Focused Agent for Windows OS Interaction.

Language: Python · License: MIT · Stargazers: 7838 · Issues: 707 · Issues: 19

MiniCPM

MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 7088 · Issues: 75 · Issues: 207

taming-transformers

Taming Transformers for High-Resolution Image Synthesis

Language: Jupyter Notebook · License: MIT · Stargazers: 5773 · Issues: 76 · Issues: 220

latent-consistency-model

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Language: Python · License: MIT · Stargazers: 4355 · Issues: 62 · Issues: 94

MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Language: Python · License: Apache-2.0 · Stargazers: 3202 · Issues: 28 · Issues: 131

MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Language: Python · License: MIT · Stargazers: 2918 · Issues: 49 · Issues: 58

Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Language: Python · License: Apache-2.0 · Stargazers: 2861 · Issues: 26 · Issues: 337

ReAct

[ICLR 2023] ReAct: Synergizing Reasoning and Acting in Language Models

Language: Jupyter Notebook · License: MIT · Stargazers: 1949 · Issues: 17 · Issues: 29

Emu

Emu Series: Generative Multimodal Models from BAAI

Language: Python · License: Apache-2.0 · Stargazers: 1653 · Issues: 21 · Issues: 88

mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Language: Python · License: Apache-2.0 · Stargazers: 1505 · Issues: 29 · Issues: 113

Qwen-Audio

The official repo of Qwen-Audio (通义千问-Audio), a chat and pretrained large audio-language model proposed by Alibaba Cloud.

Language: Python · License: NOASSERTION · Stargazers: 1462 · Issues: 25 · Issues: 66

SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.

Language: Python · License: Apache-2.0 · Stargazers: 987 · Issues: 32 · Issues: 79

Show-o

Repository for Show-o, a single Transformer that unifies multimodal understanding and generation.

Language: Python · License: Apache-2.0 · Stargazers: 984 · Issues: 14 · Issues: 42

MiniGPT-5

Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"

Language: Python · License: Apache-2.0 · Stargazers: 851 · Issues: 12 · Issues: 44

groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

self-refine

LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.

Language: Python · License: Apache-2.0 · Stargazers: 611 · Issues: 13 · Issues: 20

InstructDiffusion

PyTorch implementation of InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions.

Language: Python · License: NOASSERTION · Stargazers: 387 · Issues: 10 · Issues: 24

InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

Language: Python · License: MIT · Stargazers: 295 · Issues: 16 · Issues: 50

SeeClick

The model, data, and code for the visual GUI agent SeeClick

Language: HTML · License: Apache-2.0 · Stargazers: 208 · Issues: 2 · Issues: 43

android_world

AndroidWorld is an environment and benchmark for autonomous agents

Language: Python · License: Apache-2.0 · Stargazers: 119 · Issues: 3 · Issues: 9

screen_qa

The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico, and is intended for training and evaluating models for screen content understanding via question answering.

GUICourse

GUICourse: From General Vision Language Models to Versatile GUI Agents

TextHawk

Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

CoAT

Official implementation for "Android in the Zoo: Chain-of-Action-Thought for GUI Agents" (Findings of EMNLP 2024)

VoCoT

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models