JiwenZhang (IMNearth)


Company: Fudan University

Location: Shanghai

Home Page: https://imnearth.github.io/


JiwenZhang's starred repositories

Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Language: Python · License: Apache-2.0 · Stargazers: 22108 · Issues: 186 · Issues: 490

MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

Language: Python · License: Apache-2.0 · Stargazers: 12432 · Issues: 102 · Issues: 565

UFO

A UI-Focused Agent for Windows OS Interaction.

Language: Python · License: MIT · Stargazers: 7838 · Issues: 707 · Issues: 19

MiniCPM

MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.

Language: Jupyter Notebook · License: Apache-2.0 · Stargazers: 7088 · Issues: 75 · Issues: 207

taming-transformers

Taming Transformers for High-Resolution Image Synthesis

Language: Jupyter Notebook · License: MIT · Stargazers: 5773 · Issues: 76 · Issues: 220

latent-consistency-model

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Language: Python · License: MIT · Stargazers: 4355 · Issues: 62 · Issues: 94

MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Language: Python · License: Apache-2.0 · Stargazers: 3202 · Issues: 28 · Issues: 131

MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Language: Python · License: MIT · Stargazers: 2918 · Issues: 49 · Issues: 58

Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Language: Python · License: Apache-2.0 · Stargazers: 2861 · Issues: 26 · Issues: 337

ReAct

[ICLR 2023] ReAct: Synergizing Reasoning and Acting in Language Models

Language: Jupyter Notebook · License: MIT · Stargazers: 1949 · Issues: 17 · Issues: 29

Emu

Emu Series: Generative Multimodal Models from BAAI

Language: Python · License: Apache-2.0 · Stargazers: 1653 · Issues: 21 · Issues: 88

mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Language: Python · License: Apache-2.0 · Stargazers: 1505 · Issues: 29 · Issues: 113

Qwen-Audio

The official repo of Qwen-Audio (通义千问-Audio), a chat and pretrained large audio-language model proposed by Alibaba Cloud.

Language: Python · License: NOASSERTION · Stargazers: 1462 · Issues: 25 · Issues: 66

SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.

Language: Python · License: Apache-2.0 · Stargazers: 987 · Issues: 32 · Issues: 79

Show-o

Repository for Show-o, a single Transformer that unifies multimodal understanding and generation.

Language: Python · License: Apache-2.0 · Stargazers: 984 · Issues: 14 · Issues: 42

MiniGPT-5

Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"

Language: Python · License: Apache-2.0 · Stargazers: 851 · Issues: 12 · Issues: 44

groundingLMM

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

self-refine

LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.

Language: Python · License: Apache-2.0 · Stargazers: 611 · Issues: 13 · Issues: 20

InstructDiffusion

PyTorch implementation of InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions.

Language: Python · License: NOASSERTION · Stargazers: 387 · Issues: 10 · Issues: 24

InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

Language: Python · License: MIT · Stargazers: 295 · Issues: 16 · Issues: 50

SeeClick

The model, data, and code for the visual GUI agent SeeClick

Language: HTML · License: Apache-2.0 · Stargazers: 208 · Issues: 2 · Issues: 43

android_world

AndroidWorld is an environment and benchmark for autonomous agents

Language: Python · License: Apache-2.0 · Stargazers: 119 · Issues: 3 · Issues: 9

screen_qa

The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico, and is intended for training and evaluating models for screen content understanding via question answering.

GUICourse

GUICourse: From General Vision Language Models to Versatile GUI Agents

TextHawk

Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

CoAT

Official implementation for "Android in the Zoo: Chain-of-Action-Thought for GUI Agents" (Findings of EMNLP 2024)

VoCoT

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models