JiwenZhang's starred repositories
taming-transformers
Taming Transformers for High-Resolution Image Synthesis
latent-consistency-model
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
MobileAgent
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Qwen-Audio
The official repo of Qwen-Audio (通义千问-Audio), the chat and pretrained large audio language model proposed by Alibaba Cloud.
clash-for-linux
clash-for-linux
SwissArmyTransformer
SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.
groundingLMM
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
self-refine
LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
InstructDiffusion
PyTorch implementation of InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions.
android_world
AndroidWorld is an environment and benchmark for autonomous agents.
screen_qa
The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico. It is intended for training and evaluating models that understand screen content via question answering.