Natyren

George's repositories

Auto-GUI

Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)

Language: Python | License: Apache-2.0 | 000

cardie

An open source business card designer and sharing platform

License: GPL-3.0 | 000

conditional-flow-matching

TorchCFM: a Conditional Flow Matching library

Language: Python | License: MIT | 000

digirl

Official repo for the paper "DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning".

Language: Python | 000

Discffusion

Official repo for the paper "Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners"

License: MIT | 000

goodcatch

An open-source attempt to implement a tiny vision-language model that works well with text-rich images

010

InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

000

kosmos-2.5-gradio

Script for easy inference (from the bbox) and deployment of kosmos-2.5

License: Apache-2.0 | 000

lerobot

🤗 LeRobot: End-to-end Learning for Real-World Robotics in Pytorch

License: Apache-2.0 | 000

llama2d

2D Positional Embeddings for Webpage Structural Understanding 🦙👀

Language: Python | License: GPL-3.0 | 000

LLaVA-NeXT

000

MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

License: Apache-2.0 | 000

ml-ferret

License: NOASSERTION | 000

mmbench-ru-eval

Repository for simple evaluation of your results on MMBench-DEV-RU

Language:Python000

MoneyPrinterTurbo

Generate short videos with one click using AI LLM.

Language: Python | License: MIT | 000

moondream

tiny vision language model

000

mpa-archive

Crawls a Multi-Page Application into a zip file and serves the Multi-Page Application from the zip file. An MPA archiver. Could be used as a site generator.

License: MIT | 000

natyren.github.io

010

Open-LLaVA-NeXT

An open-source implementation for training LLaVA-NeXT.

000

RL4VLM

Official Repo for Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

License: MIT | 000

screenshot-to-code

Drop in a screenshot and convert it to clean code (HTML/Tailwind/React/Vue)

License: MIT | 000

SeeAct

SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).

Language: Python | License: NOASSERTION | 000

SeeClick

The model, data and code for the visual GUI Agent SeeClick

Language: HTML | 000

self-operating-computer

A framework to enable multimodal models to operate a computer.

License: MIT | 000

text-generation-webui

A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

Language: Python | License: AGPL-3.0 | 000

transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Language: Python | License: Apache-2.0 | 000

trl

Train transformer language models with reinforcement learning.

License: Apache-2.0 | 000

vimGPT

Browse the web with GPT-4V and Vimium

Language: Python | License: MIT | 000

VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ HF models, and 20+ benchmarks

Language: Python | License: Apache-2.0 | 000

YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection

License: GPL-3.0 | 000