dgo2dance / MobileAgent

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception



Junyang Wang¹, Haiyang Xu²†, Jiabo Ye², Ming Yan²†,
Weizhou Shen², Ji Zhang², Fei Huang², Jitao Sang¹†
{junyangwang, jtsang}@bjtu.edu.cn, {shuofeng.xhy, ym119608}@alibaba-inc.com

¹Beijing Jiaotong University ²Alibaba Group
†Corresponding author

📋Introduction

  • Pure visual solution, independent of XML and system metadata.
  • Unrestricted operation scope, capable of multi-app operations.
  • Multiple visual perception tools for operation localization.
  • No need for exploration and training, plug and play.

📢News

  • [1.31] 🔥Our code is available! Welcome to try Mobile-Agent.
  • [1.31] 🔥Human-operated data in Mobile-Eval is in preparation and will be open-sourced soon.
  • [1.30] Our paper is available at LINK.
  • [1.30] Our evaluation results on Mobile-Eval are available.
  • [1.30] The code and Mobile-Eval benchmark are coming soon!

📺Demo

Mobile-Agent.mp4

🔧Getting Started

❗Because GPT-4V hallucinates severely when perceiving non-English screenshots, we strongly recommend running Mobile-Agent on English-only systems and apps to ensure good performance.

Installation

git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt

Preparation for Connecting Mobile Device

  1. Download the Android Debug Bridge.
  2. Turn on the ADB debugging switch on your Android phone. You must enable the developer options first.
  3. Connect your phone to the computer with a data cable and select "Transfer files".
  4. Test your ADB environment as follows: /path/to/adb devices. If the connected devices are displayed, the preparation is complete.
  5. If you are using macOS or Linux, make sure adb is executable: sudo chmod +x /path/to/adb
  6. If you are using Windows, the path will be xx/xx/adb.exe
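
The device check in step 4 can also be scripted. The following is a minimal sketch, not part of Mobile-Agent itself: the helper names parse_adb_devices and connected_devices are made up for illustration, and the parser assumes the usual adb devices output of a "List of devices attached" header followed by one serial/state pair per line.

```python
import subprocess


def parse_adb_devices(output: str) -> list[tuple[str, str]]:
    """Parse the text printed by `adb devices` into (serial, state) pairs."""
    devices = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the header line, and "* daemon ..." notices.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices.append((parts[0], parts[1]))
    return devices


def connected_devices(adb_path: str) -> list[tuple[str, str]]:
    """Run `<adb_path> devices` and return the parsed device list."""
    result = subprocess.run(
        [adb_path, "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)


if __name__ == "__main__":
    # Sample text in the assumed format; replace with connected_devices("/path/to/adb").
    sample = "List of devices attached\nemulator-5554\tdevice\n\n"
    print(parse_adb_devices(sample))
```

A phone whose state is "device" is ready to use. A state of "unauthorized" usually means the debugging prompt on the phone has not been accepted yet.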

Preparation for Visual Perception Tools

  1. Download the icon detection model Grounding DINO.
  2. The text detection model will be automatically downloaded from modelscope after you run Mobile-Agent.

Run

python run.py \
--grounding_ckpt /path/to/GroundingDINO \
--adb_path /path/to/adb \
--api "your API_TOKEN" \
--instruction "your instruction"
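
If you prefer to launch Mobile-Agent from a script, the command above could be assembled as follows. This is only a sketch: it uses the four flags shown above and nothing else, the helper build_run_command is hypothetical, and MOBILE_AGENT_API_TOKEN is an environment variable name invented here so the token does not have to be hard-coded.

```python
import os
import subprocess
import sys


def build_run_command(grounding_ckpt, adb_path, api_token, instruction):
    """Assemble the argument list for run.py from the four documented flags."""
    return [
        sys.executable, "run.py",
        "--grounding_ckpt", grounding_ckpt,
        "--adb_path", adb_path,
        "--api", api_token,
        "--instruction", instruction,
    ]


if __name__ == "__main__":
    # MOBILE_AGENT_API_TOKEN is a hypothetical name chosen for this sketch.
    token = os.environ.get("MOBILE_AGENT_API_TOKEN", "")
    cmd = build_run_command(
        "/path/to/grounding_ckpt",  # placeholder: your downloaded checkpoint
        "/path/to/adb",             # placeholder: your adb binary
        token,
        "Navigate to a nearby gas station.",
    )
    if token:
        # Must be run from the MobileAgent directory so run.py is found.
        subprocess.run(cmd, check=True)
    else:
        print("Set MOBILE_AGENT_API_TOKEN before running.")
```

Passing the arguments as a list avoids shell quoting problems when the instruction contains quotes or spaces.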

📱Mobile-Eval

Mobile-Eval is a benchmark designed for evaluating the performance of mobile device agents. This benchmark includes 10 mainstream single-app scenarios and 1 multi-app scenario.

For each scenario, we have designed three instructions:

  • Instruction 1: a relatively simple, basic task
  • Instruction 2: additional requirements on top of the difficulty of Instruction 1
  • Instruction 3: a user demand with no explicit task indication

The detailed content of Mobile-Eval is as follows:

Application / Instruction

Alibaba.com
  1. Help me find caps in Alibaba.com.
  2. Help me find caps in Alibaba.com. If the "Add to cart" is available in the item information page, please add the item to my cart.
  3. I want to buy a cap. I've heard things are cheap on Alibaba.com. Maybe you can find it for me.

Amazon Music
  1. Search singer Jay Chou in Amazon Music.
  2. Search a music about "agent" in Amazon Music and play it.
  3. I want to listen to music to relax. Find an App to help me.

Chrome
  1. Search result for today's Lakers game.
  2. Search the information about Taylor Swift.
  3. I want to know the result for today's Lakers game. Find an App to help me.

Gmail
  1. Send an empty email to {address}.
  2. Send an email to {address} to tell my new work.
  3. I want to let my friend know my new work, and his address is {address}. Find an App to help me.

Google Maps
  1. Navigate to Hangzhou West Lake.
  2. Navigate to a nearby gas station.
  3. I want to go to Hangzhou West Lake, but I don't know the way. Find an App to help me.

Google Play
  1. Download WhatsApp in Play Store.
  2. Download Instagram in Play Store.
  3. I want WhatsApp on my phone. Find an App to help me.

Notes
  1. Create a new note in Notes.
  2. Create a new note in Notes and write "Hello, this is a note", then save it.
  3. I suddenly have something to record, so help me find an App and write down the following content: meeting at 3pm.

Settings
  1. Turn on the dark mode.
  2. Turn on the airplane mode.
  3. I want to see the real time internet speed at the battery level, please turn on this setting for me.

TikTok
  1. Swipe a video about pet cat in TikTok and click a "like" for this video.
  2. Swipe a video about pet cat in TikTok and comment "Ohhhh, so cute cat!".
  3. Swipe videos in TikTok. Click "like" for 3 pet cat videos.

YouTube
  1. Search for videos about Stephen Curry on YouTube.
  2. Search for videos about Stephen Curry on YouTube and open "Comments" to comment "Oh, chef, your basketball spirit has always inspired me".
  3. I need you to help me show my love for Stephen Curry on YouTube.

Multi-App
  1. Open the calendar and look at today's date, then go to Notes and create a new note to write "Today is {today's date}".
  2. Check the temperature in the next 5 days, and then create a new note in Notes and write a temperature analysis.
  3. Search the result for today's Lakers game, and then create a note in Notes to write a sport news for this result.
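
For scripted runs it can help to hold the benchmark as data. The sketch below shows one possible layout, assuming a plain dictionary that maps each application to its three instructions; the names MOBILE_EVAL_EXCERPT and iter_tasks are illustrative and not part of the released code, and only two of the 11 scenarios are included.

```python
# A two-scenario excerpt of Mobile-Eval, copied from the listing above.
MOBILE_EVAL_EXCERPT = {
    "Google Maps": [
        "Navigate to Hangzhou West Lake.",
        "Navigate to a nearby gas station.",
        "I want to go to Hangzhou West Lake, but I don't know the way. "
        "Find an App to help me.",
    ],
    "Notes": [
        "Create a new note in Notes.",
        'Create a new note in Notes and write "Hello, this is a note", then save it.',
        "I suddenly have something to record, so help me find an App and "
        "write down the following content: meeting at 3pm.",
    ],
}


def iter_tasks(benchmark):
    """Yield (app, instruction_number, instruction) with 1-based numbering."""
    for app, instructions in benchmark.items():
        for number, text in enumerate(instructions, start=1):
            yield app, number, text


if __name__ == "__main__":
    for app, number, text in iter_tasks(MOBILE_EVAL_EXCERPT):
        print(f"{app} / {number}: {text}")
```

The 1-based numbering matches the instruction numbers used in the listing, so each task can be identified by its application name and number.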

📝Evaluation results

We evaluated Mobile-Agent on Mobile-Eval. The evaluation results are available at LINK.

  • We have stored the evaluation results for the 10 apps and the multi-app scenario in folders named after each app.
  • The numbers within each app's folder represent the results for different types of instruction within that app.
  • For example, if you want to view the results of Mobile-Agent for the second instruction in Google Maps, you should go to the following path: results/Google Maps/2.
  • If the last action of Mobile-Agent is not "stop", it indicates that Mobile-Agent did not complete the corresponding instruction. During the evaluation, we manually terminated these cases where completion was not possible.
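
The folder convention and the "stop" rule above can be expressed in a few lines. This is a sketch under stated assumptions: the README does not describe how actions are stored inside each folder, so the list of action strings passed to is_completed is an assumed in-memory form, and both helper names as well as the action name "tap" are hypothetical.

```python
from pathlib import PurePosixPath


def result_dir(app: str, instruction_number: int, root: str = "results") -> str:
    """Build the results path for one instruction, e.g. results/Google Maps/2."""
    return str(PurePosixPath(root) / app / str(instruction_number))


def is_completed(actions: list[str]) -> bool:
    """A run counts as completed only if its final action is "stop"."""
    # Assumption: actions are available as plain strings, oldest first.
    return bool(actions) and actions[-1].strip().lower() == "stop"


if __name__ == "__main__":
    print(result_dir("Google Maps", 2))
    print(is_completed(["tap", "stop"]))  # "tap" is a made-up action name
    print(is_completed(["tap"]))
```

An empty action list is treated as not completed, which matches the cases that were terminated manually.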

📄To-do List

  • Development of Mobile-Agent app on Android platform.
  • Adaptation to other mobile device platforms.

Related Projects

About


License:MIT License


Languages

Python 80.5%, Cuda 17.5%, C++ 2.0%