ScarletPan / Auto-GPT-Benchmarks

A repo built for the purpose of benchmarking the performance of agents, regardless of how they are set up and how they work.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

agbenchmark

A repo built for the purpose of benchmarking the performance of agents far and wide, regardless of how they are set up and how they work

MVP: function calls api, api returns presigned url, folder is uploaded, write file challenge is measured, score is given

Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x

Contributing

  • Make sure you have poetry installed - pip install poetry.

  • Then poetry install for dependencies

  • To add requirements poetry add requirement.

  • To run in venv poetry run python script.py

Feel free to merge with main at will (but also to ask for review) - if you can't send msg in R&D chat for access.

If you push at any point and break things - it'll happen to everyone - fix it asap. Step 1 is to revert main to last working commit

Let people know what beautiful code you write does, document everything well

Share your progress :)

Api

FastAPI with REST, import requests

POST hostname:8080/challenges
{
   "test_name": ""
   "challenge": "memory" - optional
}

Auth:

get preSignedUrl from API

POST preSignedUrl
{
   "artifacts": [{}]
}

Workspace

Kubernetes with AWS3 or GCP

Challenges

Dataset

Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.github.io/Mind2Web/

Simple challenge creation through a DSL (domain specific language)

Challenge TicTacToeCoding
    Description "The agent should implement a basic tic-tac-toe game in Python."
    Artifacts {
        Code "tictactoe.py"
    }
    Tasks {
        Code "Write a function to initialize the game board."
        Code "Write a function to handle a player's turn."
        Code "Write a function to check for a winning move."
        Test "Write tests for the blog post model, serializer, and view."
        Command "Run Django's test suite to ensure everything is working as expected."
    }
    SuccessCriteria {
        Correctness "The game should correctly alternate between two players."
        Correctness "The game should correctly identify a winning move."
        Efficiency "The game should not use unnecessary computational resources."
        Design "The solution should follow good practices for Django and Django Rest Framework."
    }
EndChallenge

Validators

Designed to handle specific types of output (e.g., text, code, structured data)

Logging

Log different requests coming in - write file, change file, etc. Maybe a db in the future for metrics, logs, etc

Written Challenges

For code, writing we can create a reference text and use metrics like METEOR, BERTScore, BARTScore

Repo

|-- agbenchmark/ **main project directory**
| |-- **init**.py
| |-- server/
| | |-- **init**.py
| | |-- api.py **opens server on host and exposes urls**
| | |-- utils.py
| |-- benchmark/
| | |-- **init**.py
| | |-- benchmark.py **combining scores, metrics, final evaluation**
| | |-- run.py **entry point. sets everything up**
| | |-- challenges/ **challenges across different metrics**
| | | |-- **init**.py
| | | |-- Challenge.py **easy challenge creation through Challenge class. potentially how DSL is defined. may need to inherit challenge class like Adaptability(Challenge)**
| | | |-- utils.py
| | | |-- adaptability.py
| | | |-- basic_abilities.py
| | | |-- code.py
| | | |-- memory.py
| | | |-- retrieval.py
| | | |-- web_navigation.py
| | | |-- writing.py
| |-- workspace/ **workspace related func**
| | |-- **init**.py
| | |-- workspace_manager.py **creation, deletion, preSignedUrl generation**
| | |-- cloud_services/
| | | |-- **init**.py
| | | |-- aws.py **not finalized, but write, read, and del files**
|-- tests/ **test func of agbenchmark**
| |-- **init**.py
| |-- test_api.py
| |-- test_benchmark.py
| |-- test_workspace_manager.py

Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility

About

A repo built for the purpose of benchmarking the performance of agents, regardless of how they are set up and how they work.

License:MIT License


Languages

Language:Python 100.0%