agbenchmark
A repo for benchmarking the performance of agents far and wide, regardless of how they are set up and how they work.
MVP: a function calls the API, the API returns a presigned URL, the folder is uploaded, the write-file challenge is measured, and a score is given.
Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x
Contributing
- Make sure you have poetry installed: `pip install poetry`.
- Then run `poetry install` for dependencies.
- To add a requirement: `poetry add <requirement>`.
- To run in the venv: `poetry run python script.py`.
- Feel free to merge with `main` at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.
- If you push at any point and break things (it'll happen to everyone), fix it ASAP. Step 1 is to revert `main` to the last working commit.
- Let people know what your beautiful code does; document everything well.
- Share your progress :)
Api
FastAPI with REST (clients use the `requests` library)
POST hostname:8080/challenges
{
  "test_name": "",
  "challenge": "memory"  // optional
}
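A minimal sketch of what the FastAPI side of this endpoint could look like; the `ChallengeRequest` model, field defaults, and handler body are assumptions, not the final implementation:

```python
# Hypothetical sketch of the POST /challenges endpoint; names are illustrative.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChallengeRequest(BaseModel):
    test_name: str
    challenge: Optional[str] = None  # e.g. "memory"; optional category filter

@app.post("/challenges")
def create_challenge(request: ChallengeRequest) -> dict:
    # Placeholder: look up the challenge, provision a workspace, and hand back
    # a presigned URL the agent can upload its artifacts to.
    return {"test_name": request.test_name, "presigned_url": "<generated-url>"}
```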
Auth:
- get preSignedUrl from the API
- POST to the preSignedUrl:
{
  "artifacts": [{}]
}
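A rough client-side sketch of that flow using `requests`; the response field names and payload shapes are assumptions:

```python
# Hypothetical client flow; endpoint paths and field names are assumptions.
import requests

HOST = "http://hostname:8080"

# 1. Ask the API to run a challenge; it responds with a presigned URL.
resp = requests.post(
    f"{HOST}/challenges",
    json={"test_name": "write_file", "challenge": "memory"},
)
presigned_url = resp.json()["presigned_url"]

# 2. POST the agent's artifacts (e.g. the uploaded workspace folder) to the presigned URL.
requests.post(presigned_url, json={"artifacts": [{}]})
```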
Workspace
Kubernetes with AWS S3 or GCP
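As a sketch, the presigned URL generation that `workspace_manager.py` needs could lean on boto3 for the S3 backend; the bucket and key below are placeholders, and GCP would need an equivalent helper:

```python
# Sketch of presigned-URL generation for the S3 backend; bucket/key are placeholders.
import boto3

s3 = boto3.client("s3")

def generate_presigned_upload_url(bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a URL an agent can use to upload an artifact directly to S3."""
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```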
Challenges
Dataset
Manually created challenges, existing challenges within Auto-GPT, and https://osu-nlp-group.github.io/Mind2Web/
Simple challenge creation through a DSL (domain-specific language):
Challenge TicTacToeCoding
  Description "The agent should implement a basic tic-tac-toe game in Python."
  Artifacts {
    Code "tictactoe.py"
  }
  Tasks {
    Code "Write a function to initialize the game board."
    Code "Write a function to handle a player's turn."
    Code "Write a function to check for a winning move."
    Test "Write tests for board initialization, turn handling, and win detection."
    Command "Run the test suite to ensure everything is working as expected."
  }
  SuccessCriteria {
    Correctness "The game should correctly alternate between two players."
    Correctness "The game should correctly identify a winning move."
    Efficiency "The game should not use unnecessary computational resources."
    Design "The solution should follow good Python practices."
  }
EndChallenge
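One way the DSL could map onto the planned `Challenge` class (see `Challenge.py` in the repo layout below); the field names and the `evaluate` hook here are assumptions, not a settled design:

```python
# Sketch of how the DSL above could map onto a Python Challenge class.
# Field names and the evaluate() hook are assumptions.
from dataclasses import dataclass, field

@dataclass
class Challenge:
    name: str
    description: str
    artifacts: list[str] = field(default_factory=list)
    tasks: list[tuple[str, str]] = field(default_factory=list)             # (kind, instruction)
    success_criteria: list[tuple[str, str]] = field(default_factory=list)  # (kind, criterion)

    def evaluate(self, workspace_path: str) -> float:
        """Score the agent's output in the workspace; overridden per category."""
        raise NotImplementedError

# Categories could subclass it, e.g. Adaptability(Challenge):
class Code(Challenge):
    def evaluate(self, workspace_path: str) -> float:
        # Placeholder: run the challenge's tests/validators against the produced code.
        return 0.0
```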
Validators
Designed to handle specific types of output (e.g., text, code, structured data)
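A possible shape for validators, assuming one class per output type; the interface is illustrative only:

```python
# Illustrative validator interface; class and method names are assumptions.
from abc import ABC, abstractmethod

class Validator(ABC):
    @abstractmethod
    def validate(self, output: str) -> float:
        """Return a score in [0, 1] for the given agent output."""

class CodeValidator(Validator):
    def validate(self, output: str) -> float:
        # Placeholder: run the produced code against the challenge's tests.
        return 0.0

class TextValidator(Validator):
    def validate(self, output: str) -> float:
        # Placeholder: compare against a reference text with a similarity metric.
        return 0.0
```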
Logging
Log the different requests coming in (write file, change file, etc.). Maybe a DB in the future for metrics, logs, etc.
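For example, request logging could start as simple FastAPI middleware before a DB comes into the picture (a sketch, not a committed design):

```python
# Sketch: log incoming requests via FastAPI middleware; a DB could replace this later.
import logging

from fastapi import FastAPI, Request

logger = logging.getLogger("agbenchmark")
app = FastAPI()

@app.middleware("http")
async def log_requests(request: Request, call_next):
    response = await call_next(request)
    logger.info("%s %s -> %s", request.method, request.url.path, response.status_code)
    return response
```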
Written Challenges
For code and writing challenges, we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore.
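For instance, scoring an agent's written answer against a reference with BERTScore could look like this (assumes the `bert-score` package; the texts are placeholders):

```python
# Sketch: compare an agent's written answer to a reference with BERTScore.
# Requires the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["The agent's generated answer goes here."]
references = ["The hand-written reference answer goes here."]

precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```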
Repo
|-- agbenchmark/ **main project directory**
| |-- __init__.py
| |-- server/
| | |-- __init__.py
| | |-- api.py **opens server on host and exposes urls**
| | |-- utils.py
| |-- benchmark/
| | |-- __init__.py
| | |-- benchmark.py **combining scores, metrics, final evaluation**
| | |-- run.py **entry point. sets everything up**
| | |-- challenges/ **challenges across different metrics**
| | | |-- __init__.py
| | | |-- Challenge.py **easy challenge creation through Challenge class. potentially how DSL is defined. may need to inherit challenge class like Adaptability(Challenge)**
| | | |-- utils.py
| | | |-- adaptability.py
| | | |-- basic_abilities.py
| | | |-- code.py
| | | |-- memory.py
| | | |-- retrieval.py
| | | |-- web_navigation.py
| | | |-- writing.py
| |-- workspace/ **workspace related func**
| | |-- __init__.py
| | |-- workspace_manager.py **creation, deletion, preSignedUrl generation**
| | |-- cloud_services/
| | | |-- __init__.py
| | | |-- aws.py **not finalized, but write, read, and del files**
|-- tests/ **test func of agbenchmark**
| |-- __init__.py
| |-- test_api.py
| |-- test_benchmark.py
| |-- test_workspace_manager.py
Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility