secure-software-engineering / TypeEvalPy

A Micro-benchmarking Framework for Python Type Inference Tools


📌 Features:

  • 📜 Contains 154 code snippets to test and benchmark (an illustrative example follows this list).
  • 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
  • 📂 Organized into 18 distinct categories targeting various Python features.
  • 🚒 Seamlessly manages the execution of containerized tools.
  • 🔄 Efficiently transforms inferred types into a standardized format.
  • 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
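To make these features concrete, here is an illustrative, hypothetical snippet in the style of a micro-benchmark entry (it is not copied from the benchmark itself). The comments show the ground truth a type inference tool is expected to produce, covering the three annotation kinds scored in the leaderboards below: function return types, function parameter types, and local variable types.

    # Illustrative example only; the real benchmark snippets and their ground-truth
    # annotations live in the repository's micro-benchmark categories.
    def concat(greeting, name):         # parameter types: greeting -> str, name -> str
        message = greeting + name       # local variable type: message -> str
        return message                  # function return type: str

    result = concat("hello ", "world")  # local variable type: result -> str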

🛠️ Supported Tools

| Supported ✅ | In-progress 🔧 | Planned 💡 |
|--------------|----------------|------------|
| HeaderGen    | Intellij PSI   | MonkeyType |
| Jedi         | Pyre           | Pyannotate |
| Pyright      | PySonar2       |            |
| HiTyper      | Pytype         |            |
| Scalpel      | TypeT5         |            |
| Type4Py      |                |            |
| GPT-4        |                |            |
| Ollama       |                |            |


🏆 TypeEvalPy Leaderboard

Below is a comparison of exact matches across the different tools, together with top_n predictions for ML-based tools (for these, a prediction counts as a match if the correct type appears among the tool's top n candidates).

| Rank | 🛠️ Tool | Top-n | Function Return Type | Function Parameter Type | Local Variable Type | Total |
|------|---------|-------|----------------------|-------------------------|---------------------|-------|
| 1 | HeaderGen | 1 | 186 | 56 | 322 | 564 |
| 2 | Jedi | 1 | 122 | 0 | 293 | 415 |
| 3 | Pyright | 1 | 100 | 8 | 297 | 405 |
| 4 | HiTyper | 1 | 163 | 27 | 179 | 369 |
|   |         | 3 | 173 | 37 | 225 | 435 |
|   |         | 5 | 175 | 37 | 229 | 441 |
| 5 | HiTyper (static) | 1 | 141 | 7 | 102 | 250 |
| 6 | Scalpel | 1 | 155 | 32 | 6 | 193 |
| 7 | Type4Py | 1 | 39 | 19 | 99 | 157 |
|   |         | 3 | 103 | 31 | 167 | 301 |
|   |         | 5 | 109 | 31 | 174 | 314 |

(Auto-generated based on the analysis run on 20 Oct 2023)


🏆🤖 TypeEvalPy LLM Leaderboard

Below is a comparison showcasing exact matches for LLMs.

| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
|------|---------|----------------------|-------------------------|---------------------|-------|
| 1 | GPT-4 | 225 | 85 | 465 | 775 |
| 2 | Finetuned: GPT 3.5 | 209 | 85 | 436 | 730 |
| 3 | codellama:13b-instruct | 199 | 75 | 425 | 699 |
| 4 | GPT 3.5 Turbo | 188 | 73 | 429 | 690 |
| 5 | codellama:34b-instruct | 190 | 52 | 425 | 667 |
| 6 | phind-codellama:34b-v2 | 182 | 60 | 399 | 641 |
| 7 | codellama:7b-instruct | 171 | 72 | 384 | 627 |
| 8 | dolphin-mistral | 184 | 76 | 356 | 616 |
| 9 | codebooga | 186 | 56 | 354 | 596 |
| 10 | llama2:70b | 168 | 55 | 342 | 565 |
| 11 | HeaderGen | 186 | 56 | 321 | 563 |
| 12 | wizardcoder:13b-python | 170 | 74 | 317 | 561 |
| 13 | llama2:13b | 153 | 40 | 283 | 476 |
| 14 | mistral:instruct | 155 | 45 | 250 | 450 |
| 15 | mistral:v0.2 | 155 | 45 | 248 | 448 |
| 16 | vicuna:13b | 153 | 35 | 260 | 448 |
| 17 | vicuna:33b | 133 | 29 | 267 | 429 |
| 18 | Jedi | 122 | 0 | 293 | 415 |
| 19 | Pyright | 100 | 8 | 297 | 405 |
| 19 | wizardcoder:7b-python | 103 | 48 | 254 | 405 |
| 20 | llama2:7b | 140 | 34 | 216 | 390 |
| 21 | HiTyper | 163 | 27 | 179 | 369 |
| 22 | wizardcoder:34b-python | 140 | 43 | 178 | 361 |
| 23 | orca2:7b | 117 | 27 | 184 | 328 |
| 24 | vicuna:7b | 131 | 17 | 172 | 320 |
| 25 | orca2:13b | 113 | 19 | 166 | 298 |
| 26 | Scalpel | 155 | 32 | 6 | 193 |
| 27 | Type4Py | 39 | 19 | 99 | 157 |
| 28 | tinyllama | 3 | 0 | 23 | 26 |
| 29 | phind-codellama:34b-python | 5 | 0 | 15 | 20 |
| 30 | codellama:13b-python | 0 | 0 | 0 | 0 |
| 31 | codellama:34b-python | 0 | 0 | 0 | 0 |
| 32 | codellama:7b-python | 0 | 0 | 0 | 0 |

(Auto-generated based on the analysis run on 14 Jan 2024)


🐳 Running with Docker

1️⃣ Clone the repo

git clone https://github.com/secure-software-engineering/TypeEvalPy.git

2️⃣ Build Docker image

docker build -t typeevalpy .

3️⃣ Run TypeEvalPy

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy

🕒 The first run takes about 30 minutes, since the tool Docker containers are built.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder has a timestamp, allowing you to easily track and compare different runs.

Correlation of CSV Files Generated to Tables in the ICSE Paper

Here is how the auto-generated CSV tables relate to the paper's tables:

  • Table 1 in the paper is derived from three auto-generated CSV tables:
    • paper_table_1.csv - Exact matches by type category.
    • paper_table_2.csv - Exact matches for the 18 micro-benchmark categories.
    • paper_table_3.csv - Sound and Complete values for tools.
  • Table 2 in the paper is based on the following CSV table:
    • paper_table_5.csv - Exact matches with top_n values for machine-learning tools.

Additionally, there are CSV tables that are not included in the paper:

  • paper_table_4.csv - Sound and Complete values for the 18 micro-benchmark categories.
  • paper_table_6.csv - Sensitivity analysis.
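If you want to inspect these auto-generated tables programmatically, a minimal Python sketch follows. It assumes pandas is available; the placeholder results path and the "Total" column name are illustrative assumptions rather than a documented schema, so check the actual header row of the CSV you load.

    # Minimal sketch: load one of the auto-generated CSV tables with pandas.
    import pandas as pd

    results_dir = "results/<timestamp>"  # placeholder: substitute an actual timestamped folder
    df = pd.read_csv(f"{results_dir}/paper_table_1.csv")
    print(df.head())  # inspect the header row and the first few entries

    if "Total" in df.columns:
        # Rank tools by their total exact matches, if such a column exists.
        print(df.sort_values("Total", ascending=False))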

🔧 Optionally, run analysis on specific tools:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel

🛠️ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:

  • Create Configuration File: Copy the config_template.yaml from the src directory and rename it to config.yaml.

In config.yaml, configure the following (a minimal sketch follows this list):

  • openai_key: your key for accessing OpenAI's models.
  • ollama_url: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. Get started with Ollama here.
  • prompt_id: set this to questions_based_2 for optimal performance, based on our tests.
  • ollama_models: select a list of model tags from the Ollama library. For smoother operation, make sure each model is pre-downloaded with the ollama pull command.
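As a reference, here is a minimal config.yaml sketch. The key names and the prompt_id value come from the list above; everything else (the API key placeholder, the URL, the model tags, and the list layout of ollama_models) is an illustrative assumption, so follow config_template.yaml for the authoritative structure.

    # config.yaml (illustrative sketch; replace the placeholder values with your own)
    openai_key: sk-...                  # your OpenAI API key
    ollama_url: http://localhost:11434  # URL of your Ollama instance (11434 is Ollama's default port)
    prompt_id: questions_based_2        # recommended prompt, per the list above
    ollama_models:                      # tags from the Ollama library, pre-pulled with ollama pull
      - codellama:13b-instruct
      - mistral:instruct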

With the config.yaml configured, run the following command:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama

Running From Source...

1. 📥 Installation

  1. Clone the repo

    git clone https://github.com/secure-software-engineering/TypeEvalPy.git
  2. Install Dependencies and Set Up Virtual Environment

    Run the following commands to create and activate a virtual environment and install the dependencies:

    python3 -m venv .env
    source .env/bin/activate
    pip install -r requirements.txt

2. 🚀 Usage: Running the Analysis

  1. Navigate to the src Directory

    cd src
  2. Execute the Analyzer

    Run the following command to start the benchmarking process on all tools:

    python main_runner.py

    or

    Run analysis on specific tools

    python main_runner.py --runners headergen scalpel
    

🀝 Contributing

Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md


⭐️ Show Your Support

Give a ⭐️ if this project helped you!
