EffiBench: Benchmarking the Efficiency of Automatically Generated Code

📍 Abstract

Code generation models have increasingly become integral to aiding software development, offering assistance in tasks such as code completion, debugging, and code translation. Although current research has thoroughly examined the correctness of code produced by code generation models, a vital aspect — the efficiency of the generated code — has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution. With EffiBench, we empirically examine the capability of 21 Large Language Models (13 open-sourced and 8 closed-sourced) in generating efficient code. The results demonstrate that GPT-4-turbo generates the most efficient code, significantly outperforming Palm-2-chat-bison, Claude-instant-1, Gemini-pro, GPT-4, and GPT-3.5. Nevertheless, its code efficiency is still worse than the efficiency of human-written canonical solutions. In particular, the average / worst execution time of GPT-4-turbo generated code is 1.69 / 45.49 times that of the canonical solutions.
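The "average / worst" figures quoted above can be read as per-problem ratios of generated-code execution time to canonical-solution execution time. The following is a minimal sketch of that arithmetic under this assumed reading; it is not the benchmark's own metric code, the function name `execution_time_ratios` is hypothetical, and the timings are made-up illustrative values rather than benchmark results.

```python
from statistics import mean


def execution_time_ratios(generated_ms, canonical_ms):
    """Map each shared problem id to generated time / canonical time."""
    ratios = {}
    for problem_id, canonical_time in canonical_ms.items():
        if problem_id not in generated_ms:
            continue  # no generated solution was timed for this problem
        if canonical_time <= 0:
            raise ValueError(f"non-positive canonical time for {problem_id}")
        ratios[problem_id] = generated_ms[problem_id] / canonical_time
    return ratios


# Hypothetical timings in milliseconds (illustrative, not benchmark results).
generated_ms = {"p1": 300, "p2": 100, "p3": 800}
canonical_ms = {"p1": 200, "p2": 100, "p3": 200}

ratios = execution_time_ratios(generated_ms, canonical_ms)
average_ratio = mean(ratios.values())  # mean of per-problem ratios
worst_ratio = max(ratios.values())     # slowest problem relative to canonical
print(f"average: {average_ratio:.2f}x, worst: {worst_ratio:.2f}x")
# prints "average: 2.17x, worst: 4.00x"
```

A ratio of 1.0 means the generated code matches the canonical solution; values above 1.0 mean it is slower. Averaging per-problem ratios keeps one slow problem from being hidden by many fast ones, which is why the worst case is worth reporting alongside the mean.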

🚀 Updates

02/21/2024: Code released

04/15/2024: EffiBench released on HuggingFace

Installation

git clone git@github.com:huangd1999/EffiBench.git
cd EffiBench
pip install -r requirements.txt

Evaluation on EffiBench

Our evaluation consists of two steps: generation and metrics calculation.
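The two steps can also be chained from a small driver. The sketch below uses only the script names and the `--model` flag that appear in the commands in the following sections; the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes the metrics scripts accept the same `--model` value that was used for generation, which this README shows only for `gpt-3.5-turbo-0301`.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(model, closed_source):
    """Return the generation command followed by the two metrics commands."""
    generation_script = (
        "closed_source_model_completion.py"
        if closed_source
        else "open_source_model_completion.py"
    )
    scripts = [
        generation_script,
        "code_efficiency_calculator.py",
        "report_overhead.py",
    ]
    # Assumption: every script takes the same --model value.
    return [[sys.executable, script, "--model", model] for script in scripts]


def run_pipeline(model, closed_source, src_dir="src"):
    """Run every step inside src_dir, stopping at the first failure."""
    src = Path(src_dir)
    (src / "results").mkdir(exist_ok=True)  # mirrors `mkdir results`
    for command in build_commands(model, closed_source):
        subprocess.run(command, cwd=src, check=True)
```

Calling `run_pipeline("gpt-3.5-turbo-0301", closed_source=True)` from the repository root would run generation and then both metrics scripts in order. `check=True` makes a failed generation step abort the run before metrics are computed on missing output.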

Generation

Open-sourced Models

For open-sourced models such as StarCoder and DeepSeek-Coder, we provide batch inference scripts for fast generation on EffiBench:

cd ./src
mkdir results
python open_source_model_completion.py \
  --model codellama/CodeLlama-70b-Instruct-hf

OpenAI Models

OpenAI models are accessible through an API. You may use the following script:

cd ./src
mkdir results
python closed_source_model_completion.py \
  --model gpt-3.5-turbo-0301

Metrics Calculation

After obtaining the generations, calculate the final metrics:

cd ./src
python code_efficiency_calculator.py \
  --model gpt-3.5-turbo-0301
python report_overhead.py \
  --model gpt-3.5-turbo-0301
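For intuition about what efficiency metrics of this kind capture, the sketch below times a function and records its peak Python-level memory with the standard library. This is an illustration only and makes no claim about how `code_efficiency_calculator.py` measures anything; the `measure` helper and both example functions are hypothetical.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args) once; return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    try:
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # also clears traces, so each call starts fresh
    return result, elapsed, peak


def sum_squares_list(n):
    return sum([i * i for i in range(n)])  # materializes the whole list


def sum_squares_generator(n):
    return sum(i * i for i in range(n))  # streams one value at a time


for candidate in (sum_squares_list, sum_squares_generator):
    result, elapsed, peak = measure(candidate, 100_000)
    print(f"{candidate.__name__}: {elapsed:.4f}s, peak {peak / 1024:.1f} KiB")
```

Both functions return the same value, yet the list version holds every intermediate element in memory at once. Note that `tracemalloc` slows execution noticeably, so a careful setup would time and trace memory in separate runs and repeat each measurement several times.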

Citation

@article{huang2024effibench,
  title={EffiBench: Benchmarking the Efficiency of Automatically Generated Code},
  author={Huang, Dong and Zhang, Jie M and Qing, Yuhao and Cui, Heming},
  journal={arXiv preprint arXiv:2402.02037},
  year={2024}
}

Questions

Please feel free to email us (email addresses are in the paper). You may also submit an issue in this repo.

License

This project is licensed under the Apache-2.0 License.
