Code generation models have increasingly become integral to aiding software development, offering assistance in tasks such as code completion, debugging, and code translation. Although current research has thoroughly examined the correctness of code produced by code generation models, a vital aspect, the efficiency of the generated code, has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution. With EffiBench, we empirically examine the capability of 21 Large Language Models (13 open-sourced and 8 closed-sourced) in generating efficient code. The results demonstrate that GPT-4-turbo generates the most efficient code, significantly outperforming Palm-2-chat-bison, Claude-instant-1, Gemini-pro, GPT-4, and GPT-3.5. Nevertheless, its code efficiency is still worse than the efficiency of human-written canonical solutions. In particular, the average / worst execution time of GPT-4-turbo generated code is 1.69 / 45.49 times that of the canonical solutions.
02/21/2024: Code released
04/15/2024: HuggingFace: EffiBench
git clone git@github.com:huangd1999/EffiBench.git
cd EffiBench
pip install -r requirements.txt
Our evaluation consists of two steps: generation and metrics calculation.
For open-sourced models like StarCoder, DeepSeek-Coder, etc., we provide batch inference scripts for fast inference on EffiBench.
cd ./src
mkdir results
python open_source_model_completion.py \
--model codellama/CodeLlama-70b-Instruct-hf
OpenAI models are accessible through an API. You may use the following script:
cd ./src
mkdir results
python closed_source_model_completion.py \
--model gpt-3.5-turbo-0301
After obtaining the generations, calculate the final metrics:
cd ./src
python code_efficiency_calculator.py \
--model gpt-3.5-turbo-0301
python report_overhead.py \
--model gpt-3.5-turbo-0301
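The abstract reports execution time as a multiple of the canonical solution's time (average / worst). The following is a minimal sketch of how such a ratio could be computed from per-problem timings. It is not the code in `code_efficiency_calculator.py`; the function name `normalized_execution_time`, the sample timings, and the choice to average per-problem ratios are all assumptions made for illustration.

```python
from statistics import mean


def normalized_execution_time(generated, canonical):
    """Return (average, worst) ratio of generated to canonical runtimes.

    Hypothetical helper: `generated` and `canonical` are equal-length
    sequences of per-problem execution times in seconds.
    """
    if len(generated) != len(canonical):
        raise ValueError("timing lists must have the same length")
    if not generated:
        raise ValueError("at least one problem is required")

    ratios = []
    for gen_time, canon_time in zip(generated, canonical):
        if canon_time <= 0:
            raise ValueError("canonical time must be positive")
        # A ratio above 1.0 means the generated code is slower
        # than the canonical solution for that problem.
        ratios.append(gen_time / canon_time)

    return mean(ratios), max(ratios)


if __name__ == "__main__":
    # Made-up timings for three problems (assumption, not benchmark data).
    generated_times = [0.75, 0.5, 2.0]
    canonical_times = [0.5, 0.5, 0.5]
    avg_ratio, worst_ratio = normalized_execution_time(
        generated_times, canonical_times
    )
    print(f"average: {avg_ratio:.2f}x, worst: {worst_ratio:.2f}x")
```

A ratio of 1.0 would mean the generated code matches the canonical solution; refer to the paper for the exact metric definitions used by EffiBench.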
@article{huang2024effibench,
title={EffiBench: Benchmarking the Efficiency of Automatically Generated Code},
author={Huang, Dong and Zhang, Jie M and Qing, Yuhao and Cui, Heming},
journal={arXiv preprint arXiv:2402.02037},
year={2024}
}
Please feel free to email us (email addresses are in the paper). You may also submit an issue in this repo.
This project is licensed under the Apache-2.0 License.