CodeScore is a novel code evaluation metric (CEM) designed to assess the functional correctness of generated code using a large language model (LLM)-based approach. It overcomes the limitations of traditional match-based CEMs by focusing on the functional equivalence of code and supporting multiple input formats.
- Functional Correctness: Evaluates code based on functional correctness rather than surface-level similarities.
- Versatile Input Formats: Supports Ref-only, NL-only, and Ref&NL input formats.
- Unified Code Evaluation: Employs the UniCE framework for consistent and accurate code evaluation.
Clone the repository and install the required dependencies:
git clone https://github.com/Dingjz/CodeScore.git
cd CodeScore
pip install -r requirements.txt
We format the input sequences as:
[CLS] generated_code [SEP] reference_code [SEP] natural_language [SEP]
where [CLS] and [SEP] are special tokens in the vocabulary, and generated_code, reference_code, and natural_language are placeholders for the generated code, reference code, and natural language description, respectively.
For the expected layout of training and test files, refer to our open-source datasets.
The outputs include two keys:
- scores: Represents the CodeScore value.
- passeds: Indicates whether the generated code compiled successfully.
To perform inference using CodeScore, run:
python inference.py --cfg configs/models/unified_metric.yaml --ckpt_path your/model/path --test_file your/testfile/path --out_file save/path
To train the CodeScore model, run:
python train.py --cfg configs/models/unified_metric.yaml
- --cfg: Path to the configuration file.
- --ckpt_path: Path to the model checkpoint.
- --test_file: Path to the test file.
- --out_file: Path to save the results.
Model checkpoints are available from CodeScore on Hugging Face.
CodeScore employs a unified code evaluation learning framework called UniCE, which uses LLMs to learn code execution. The functional correctness of the generated code is measured using PassRatio and Executability.
PassRatio is the proportion of test cases that the generated code passes:
PassRatio = (Number of passed test cases) / (Total number of test cases)
Executability indicates whether the generated code can be successfully executed without errors.
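To make the two quantities concrete, here is a minimal sketch that computes them by actually running a candidate function against test cases. This illustrates the definitions only and is not how CodeScore itself produces its scores, since CodeScore uses a learned model in place of execution. The function name `pass_ratio_and_executability` is hypothetical, and treating the code as non-executable when any test case raises an exception is an assumption.

```python
from typing import Any, Callable, Sequence, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (positional arguments, expected output)


def pass_ratio_and_executability(
    func: Callable[..., Any],
    test_cases: Sequence[TestCase],
) -> Tuple[float, bool]:
    """Return (PassRatio, Executability) for func on the given test cases."""
    passed = 0
    executable = True
    for args, expected in test_cases:
        try:
            result = func(*args)
        except Exception:
            # Assumption: any runtime error marks the code as not executable.
            executable = False
            continue
        if result == expected:
            passed += 1

    # PassRatio = passed test cases / total test cases (0.0 if there are none).
    ratio = passed / len(test_cases) if test_cases else 0.0
    return ratio, executable
```

A function that passes one of two test cases without raising yields a PassRatio of 0.5 with Executability true, while a function that raises on every input yields 0.0 with Executability false.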
CodeScore has demonstrated superior performance over traditional match-based and LLM-based evaluation metrics, achieving state-of-the-art results in multiple code evaluation tasks.
Contributions are welcome! Please fork the repository, make your changes, and submit a pull request.
This project is licensed under the MIT License.
For any questions or issues, please contact dingjz@stu.pku.edu.cn.
If you use CodeScore in your research, please cite the following paper:
@article{dong2023codescore,
title={CodeScore: Evaluating Code Generation by Learning Code Execution},
author={Yihong Dong and Jiazheng Ding and Xue Jiang and Ge Li and Zhuo Li and Zhi Jin},
journal={arXiv preprint arXiv:2301.09043},
year={2023}
}