CodeScore: Evaluating Code Generation by Learning Code Execution

CodeScore is a novel code evaluation metric (CEM) designed to assess the functional correctness of generated code using a large language model (LLM)-based approach. It overcomes the limitations of traditional match-based CEMs by focusing on the functional equivalence of code and supporting multiple input formats.

Features

Functional Correctness: Evaluates code based on functional correctness rather than surface-level similarities.
Versatile Input Formats: Supports Ref-only, NL-only, and Ref&NL input formats.
Unified Code Evaluation: Employs the UniCE framework for consistent and accurate code evaluation.

Installation

Clone the repository and install the required dependencies:

git clone https://github.com/Dingjz/CodeScore.git
cd CodeScore
pip install -r requirements.txt

Usage

Input and Output

Input Formats

We format the input sequences as:

[CLS] generated_code [SEP] reference_code [SEP] natural_language [SEP]

where [CLS] and [SEP] are special tokens in the vocabulary. The generated_code, reference_code, and natural_language are placeholders for the generated code, reference code, and natural language description, respectively.
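As a rough illustration of the layout above, the following sketch assembles the sequence as a plain string. The function name build_input is hypothetical, and the handling of Ref-only and NL-only inputs (leaving the missing field as an empty segment) is an assumption for illustration, not the repository's actual tokenization logic.

```python
from typing import Optional

CLS, SEP = "[CLS]", "[SEP]"


def build_input(generated_code: str,
                reference_code: Optional[str] = None,
                natural_language: Optional[str] = None) -> str:
    """Assemble '[CLS] gen [SEP] ref [SEP] nl [SEP]' as a plain string.

    Assumption: a missing reference or description is kept as an empty
    segment; the real model may treat Ref-only / NL-only differently.
    """
    if reference_code is None and natural_language is None:
        raise ValueError("provide a reference, a description, or both")
    segments = [generated_code, reference_code or "", natural_language or ""]
    body = f" {SEP} ".join(segments)
    return f"{CLS} {body} {SEP}"


# Ref&NL input: all three fields are present.
example = build_input("return a + b", "return b + a", "Add two numbers.")
```

In practice the special tokens would be inserted by the model's tokenizer rather than by string concatenation; the string form is only meant to make the field order visible.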

For the expected layout of training and test files, refer to our open-source datasets.

Outputs

The outputs include two keys:

scores: Represents the CodeScore value.
passeds: Indicates whether the generated code executes without errors (see Executability under Methodology).
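A minimal sketch of how these two keys might be summarized after inference. It assumes that scores and passeds are parallel lists with one entry per sample and that a passeds value of at least 0.5 means "executable"; the exact on-disk format of the output file is not specified here, and the helper name summarize is hypothetical.

```python
from typing import Dict, Sequence


def summarize(outputs: Dict[str, Sequence[float]],
              threshold: float = 0.5) -> Dict[str, float]:
    """Aggregate CodeScore outputs into two summary numbers.

    Assumptions: 'scores' and 'passeds' are parallel sequences, and a
    'passeds' entry >= threshold counts as executable.
    """
    scores = outputs["scores"]
    passeds = outputs["passeds"]
    if not scores or len(scores) != len(passeds):
        raise ValueError("'scores' and 'passeds' must be non-empty and aligned")
    executable = sum(1 for p in passeds if p >= threshold)
    return {
        "mean_score": sum(scores) / len(scores),
        "executable_rate": executable / len(passeds),
    }


summary = summarize({"scores": [0.5, 1.0], "passeds": [1, 0]})
```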

Inference

To perform inference using CodeScore, run:

python inference.py --cfg configs/models/unified_metric.yaml --ckpt_path your/model/path --test_file your/testfile/path --out_file save/path

Training

To train a CodeScore model, run:

python train.py --cfg configs/models/unified_metric.yaml

Parameters

--cfg: Path to the configuration file.
--ckpt_path: Path to the model checkpoint.
--test_file: Path to the test file.
--out_file: Path to save the results.
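When inference is driven from a script rather than the shell, the command can be assembled from the flags listed above. The sketch below uses only those documented flags; the helper name inference_command and the example paths are placeholders, and it assumes the script is launched from the repository root.

```python
import sys
from typing import List


def inference_command(ckpt_path: str,
                      test_file: str,
                      out_file: str,
                      cfg: str = "configs/models/unified_metric.yaml") -> List[str]:
    """Build the argument list for inference.py from the documented flags."""
    return [
        sys.executable, "inference.py",
        "--cfg", cfg,
        "--ckpt_path", ckpt_path,
        "--test_file", test_file,
        "--out_file", out_file,
    ]


# Placeholder paths; replace them with real locations before running.
cmd = inference_command("your/model/path", "your/testfile/path", "save/path")
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems when paths contain spaces.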

Checkpoints

For model checkpoints, you can check CodeScore on Hugging Face.

Methodology

CodeScore employs a unified code evaluation learning framework called UniCE, which uses LLMs to learn code execution. The functional correctness of the generated code is measured using PassRatio and Executability.

PassRatio

PassRatio is the proportion of test cases that the generated code passes:

PassRatio = (Number of passed test cases) / (Total number of test cases)
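The ratio above can be computed directly when test cases are available. The sketch below is an illustration under simple assumptions: the code under test is a Python callable, each test case is an (arguments, expected output) pair, and a raised exception counts as a failed case. The names pass_ratio and buggy_abs are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple

TestCase = Tuple[tuple, Any]  # (positional arguments, expected output)


def pass_ratio(func: Callable[..., Any], test_cases: Sequence[TestCase]) -> float:
    """Return passed test cases divided by total test cases."""
    if not test_cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this input counts as a failed test case
    return passed / len(test_cases)


def buggy_abs(x: int) -> int:
    """Deliberately wrong 'absolute value' that ignores negative inputs."""
    return x


cases = [((3,), 3), ((-2,), 2), ((0,), 0), ((-5,), 5)]
ratio = pass_ratio(buggy_abs, cases)  # 2 of 4 cases pass
```

Here buggy_abs is correct only for non-negative inputs, so PassRatio is 2 / 4 = 0.5, a partial-credit signal that exact-match metrics do not provide.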

Executability

Executability indicates whether the generated code can be successfully executed without errors.
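A minimal sketch of an executability check for Python snippets, returning 1 when the code runs to completion and 0 when it fails to parse or raises. Running the code in-process with exec is a simplification chosen for brevity; it is not the evaluation harness used by CodeScore, and untrusted code should be run in a sandboxed subprocess with a timeout instead. The function name is_executable is hypothetical.

```python
def is_executable(source: str) -> int:
    """Return 1 if the snippet compiles and runs without raising, else 0.

    Warning: exec runs arbitrary code; use only on trusted snippets.
    """
    try:
        code = compile(source, "<generated>", "exec")
        exec(code, {"__name__": "__generated__"})
    except Exception:  # covers SyntaxError and runtime errors
        return 0
    return 1


ok = is_executable("total = sum(range(5))")
runtime_error = is_executable("value = 1 / 0")
syntax_error = is_executable("def f(:")
```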

Performance

According to the paper, CodeScore correlates more closely with functional correctness than the traditional match-based and LLM-based evaluation metrics it was compared against, achieving state-of-the-art results on multiple code evaluation tasks.

Contribution

Contributions are welcome! Please fork the repository, make your changes, and submit a pull request.

License

This project is licensed under the MIT License.

Contact

For any questions or issues, please contact dingjz@stu.pku.edu.cn.

References

If you use CodeScore in your research, please cite the following paper:

@article{dong2023codescore,
  title={CodeScore: Evaluating Code Generation by Learning Code Execution},
  author={Dong, Yihong and Ding, Jiazheng and Jiang, Xue and Li, Ge and Li, Zhuo and Jin, Zhi},
  journal={arXiv preprint arXiv:2301.09043},
  year={2023}
}
