GAOKAO-bench

GAOKAO-bench is an evaluation framework that utilizes Chinese high school entrance examination (GAOKAO) questions as a dataset to evaluate the language understanding and logical reasoning abilities of large language models.

Introduction

In the past six months, OpenAI has released GPT-3.5-turbo and GPT-4, which have shown remarkable language understanding, logical reasoning, and rich language generation capabilities. Behind the impressive performance of these powerful models, it becomes evident that traditional model evaluation frameworks struggle to accurately assess the exceptional abilities of large language models. Therefore, we aim to establish a standardized and comprehensive evaluation framework to assess large models in all aspects accurately. In China, the National College Entrance Examination (known as GAOKAO) is one of the highest standardized and most comprehensive exams, widely recognized for its rigor. We have collected questions from the National College Entrance Examinations from 2010 to 2022, including the National Examination Papers I, II, and III, to construct the data portion of GAOKAO-bench.

Data Statistics

objective question

Multiple-choice questions

Questions	Number of Questions
2010-2022 Geography MCQs	34
2010-2022 History MCQs	287
2010-2022 Chemistry MCQs	124
2012-2022 English Cloze Test	26
2010-2022 English Fill in Blanks	30
2010-2022 Math(II) MCQs	218
2010-2022 Physics MCQs	64
2010-2013 English MCQs	105
2010-2022 English Reading Comp.	124
2010-2022 Chinese Modern Lit.	29
2010-2022 Chinese Lang. & Usage MCQs	56
2010-2022 Political Science MCQs	320
2010-2022 Math(I) MCQs	214
2010-2022 Biology MCQs	150
Total Number of Questions	1781

Subjective question

Fill-in-the-blank question

Questions	Number of Questions
2010-2022 Math(II) Fill-in-the-Blank	86
2014-2022 English Language Cloze Passage	23
2010-2022 Chinese Language Famous Passages and Sentences Dictation	28
2010-2022 Math(I) Fill-in-the-Blank	81
Total Number of Questions	218

Open-ended Questions

Questions	Number of Questions
2010-2022 Math(II) Open-ended Questions	122
2010-2022 Geography Open-ended Questions	28
2012-2022 English Language Error Correction	26
2010-2022 History Open-ended Questions	128
2010-2022 Chemistry Open-ended Questions	9
2010-2022 Biology Open-ended Questions	116
2010-2022 Math(I) Open-ended Questions	123
2010-2022 Political Science Open-ended Questions	60
2010-2022 Chinese Language Ancient Poetry Reading	29
2010-2022 Physics Open-ended Questions	47
2010-2022 Chinese Language Classical Chinese Reading	29
2010-2022 Chinese Language Language and Writing Skills Open-ended Questions	42
2010-2022 Chinese Language Practical Text Reading	24
2010-2022 Chinese Language Literary Text Reading	29
Total Number of Questions	690

Question type	Number of Questions	percentage
Multiple-choice questions	1781	63.36%
Fill-in-the-blank questions	218	7.76%
Open-ended Questions	812	28.89%
Total Number of Questions	2811	100%

JSON format specification

Field	Description
keywords	Question Title
example	List of questions, including specific information of each question
example/year	Year of the question in the college entrance examination
example/category	Category of the college entrance examination paper where the question is located
example/question	Question stem
example/answer	Answer to the question
example/analysis	Analysis of the question
example/index	Index of the question
example/score (for subjective questions)	Score of the question

The data format is as follows:

        {
            "year": "2010",
            "category": "（新课标）",
            "question": "1．（ 4分）西周分封制在**历史上影响深远。下列省、自治区中，其简称源\n自西周封国国名的是（   ） \nA．河南、河北  B．湖南、湖北  C．山东、山西  D．广东、广西\n",
            "answer": [
                "C"
            ],
            "analysis": "西周分封的诸侯国主要有鲁齐燕卫宋晋 。A项河南的简称是豫 ，河北的\n简称是冀； B项湖南的简称是湘，湖北的简称是鄂； D项广东的简称是粤，\n广西的简称是桂。其简称都不是源自西周封国国名， 故排除 ABD三项。  \nC项山东的简称是鲁 ，山西的简称是晋 ，其简称都是源自西周封国国名 。故C项\n正确。  \n故选： C。\n",
            "index": 0
        },

Model Output

The format of the model output is as follows:

    {
        "index": 0,
        "year": "2010",
        "category": "（新课标）",
        "question": "1．（ 4分）西周分封制在**历史上影响深远。下列省、自治区中，其简称源\n自西周封国国名的是（   ） \nA．河南、河北  B．湖南、湖北  C．山东、山西  D．广东、广西\n",
        "standard_answer": [
            "C"
        ],
        "analysis": "西周分封的诸侯国主要有鲁齐燕卫宋晋 。A项河南的简称是豫 ，河北的\n简称是冀； B项湖南的简称是湘，湖北的简称是鄂； D项广东的简称是粤，\n广西的简称是桂。其简称都不是源自西周封国国名， 故排除 ABD三项。  \nC项山东的简称是鲁 ，山西的简称是晋 ，其简称都是源自西周封国国名 。故C项\n正确。  \n故选： C。\n",
        "model_answer": [
            "C"
        ],
        "model_output": "【解析】\n西周时期实行的分封制度是指周王朝把自己的领土划分成诸侯国，由周王封授给亲戚、功臣等人，以维护王权，加强**集权。这些封国的名称多以地理特征、山川、河流、国都等命名，如鲁、卫、楚、秦等。因此，本题考查的是哪个省、自治区的简称源自西周封国国名，根据选项，可以排除D选项，因为广东、广西的简称并非源自西周封国国名。再根据**地理位置和历史文化背景，可以得出答案。\n【答案】C <eoa>"
    }

Evaluation

The framework of the evaluation is as follows:

File Name	Function
choice_bench.py	Generate answers for Multiple-choice questions
cloze_bench.py	Generate answers for Fill-in-the-blank questions
subjective_bench.py	Generate answers for Open-ended questions
bench_function.py	Contains Functions related to evaluation
选择题prompt	Prompts for Multiple-choice questions
填空题prompt	Prompts for Fill-in-the-blank questions
解答题prompt	Prompts for Open-ended questions
choice_test.py	Evaluates Multiple-choice questions

You can run the choice_bench.py/cloze_bench.py/subjective_bench.py to generate answers by using OpenAI API Keys. The evaluation framework supports gpt-3.5-turbo for Multiple-choice questions, Fill-in-the-blank questions and Open-ended questions; text-davinci-003 for Multiple-choice questions(except 2010-2022 Chinese Modern Lit.) and Fill-in-the-blank questions since some questions may exceed the model context length.

You can run the choice_test.py to evaluate the answers of Multiple-choice questions.

Requirement

joblib==1.1.0
openai==0.27.2

GAOKAO-bench

GAOKAO-bench是一个以**高考题目为数据集，测评中文大模型语言理解能力、逻辑推理能力的测评框架。

Introduction

在过去半年的时间里，openai发布了gpt-3.5-turbo和gpt-4。其展现出的语言理解能力、逻辑推理能力和丰富的语言生成能力令人惊叹。在其强大能力的背后，我们可以看到在大语言模型的背景下传统的模型评测框架难以对这些能力非凡的大模型做出准确的评测。因此我们希望能够建立一个标准化、综合性的评测框架来对大模型进行全方位、准确的评估。在**，高考是标准化水平最高、综合性最强并且认可程度最广的考试之一，我们希望借用高考的题目来评估大模型的能力。因此，我们收集了2010-2022年全国高考卷（涵盖全国卷I、II、III）的题目作为数据集构建起GAOKAO-bench的数据部分。

Data Statistics

客观题

选择题

题目名称	题目数量
2010_2022地理选择题	34
2010_2022历史选择题	287
2010_2022化学选择题	124
2012_2022英语七选五	26
2010_2022英语完形填空	30
2010_2022文数选择题	218
2010_2022物理选择题	64
2010_2013英语单项选择	105
2010_2022英语阅读理解	124
2010_2022语文现代文阅读	29
2010_2022语文语言文字运用选择题	56
2010_2022政治选择题	320
2010_2022理数选择题	214
2010_2022生物选择题	150
题目总数	1781

主观题

填空题

题目名称	题目数量
2010_2022文数填空题	86
2014_2022英语短文填词	23
2010_2022语文名篇名句默写	28
2010_2022理数填空题	81
题目总数	218

解答题

题目名称	题目数量
2010_2022文数解答题	122
2010_2022地理解答题	28
2012_2022英语短文改错	26
2010_2022历史解答题	128
2010_2022化学解答题	9
2010_2022生物解答题	116
2010_2022理数解答题	123
2010_2022政治解答题	60
2010_2022语文古代诗歌阅读	29
2010_2022物理解答题	47
2010_2022语文文言文阅读	29
2010_2022语文语言文字运用解答题	42
2010_2022语文实用类文本阅读	24
2010_2022语文文学类文本阅读	29
题目总数	812

题目类型	题目数量	数量占比
选择题	1781	63.36%
填空题	218	7.76%
解答题	812	28.89%
题目总数	2811	100%

json格式说明

字段	说明
keywords	题目年份，科目等信息
example	题目列表，包含题目具体信息
example/year	题目所在高考卷年份
example/category	题目所在高考卷类型
example/question	题目题干
example/answer	题目答案
example/analysis	题目解析
example/index	题目序号
example/score（主观题部分）	题目分值

数据格式如下所示：

{
  "year": "2010",
  "category": "（新课标）",
  "question": "1．（ 4分）西周分封制在**历史上影响深远。下列省、自治区中，其简称源\n自西周封国国名的是（   ） \nA．河南、河北  B．湖南、湖北  C．山东、山西  D．广东、广西\n",
  "answer": [
    "C"
  ],
  "analysis": "西周分封的诸侯国主要有鲁齐燕卫宋晋 。A项河南的简称是豫 ，河北的\n简称是冀； B项湖南的简称是湘，湖北的简称是鄂； D项广东的简称是粤，\n广西的简称是桂。其简称都不是源自西周封国国名， 故排除 ABD三项。  \nC项山东的简称是鲁 ，山西的简称是晋 ，其简称都是源自西周封国国名 。故C项\n正确。  \n故选： C。\n",
  "index": 0
},

Model Output

模型输出的格式如下所示：

{
  "index": 0,
  "year": "2010",
  "category": "（新课标）",
  "question": "1．（ 4分）西周分封制在**历史上影响深远。下列省、自治区中，其简称源\n自西周封国国名的是（   ） \nA．河南、河北  B．湖南、湖北  C．山东、山西  D．广东、广西\n",
  "standard_answer": [
    "C"
  ],
  "analysis": "西周分封的诸侯国主要有鲁齐燕卫宋晋 。A项河南的简称是豫 ，河北的\n简称是冀； B项湖南的简称是湘，湖北的简称是鄂； D项广东的简称是粤，\n广西的简称是桂。其简称都不是源自西周封国国名， 故排除 ABD三项。  \nC项山东的简称是鲁 ，山西的简称是晋 ，其简称都是源自西周封国国名 。故C项\n正确。  \n故选： C。\n",
  "model_answer": [
    "C"
  ],
  "model_output": "【解析】\n西周时期实行的分封制度是指周王朝把自己的领土划分成诸侯国，由周王封授给亲戚、功臣等人，以维护王权，加强**集权。这些封国的名称多以地理特征、山川、河流、国都等命名，如鲁、卫、楚、秦等。因此，本题考查的是哪个省、自治区的简称源自西周封国国名，根据选项，可以排除D选项，因为广东、广西的简称并非源自西周封国国名。再根据**地理位置和历史文化背景，可以得出答案。\n【答案】C <eoa>"
}

Evaluation

评测框架由如下部分组成：

File Name	Function
choice_bench.py	生成选择题答案
cloze_bench.py	生成填空题答案
subjective_bench.py	生成解答题答案
bench_function.py	包含评测相关的函数
选择题prompt.json	选择题使用的Prompt
填空题prompt.json	填空题使用的Prompt
解答题prompt.json	解答题使用的Prompt
choice_test.py	评测选择题答案

你可以通过调用OpenAI API Keys来运行 choice_bench.py/cloze_bench.py/subjective_bench.py 以生成答案。测评框架支持使用 gpt-3.5-turbo 来测评选择题、填空题和解答题；支持使用 text-davinci-003 来测评选择题（2010_2022语文现代文阅读除外）和填空题，因为部分题目长度可能超出模型的上下文长度。

你可以运行 choice_test.py 来评测选择题的答案。

Requirement

joblib==1.1.0
openai==0.27.2

OpenAndrus / GAOKAO-Bench

GAOKAO-bench

Introduction

Data Statistics

objective question

Subjective question

JSON format specification

Model Output

Evaluation

Requirement

GAOKAO-bench

Introduction

Data Statistics

客观题

主观题

json格式说明

Model Output

Evaluation

Requirement

About

Languages