M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models

Contents: Introduction | How to use M3KE | Evaluation Leaderboard | Citation

Introduction

We propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, developed to measure the knowledge acquired by Chinese large language models in zero- and few-shot settings. We have collected 20,477 questions from 71 tasks. Our selection covers all major levels of the Chinese education system, from primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art and religion. All questions are multiple-choice questions with four options, ensuring a standardized and unified assessment process.

We have assessed, and will continue to assess, a number of Chinese large language models on our benchmark. The models assessed so far are either only pre-trained on massive data or additionally fine-tuned with supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Model sizes range from 335M to 175B parameters.
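
To make the evaluation settings concrete, here is a minimal, illustrative sketch of how one four-option question could be rendered as a zero-shot prompt and how a predicted option letter could be read off a model's output. The prompt template and the helper names (build_zero_shot_prompt, extract_choice) are our own illustration, not the official evaluation scripts, which may format prompts differently.

import re

def build_zero_shot_prompt(example: dict) -> str:
    # Render one M3KE record (question plus options A-D) as a zero-shot prompt.
    # The Chinese instruction line simply says: answer the multiple-choice
    # question and output only the option letter.
    return (
        "请回答下面的单项选择题，只输出选项字母。\n"
        f"问题：{example['question']}\n"
        f"A. {example['A']}\nB. {example['B']}\n"
        f"C. {example['C']}\nD. {example['D']}\n"
        "答案："
    )

def extract_choice(model_output: str) -> str:
    # Take the first A/B/C/D letter found in the model's output, if any.
    match = re.search(r"[ABCD]", model_output)
    return match.group(0) if match else ""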

This is a collaborative research effort between the Natural Language Processing Laboratory at Tianjin University and Huawei Noah’s Ark Lab.

A comparison between M3KE and other relevant benchmarks:

Benchmark   Language   #Tasks   #Questions
MMLU        En         57       15,908
AGIEval     En, Zh     20       8,062
MMCU        Zh         51       11,900
M3KE        Zh         71       20,477

All 71 tasks, grouped by education level and discipline cluster:

Primary school
  Arts & Humanities: Chinese
  Natural Sciences: Math

Junior high school
  Arts & Humanities: Chinese, History
  Social Sciences: Politics
  Natural Sciences: Math, Physics, Biology, Chemistry, Geography

High school
  Arts & Humanities: Chinese, History
  Social Sciences: Politics
  Natural Sciences: Math, Physics, Biology, Chemistry, Geography

College
  Arts & Humanities: Modern History, History Foundation, Modern World History
  Social Sciences: Chinese Constitutional Law, History of Chinese Education, History of the Chinese Legal System, Developmental and Educational Psychology, History of Foreign Education, Experimental Psychology, Introduction to Psychology, Moral Cultivation, Psychology of Teaching, Principles of Pedagogy, Educational Research Methods, Current Affairs and Politics, Introduction to Mao Tsetung Thoughts, Civil Law, Jurisprudence, Sociology, Basic Principle of Marxism, Criminal Jurisprudence, Outline of Chinese Modern History
  Natural Sciences: Humanistic Medicine, Internal Medicine, Animal Physiology, Surgical Sciences, Operating Systems, Data Structures, Probability Theory, Biochemistry, Biochemistry and Pathology, Physiology, Principles of Computer Composition, Computer Networks, Advanced Mathematics, Linear Algebra, Stomatology, Anthropotomy, Pharmacology, Immunology
  Others: Management, Economics

Others
  Arts & Humanities: Film, Music, Dance, Fine Arts
  Natural Sciences: Computer Grade Exam (Computer Fundamentals, Programming Languages)
  Others: Chinese Medicine, Ancient Chinese Language, Novels, Religion, Chinese Civil Service Examination

How to use M3KE

There are several approaches available for loading the M3KE dataset. Below, we provide three methods as references:

Method 1

First, clone this repository manually using the following command:

git clone https://github.com/tjunlp-lab/M3KE.git

Then, load the data using Python's built-in json package as shown below:

import json

# Each line of the JSONL file is one multiple-choice question.
with open("path/to/M3KE/test/Computer Programming Language-Natural Sciences-Other.jsonl", mode="r", encoding="utf-8") as fin:
    for json_line in fin:
        json_data = json.loads(json_line)
        print(json_data)
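
Method 1 loads a single task file; the same approach extends naturally to the whole test split. The snippet below is a convenience sketch under two assumptions: the repository has been cloned to path/to/M3KE, and each task lives in its own .jsonl file under test/, as in the example above.

import glob
import json
import os

# Iterate over every task file in the test split (one JSONL file per task).
for path in sorted(glob.glob("path/to/M3KE/test/*.jsonl")):
    task_name = os.path.splitext(os.path.basename(path))[0]
    with open(path, mode="r", encoding="utf-8") as fin:
        questions = [json.loads(line) for line in fin]
    print(f"{task_name}: {len(questions)} questions")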

Method 2

First, clone this repository manually using the following command:

git clone https://github.com/tjunlp-lab/M3KE.git

Then, load the data using the datasets package as shown below:

from datasets import load_dataset

ds = load_dataset("json", data_files="path/to/M3KE/test/Computer Programming Language-Natural Sciences-Other.jsonl", split="train")
print(ds)
"""
Dataset({
    features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
    num_rows: 236
})
"""

print(ds[0])
"""
{'id': 0, 'question': '下面判断正确的是?', 'A': 'char str[10]={"china"}; 等价于 char str[10];str[]="china";', 'B': 'char *s="china"; 等价于 char *s;s="china"; ', 'C': 'char *a="china"; 等价于 char *a;*a="china";', 'D': 'char c[6]="china",d[6]="china"; 等 价 于 char c[6]=d[6]="china"; ', 'answer': ''}
"""

Method 3

The M3KE dataset has been uploaded to the HuggingFace Datasets Hub, allowing for easy loading without manually cloning this repository. Use the following code to load the dataset:

from datasets import load_dataset

ds = load_dataset(
    path="TJUNLP/M3KE", 
    name="Computer Programming Language-Natural Sciences-Other"
)
print(ds)
"""
DatasetDict({
    test: Dataset({
        features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 236
    })
    dev: Dataset({
        features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 5
    })
})
"""

print(ds["test"][0])
"""
{'id': 0, 'question': '下面判断正确的是?', 'A': 'char str[10]={"china"}; 等价于 char str[10];str[]="china";', 'B': 'char *s="china"; 等价于 char *s;s="china"; ', 'C': 'char *a="china"; 等价于 char *a;*a="china";', 'D': 'char c[6]="china",d[6]="china"; 等 价 于 char c[6]=d[6]="china"; ', 'answer': ''}
"""

Evaluation Leaderboard (more models to be added)

If you want to have your Chinese large language models assessed and added to the leaderboard, please feel free to contact us via liuc_09@tju.edu.cn or submit a pull request.

Average zero-shot accuracy of each evaluated model on four major discipline clusters:

Models Arts&Humanities Social Sciences Natural Sciences Others Average
GLM-335M 0.070 0.046 0.084 0.044 0.062
Bloom-7B 0.163 0.159 0.161 0.158 0.161
GLM-10B 0.180 0.229 0.219 0.150 0.197
GLM-130B 0.326 0.352 0.274 0.359 0.328
ChatGLM-6B 0.246 0.267 0.168 0.263 0.236
MOSS-SFT-16B 0.260 0.263 0.207 0.275 0.251
BELLE-7B-0.2M 0.247 0.296 0.260 0.260 0.266
BELLE-7B-2M 0.328 0.367 0.282 0.355 0.333
LLaMA-7B-2M 0.256 0.227 0.206 0.244 0.233
LLaMA-13B-2M 0.294 0.316 0.246 0.279 0.284
AquilaChat-7B 0.256 0.253 0.229 0.246 0.246
GPT3.5-turbo 0.460 0.538 0.444 0.481 0.481
GPT-4 0.588 0.676 0.623 0.665 0.638

Average five-shot accuracy of each evaluated model on four major discipline clusters:

Models Arts&Humanities Social Sciences Natural Sciences Others Average
GLM-335M 0.220 0.247 0.193 0.126 0.196
Bloom-7B 0.247 0.260 0.235 0.246 0.247
GLM-10B 0.294 0.304 0.232 0.211 0.260
GLM-130B 0.297 0.329 0.246 0.228 0.275
ChatGLM-6B 0.188 0.175 0.121 0.198 0.171
MOSS-SFT-16B 0.266 0.264 0.258 0.284 0.268
BELLE-7B-0.2M 0.292 0.327 0.273 0.307 0.299
BELLE-7B-2M 0.287 0.309 0.284 0.313 0.298
LLaMA-7B-2M 0.273 0.257 0.222 0.250 0.251
LLaMA-13B-2M 0.241 0.234 0.138 0.219 0.208
AquilaChat-7B 0.257 0.249 0.248 0.264 0.255
baichuan-7B 0.266 0.264 0.175 0.241 0.237
GPT3.5-turbo 0.453 0.540 0.464 0.476 0.483

Citation

@misc{liu2023m3ke,
    title={M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models},
    author={Chuang Liu and Renren Jin and Yuqi Ren and Linhao Yu and Tianyu Dong and Xiaohan Peng and Shuting Zhang and Jianxiang Peng and Peiyi Zhang and Qingqing Lyu and Xiaowen Su and Qun Liu and Deyi Xiong},
    year={2023},
    eprint={2305.10263},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
