M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models

Contents: Introduction | How to use M3KE | Evaluation Leaderboard | Citation

Introduction

We propose M3KE, a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark, developed to measure the knowledge acquired by Chinese large language models in zero- and few-shot settings. We have collected 20,477 questions from 71 tasks. Our selection covers all major levels of the Chinese education system, from primary school to college, as well as a wide variety of subjects, including humanities, history, politics, law, education, psychology, science, technology, art and religion. All questions are multiple-choice questions with four options, ensuring a standardized and unified assessment process.

We have assessed, and will continue to assess, a number of Chinese large language models on our benchmark. The models assessed so far are either only pre-trained on massive data or additionally fine-tuned with supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Model sizes range from 335M to 175B parameters.
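
To make the evaluation settings concrete, here is a minimal, illustrative sketch of how one four-option question could be rendered as a zero-shot prompt and how a predicted option letter could be read off a model's output. The prompt template and the helper names (build_zero_shot_prompt, extract_choice) are our own illustration, not the official evaluation scripts, which may format prompts differently.

import re

def build_zero_shot_prompt(example: dict) -> str:
    # Render one M3KE record (question plus options A-D) as a zero-shot prompt.
    # The Chinese instruction line simply says: answer the multiple-choice
    # question and output only the option letter.
    return (
        "请回答下面的单项选择题，只输出选项字母。\n"
        f"问题：{example['question']}\n"
        f"A. {example['A']}\nB. {example['B']}\n"
        f"C. {example['C']}\nD. {example['D']}\n"
        "答案："
    )

def extract_choice(model_output: str) -> str:
    # Take the first A/B/C/D letter found in the model's output, if any.
    match = re.search(r"[ABCD]", model_output)
    return match.group(0) if match else ""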

This is a collaborative research effort between the Natural Language Processing Laboratory at Tianjin University and Huawei Noah’s Ark Lab.

A comparison between M3KE and other relevant benchmarks:

Benchmark   Language   #Tasks   #Questions
MMLU        En         57       15,908
AGIEval     En, Zh     20       8,062
MMCU        Zh         51       11,900
M3KE        Zh         71       20,477

All 71 tasks, grouped by education level and discipline cluster:

Primary school
  Arts & Humanities: Chinese
  Natural Sciences: Math

Junior high school
  Arts & Humanities: Chinese, History
  Social Sciences: Politics
  Natural Sciences: Math, Physics, Biology, Chemistry, Geography

High school
  Arts & Humanities: Chinese, History
  Social Sciences: Politics
  Natural Sciences: Math, Physics, Biology, Chemistry, Geography

College
  Arts & Humanities: Modern History, History Foundation, Modern World History
  Social Sciences: Chinese Constitutional Law, History of Chinese Education, History of the Chinese Legal System, Developmental and Educational Psychology, History of Foreign Education, Experimental Psychology, Introduction to Psychology, Moral Cultivation, Psychology of Teaching, Principles of Pedagogy, Educational Research Methods, Current Affairs and Politics, Introduction to Mao Tsetung Thoughts, Civil Law, Jurisprudence, Sociology, Basic Principle of Marxism, Criminal Jurisprudence, Outline of Chinese Modern History
  Natural Sciences: Humanistic Medicine, Internal Medicine, Animal Physiology, Surgical Sciences, Operating Systems, Data Structures, Probability Theory, Biochemistry, Biochemistry and Pathology, Physiology, Principles of Computer Composition, Computer Networks, Advanced Mathematics, Linear Algebra, Stomatology, Anthropotomy, Pharmacology, Immunology
  Others: Management, Economics

Others
  Arts & Humanities: Film, Music, Dance, Fine Arts
  Natural Sciences: Computer Grade Exam (Computer Fundamentals, Programming Languages)
  Others: Chinese Medicine, Ancient Chinese Language, Novels, Religion, Chinese Civil Service Examination

How to use M3KE

There are several approaches available for loading the M3KE dataset. Below, we provide three methods as references:

Method 1

First, clone this repository manually using the following command:

git clone https://github.com/tjunlp-lab/M3KE.git

Then, load the data using Python's built-in json package as shown below:

import json

# Each line of the JSONL file is one multiple-choice question.
with open("path/to/M3KE/test/Computer Programming Language-Natural Sciences-Other.jsonl", mode="r", encoding="utf-8") as fin:
    for json_line in fin:
        json_data = json.loads(json_line)
        print(json_data)
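
Method 1 loads a single task file; the same approach extends naturally to the whole test split. The snippet below is a convenience sketch under two assumptions: the repository has been cloned to path/to/M3KE, and each task lives in its own .jsonl file under test/, as in the example above.

import glob
import json
import os

# Iterate over every task file in the test split (one JSONL file per task).
for path in sorted(glob.glob("path/to/M3KE/test/*.jsonl")):
    task_name = os.path.splitext(os.path.basename(path))[0]
    with open(path, mode="r", encoding="utf-8") as fin:
        questions = [json.loads(line) for line in fin]
    print(f"{task_name}: {len(questions)} questions")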

Method 2

First, clone this repository manually using the following command:

git clone https://github.com/tjunlp-lab/M3KE.git

Then, load the data using the datasets package as shown below:

from datasets import load_dataset

ds = load_dataset("json", data_files="path/to/M3KE/test/Computer Programming Language-Natural Sciences-Other.jsonl", split="train")
print(ds)
"""
Dataset({
    features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
    num_rows: 236
})
"""

print(ds[0])
"""
{'id': 0, 'question': '下面判断正确的是?', 'A': 'char str[10]={"china"}; 等价于 char str[10];str[]="china";', 'B': 'char *s="china"; 等价于 char *s;s="china"; ', 'C': 'char *a="china"; 等价于 char *a;*a="china";', 'D': 'char c[6]="china",d[6]="china"; 等 价 于 char c[6]=d[6]="china"; ', 'answer': ''}
"""

Method 3

The M3KE dataset has been uploaded to the HuggingFace Datasets Hub, allowing for easy loading without manually cloning this repository. Use the following code to load the dataset:

from datasets import load_dataset

ds = load_dataset(
    path="TJUNLP/M3KE", 
    name="Computer Programming Language-Natural Sciences-Other"
)
print(ds)
"""
DatasetDict({
    test: Dataset({
        features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 236
    })
    dev: Dataset({
        features: ['id', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 5
    })
})
"""

print(ds["test"][0])
"""
{'id': 0, 'question': '下面判断正确的是?', 'A': 'char str[10]={"china"}; 等价于 char str[10];str[]="china";', 'B': 'char *s="china"; 等价于 char *s;s="china"; ', 'C': 'char *a="china"; 等价于 char *a;*a="china";', 'D': 'char c[6]="china",d[6]="china"; 等 价 于 char c[6]=d[6]="china"; ', 'answer': ''}
"""

Evaluation Leaderboard (more models to be added)

If you want to have your Chinese large language models assessed and added to the leaderboard, please feel free to contact us via liuc_09@tju.edu.cn or submit a pull request.

Average zero-shot accuracy of each evaluated model on four major discipline clusters:

Models Arts&Humanities Social Sciences Natural Sciences Others Average
GLM-335M 0.070 0.046 0.084 0.044 0.062
Bloom-7B 0.163 0.159 0.161 0.158 0.161
GLM-10B 0.180 0.229 0.219 0.150 0.197
GLM-130B 0.326 0.352 0.274 0.359 0.328
ChatGLM-6B 0.246 0.267 0.168 0.263 0.236
MOSS-SFT-16B 0.260 0.263 0.207 0.275 0.251
BELLE-7B-0.2M 0.247 0.296 0.260 0.260 0.266
BELLE-7B-2M 0.328 0.367 0.282 0.355 0.333
LLaMA-7B-2M 0.256 0.227 0.206 0.244 0.233
LLaMA-13B-2M 0.294 0.316 0.246 0.279 0.284
AquilaChat-7B 0.256 0.253 0.229 0.246 0.246
GPT3.5-turbo 0.460 0.538 0.444 0.481 0.481
GPT-4 0.588 0.676 0.623 0.665 0.638

Average five-shot accuracy of each evaluated model on four major discipline clusters:

Models Arts&Humanities Social Sciences Natural Sciences Others Average
GLM-335M 0.220 0.247 0.193 0.126 0.196
Bloom-7B 0.247 0.260 0.235 0.246 0.247
GLM-10B 0.294 0.304 0.232 0.211 0.260
GLM-130B 0.297 0.329 0.246 0.228 0.275
ChatGLM-6B 0.188 0.175 0.121 0.198 0.171
MOSS-SFT-16B 0.266 0.264 0.258 0.284 0.268
BELLE-7B-0.2M 0.292 0.327 0.273 0.307 0.299
BELLE-7B-2M 0.287 0.309 0.284 0.313 0.298
LLaMA-7B-2M 0.273 0.257 0.222 0.250 0.251
LLaMA-13B-2M 0.241 0.234 0.138 0.219 0.208
AquilaChat-7B 0.257 0.249 0.248 0.264 0.255
baichuan-7B 0.266 0.264 0.175 0.241 0.237
GPT3.5-turbo 0.453 0.540 0.464 0.476 0.483

Citation

@misc{liu2023m3ke,
    title={M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models},
    author={Chuang Liu and Renren Jin and Yuqi Ren and Linhao Yu and Tianyu Dong and Xiaohan Peng and Shuting Zhang and Jianxiang Peng and Peiyi Zhang and Qingqing Lyu and Xiaowen Su and Qun Liu and Deyi Xiong},
    year={2023},
    eprint={2305.10263},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
