SUFE-AIFLM-Lab / FinEval

FinEval is a collection of high-quality multiple-choice questions covering finance, economy, accounting, and professional certificates.


Welcome to FinEval

Large language models (LLMs) have demonstrated exceptional performance on a variety of natural language processing tasks, yet their efficacy on more challenging, domain-specific tasks remains largely unexplored. FinEval is a benchmark specifically designed to evaluate financial domain knowledge in LLMs.

FinEval is a collection of high-quality multiple-choice questions covering finance, economy, accounting, and professional certificates. It includes 4,661 questions spanning 34 academic subjects. To ensure a comprehensive evaluation of model performance, FinEval employs several settings, including zero-shot, few-shot, answer-only, and chain-of-thought prompts. Evaluating state-of-the-art Chinese and English LLMs on FinEval, we find that only GPT-4 achieves an average accuracy close to 70%, indicating significant room for growth in LLMs' financial domain knowledge. FinEval thus offers a more comprehensive financial knowledge evaluation benchmark, built from mock-exam data and covering a wide range of evaluated LLMs.




Performance Leaderboard

We divide the evaluation into Answer Only and Chain of Thought (CoT) settings. For example prompts for both methods, please refer to zero-shot for Answer Only, few-shot for Answer Only, and Chain of Thought.
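
For illustration, below is a minimal sketch of what a zero-shot Answer Only prompt might look like. The template wording, variable names, and example question are assumptions made for this sketch (FinEval's questions are in Chinese); refer to the linked prompt examples for the exact templates.

    # Hypothetical zero-shot Answer Only prompt template; FinEval's actual
    # templates (see the linked examples) may word this differently, in Chinese.
    PROMPT_TEMPLATE = (
        "The following is a multiple-choice question about {subject}.\n"
        "Answer with the letter of the correct option only.\n\n"
        "{question}\n"
        "A. {A}\nB. {B}\nC. {C}\nD. {D}\n"
        "Answer:"
    )

    # Example usage with an invented question.
    prompt = PROMPT_TEMPLATE.format(
        subject="finance",
        question="Which of the following is a money market instrument?",
        A="Common stock", B="Treasury bill", C="Corporate bond", D="Real estate",
    )
    print(prompt)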

Below is the average accuracy (%) on the test split. We report the average accuracy over the subjects within each category; the "Average" column gives the average accuracy over all subjects. Note that we report each model's result under its best setting, determined by the highest average accuracy among the four settings (i.e., zero- and few-shot learning, each with and without CoT):

Model                    Size     Finance  Economy  Accounting  Certificate  Average
GPT-4                    unknown  71.0     74.5     59.3        70.4         68.6
ChatGPT                  175B     59.3     61.6     45.2        55.1         55.0
Qwen-7B                  7B       54.5     54.4     50.3        55.8         53.8
Qwen-Chat-7B             7B       51.5     52.1     44.5        53.6         50.5
Baichuan-13B-Base        13B      52.6     50.2     43.4        53.5         50.1
Baichuan-13B-Chat        13B      51.6     51.1     41.7        52.8         49.4
ChatGLM2-6B              6B       46.5     46.4     44.5        51.5         47.4
InternLM-7B              7B       49.0     49.2     40.5        49.4         47.1
InternLM-Chat-7B         7B       48.4     49.1     40.8        49.5         47.0
LLaMA-2-Chat-70B         70B      47.1     46.7     41.5        45.7         45.2
Falcon-40B               40B      45.4     43.2     35.8        44.8         42.4
Baichuan-7B              7B       44.9     41.5     34.9        45.6         42.0
LLaMA-2-Chat-13B         13B      41.6     38.4     34.1        42.1         39.3
Ziya-LLaMA-13B-v1        13B      43.3     36.9     34.3        41.2         39.3
Bloomz-7b1-mt            7B       41.4     42.1     32.5        39.7         38.8
LLaMA-2-13B              13B      39.5     38.6     31.6        39.6         37.4
ChatGLM-6B               6B       38.8     36.2     33.8        39.1         37.2
Chinese-Llama-2-7B       7B       37.8     37.8     31.4        36.7         35.9
Chinese-Alpaca-Plus-7B   7B       30.5     33.4     32.7        38.5         34.0
moss-moon-003-sft        16B      35.6     34.3     28.7        35.6         33.7
LLaMA-2-Chat-7B          7B       35.6     31.8     31.9        34.0         33.5
LLaMA-2-7B               7B       34.9     36.4     31.4        31.6         33.4
AquilaChat-7B            7B       34.2     31.3     29.8        36.2         33.1
moss-moon-003-base       16B      32.2     33.1     29.2        30.7         31.2
Aquila-7B                7B       27.1     31.6     32.4        33.6         31.2
LLaMA-13B                13B      33.1     29.7     27.2        33.6         31.1
Falcon-7B                7B       28.5     28.2     27.5        27.4         27.9

Installation

Below are the steps for quick installation. For detailed instructions, please refer to the Installation Guide.

    conda create --name fineval_venv python=3.8
    conda activate fineval_venv
    git clone https://github.com/SUFE-AIFLM-Lab/FinEval
    cd FinEval
    pip install -r requirements.txt

Dataset Preparation

Download the dataset using Hugging Face datasets. To download and decompress it manually, run the following commands from the FinEval project root, then make sure the extracted files end up under the FinEval/code/data directory (renaming the unzipped folder to data if needed):

cd code/data
wget https://huggingface.co/datasets/SUFE-AIFLM-Lab/FinEval/resolve/main/FinEval.zip
unzip FinEval.zip

After the dataset is decompressed, the directory structure is as follows:

    data
    ├── dev: the dev set for each subject, containing the five demonstration examples with explanations used for few-shot evaluation
    ├── val: the val set, mainly used for self-testing the model; scores can be obtained directly
    └── test: used for the final evaluation of the model; the test set answers are not public, so users must submit their results as submission.json, and the resulting scores appear on the final leaderboard
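
Once unzipped, the per-subject files can be inspected directly. Below is a minimal sketch that assumes the files are CSVs named like <subject>_dev.csv with a C-Eval-style schema (id, question, options A–D, answer, explanation); both the file name and the columns are assumptions, so check the actual contents of the data directory:

    import pandas as pd

    # Hypothetical example: inspect the dev split of one subject.
    # The file name "finance_dev.csv" and the column names below are
    # assumptions; adjust them to match the actual files under data/dev.
    df = pd.read_csv("data/dev/finance_dev.csv")
    print(df.columns.tolist())        # e.g. id, question, A, B, C, D, answer, explanation
    print(df.iloc[0]["question"])     # first demonstration question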

Evaluation

Please read Get started quickly to learn how to run an evaluation task.

Supporting New Datasets and Models

If you need to incorporate a new dataset for evaluation, please refer to Add a dataset.

If you need to load a new model, please refer to Add a Model.

How to Submit

Assessment results are saved under output_path=$PROJ_HOME/output_dir/${exp_name}/$exp_date. Within this folder, the submission.json file is generated automatically; users only need to submit this file.

Instructions for the saving location can be found in the How to run section.

## The key inside each subject is the "id" field in the dataset
{
    "banking_practitioner_qualification_certificate": {
        "0": "A",
        "1": "B",
        "2": "B",
        ...
    },

    "Subject Name": {
        "0": "Answer1",
        "1": "Answer2",
        ...
    }
    ...
}
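
If your pipeline holds predictions as ordered per-subject lists, a small sketch like the following assembles them into the format above (the predictions variable and the single example subject are illustrative, not part of FinEval's code):

    import json

    # Hypothetical input: each subject maps to its predicted answer letters,
    # ordered by the "id" field of the dataset.
    predictions = {
        "banking_practitioner_qualification_certificate": ["A", "B", "B"],
    }

    # Convert to the submission format shown above: {subject: {"id": answer}}.
    submission = {
        subject: {str(i): answer for i, answer in enumerate(answers)}
        for subject, answers in predictions.items()
    }

    with open("submission.json", "w", encoding="utf-8") as f:
        json.dump(submission, f, ensure_ascii=False, indent=4)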

You can submit the generated submission.json file to zhang.liwen@shufe.edu.cn by email.

Citation

@misc{2308.09975,
    author = {Liwen Zhang and Weige Cai and Zhaowei Liu and Zhi Yang and Wei Dai and Yujie Liao and Qianru Qin and Yifei Li and Xingyu Liu and Zhiqiang Liu and Zhoufan Zhu and Anbo Wu and Xin Guo and Yun Chen},
    title = {FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models},
    year = {2023},
    eprint = {arXiv:2308.09975},
}

About

FinEval is a collection of high-quality multiple-choice questions covering finance, economy, accounting, and professional certificates.

License: Apache License 2.0


Languages

Python 93.9%, Shell 6.1%