⚖️
A Comprehensive Evaluation of LLMs on
Legal Judgment Prediction

[📜 Paper][🐱 GitHub]
Quick Start • Citation

Repo for "A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction"
published at EMNLP Findings 2023

💡 Introduction

To comprehensively evaluate the legal capabilities of large language models (LLMs), we propose baseline solutions and conduct an evaluation on the task of legal judgment prediction.

Motivation
Existing benchmarks, e.g., lm_eval_harness, mainly adopt a perplexity-based approach that selects the most probable option as the prediction for classification tasks. However, LLMs typically interact with humans through open-ended generation. It is therefore critical to directly evaluate the contents generated by greedy decoding or sampling.

Evaluation of LM-Generated Contents
We propose an automatic evaluation pipeline to directly evaluate the generated contents for classification tasks.

  1. Prompt LMs with the task instruction to generate class labels. The generated contents may not strictly match the standard label names.
  2. A parser then maps the generated contents to labels based on text similarity scores (a minimal sketch follows this list).
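
A minimal sketch of such a similarity-based parser, assuming a simple difflib ratio as the similarity score; the function name parse_label and the example labels are illustrative, not the repo's actual implementation:

```python
# Minimal sketch of a similarity-based label parser; the repo's actual parser
# may use a different similarity measure and label set.
from difflib import SequenceMatcher

def parse_label(generated, label_names):
    """Map free-form generated text to the most similar standard label."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return max(label_names, key=lambda label: similarity(generated, label))

# Example: a noisy generation is mapped back to a canonical charge label.
labels = ["theft", "fraud", "intentional injury"]
print(parse_label("The defendant committed the crime of theft.", labels))  # -> theft
```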

LM + Retrieval System

To assess how LLMs perform in the legal domain when retrieved information is available, additional information, e.g., label candidates and similar cases as demonstrations, is included in the prompts. Considering the combinations of these two kinds of additional information, there are four prompt sub-settings (a sketch of how such prompts could be assembled is given below the figure):

  • (free, zero shot): No additional information. Only task instruction.
  • (free, few shot): Task instruction + demonstrations
  • (multi, zero shot): Task instruction + label candidates (options)
  • (multi, few shot): Task instruction + label candidates + demonstrations

(Figure: illustration of the four prompt settings)
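
The sketch below shows one way the four sub-settings could be assembled into prompts; the function name build_prompt and the exact template wording are assumptions and may differ from the templates used in this repo:

```python
# Hypothetical prompt builder for the four sub-settings; the actual templates
# used by main.py may differ.
def build_prompt(instruction, case_fact, label_candidates=None, demonstrations=None):
    """free-0shot : instruction only
    free-kshot : instruction + demonstrations
    multi-0shot: instruction + label candidates
    multi-kshot: instruction + label candidates + demonstrations
    """
    parts = [instruction]
    if label_candidates:                        # "multi" settings list the options
        parts.append("Options: " + "; ".join(label_candidates))
    for fact, label in (demonstrations or []):  # "few shot" settings add similar cases
        parts.append(f"Case: {fact}\nJudgment: {label}")
    parts.append(f"Case: {case_fact}\nJudgment:")
    return "\n\n".join(parts)

# e.g., multi-2shot:
# build_prompt(instruction, fact, label_candidates=["theft", "fraud"],
#              demonstrations=[(fact1, label1), (fact2, label2)])
```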

🔥 Leaderboard

| rank | model | score | free-0shot | free-1shot | free-2shot | free-3shot | free-4shot | multi-0shot | multi-1shot | multi-2shot | multi-3shot | multi-4shot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt4 | 63.05 | 50.52 | 62.72 | 67.54 | 68.61 | 71.02 | 62.31 | 70.42 | 71.81 | 73.24 | 74.00 |
| 2 | chatgpt | 58.13 | 43.14 | 58.42 | 61.86 | 64.40 | 66.16 | 60.67 | 63.51 | 66.85 | 69.59 | 66.62 |
| 3 | chatglm_6b | 47.74 | 41.89 | 50.30 | 47.76 | 48.59 | 48.67 | 53.74 | 49.26 | 47.56 | 47.61 | 45.32 |
| 4 | bloomz_7b | 44.14 | 46.90 | 53.28 | 51.06 | 50.90 | 49.26 | 50.68 | 29.25 | 27.92 | 25.27 | 23.37 |
| 5 | vicuna_13b | 39.83 | 25.50 | 48.85 | 47.64 | 49.49 | 39.82 | 44.70 | 41.73 | 41.48 | 35.03 | 21.61 |

Note:

  • Metric: Macro-F1
  • $score = (free\text{-}0shot + free\text{-}2shot + multi\text{-}0shot + multi\text{-}2shot)/4$
  • OpenAI model names: gpt-3.5-turbo-0301, gpt-4-0314
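  • For example, gpt4: $(50.52 + 67.54 + 62.31 + 71.81)/4 = 63.045 \approx 63.05$.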

🚀 Quick Start

⚙️ Install

git clone https://github.com/srhthu/LM-CompEval-Legal.git

# Enter the repo
cd LM-CompEval-Legal

pip install -r requirements.txt

bash download_data.sh
# Download evaluation dataset to data_hub/ljp
# Download model generated results to runs/paper_version

The data is available on Google Drive.

Evaluate Models

There are 10 sub_tasks in total: {free,multi}-{0..4}shot (e.g., free-0shot, multi-2shot).

Evaluate a Hugging Face model on all sub_tasks:

CUDA_VISIBLE_DEVICES=0 python main.py \
--config ./config/default_hf.json \
--output_dir ./runs/test/<model_name> \
--model_type hf \
--model <path of model>

Evaluate an OpenAI model on all sub_tasks:

CUDA_VISIBLE_DEVICES=0 python main.py \
--config ./config/default_openai.json \
--output_dir ./runs/test/<model_name> \
--model_type openai \
--model <model name>

To evaluate only a subset of the sub_tasks, add one more argument, e.g.,

--sub_tasks 'free-0shot,free-2shot,multi-0shot,multi-2shot'

The Hugging Face paths of the models evaluated in the paper are:

  • ChatGLM: THUDM/chatglm-6b
  • BLOOMZ: bigscience/bloomz-7b1-mt
  • Vicuna: lmsys/vicuna-13b-delta-v1.1

Features:

  • If the evaluation process is interrupted, just run it again with the same parameters. The process saves model outputs immediately and will skip previously finished samples when resuming (a sketch of this logic follows the list).
  • Samples that trigger a GPU out-of-memory error will be skipped. You can change the configuration and run the process again. (See the suggested GPU configurations below.)
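
A minimal sketch of what such resume-and-skip logic could look like, assuming outputs are saved as JSON lines keyed by a sample id; the file layout and function names are illustrative, not the repo's actual code:

```python
# Hypothetical resume-and-skip loop; the actual bookkeeping in main.py may differ.
import json
import os

def load_finished_ids(output_path):
    """Collect ids of samples whose outputs were already written (one JSON object per line)."""
    if not os.path.exists(output_path):
        return set()
    with open(output_path) as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run_eval(samples, generate, output_path):
    """Generate outputs for all samples, skipping finished ones and OOM failures."""
    finished = load_finished_ids(output_path)
    with open(output_path, "a") as f:
        for sample in samples:
            if sample["id"] in finished:       # resume: skip previously finished samples
                continue
            try:
                output = generate(sample)      # model call; may raise on GPU out-of-memory
            except RuntimeError:
                continue                       # OOM samples are skipped, not fatal
            f.write(json.dumps({"id": sample["id"], "output": output}, ensure_ascii=False) + "\n")
            f.flush()                          # persist immediately so an interruption loses nothing
```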

Suggested GPU configurations

  • 7B model
    • 1 GPU with about 24 GB of memory (e.g., RTX 3090, A5000)
    • If the total GPU memory is >= 32 GB, e.g., 2*RTX 3090 or 1*V100 (32 GB), add the --speed argument for faster inference.
  • 13B model
    • 2 GPUs with >= 24 GB of memory each (e.g., 2*V100)
    • If the total GPU memory is >= 64 GB, e.g., 3*RTX 3090 or 2*V100, add the --speed argument for faster inference.

When the context is long, e.g., in the multi-4shot setting, one 24 GB GPU may be insufficient for a 7B model. You have to either increase the number of GPUs or decrease the demonstration length (default: 500) by modifying the demo_max_len parameter in config/default_hf.json.
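
For example, demo_max_len could be lowered with a short script like this (assuming demo_max_len is a top-level key in config/default_hf.json; the actual config layout may differ):

```python
# Hypothetical helper to shorten demonstrations; assumes demo_max_len is a
# top-level key in config/default_hf.json.
import json

path = "config/default_hf.json"
with open(path) as f:
    config = json.load(f)

config["demo_max_len"] = 300   # default is 500; smaller values shorten the prompt
with open(path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```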

Create Result Table

After evaluating some models locally, the leaderboard can be generated in CSV format:

python scripts/get_result_table.py \
--exp_dir runs/paper_version \
--metric f1  \
--save_path resources/paper_version_f1.csv

Citation

@misc{shui2023comprehensive,
      title={A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction}, 
      author={Ruihao Shui and Yixin Cao and Xiang Wang and Tat-Seng Chua},
      year={2023},
      eprint={2310.11761},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
