A repository for evaluating multilingual LLMs on various tasks.
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
- July 2024: We are building SeaEval v2, with mixed prompt templates and more diverse datasets. SeaEval v1 has moved to the v1-branch.
Installation with pip:

```bash
pip install -r requirements.txt
```
The following example evaluates the Llama-3-8B-Instruct model on the mmlu dataset.
```bash
# This example runs on a single A100 40GB GPU.
# It evaluates on only 50 samples as a quick sanity check.
MODEL_NAME=Meta-Llama-3-8B-Instruct
GPU=0
BATCH_SIZE=4
EVAL_MODE=zero_shot
OVERWRITE=True
NUMBER_OF_SAMPLES=50
DATASET=mmlu

bash eval.sh $DATASET $MODEL_NAME $BATCH_SIZE $EVAL_MODE $OVERWRITE $NUMBER_OF_SAMPLES $GPU

# The results will look like:
# {
#     "accuracy": 0.507615302109403,
#     "category_acc": {
#         "high_school_european_history": 0.6585365853658537,
#         "business_ethics": 0.6161616161616161,
#         "clinical_knowledge": 0.5,
#         "medical_genetics": 0.5555555555555556,
#         ...
```
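The exact output location depends on your run configuration, so the path below is only an assumption; check the directory that eval.sh writes to for your run. Given a result file with the structure shown above, it can be inspected with a few lines of Python:

```python
# Minimal sketch for inspecting a result file like the one above.
# NOTE: the path is a placeholder assumption; use the actual output
# location created by eval.sh for your run.
import json
from pathlib import Path

result_file = Path("log/Meta-Llama-3-8B-Instruct/mmlu_results.json")  # hypothetical path

with result_file.open() as f:
    results = json.load(f)

print(f"Overall accuracy: {results['accuracy']:.4f}")

# List the five weakest categories to see where the model struggles.
weakest = sorted(results["category_acc"].items(), key=lambda kv: kv[1])[:5]
for category, acc in weakest:
    print(f"{category:40s} {acc:.4f}")
```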
This example shows how to get started. To evaluate on the full datasets, please refer to Examples.
```bash
# Run the evaluation script for all datasets
bash demo.sh
```
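If you prefer to drive the runs programmatically rather than through demo.sh, a minimal sketch is shown below. The dataset names come from the table in the next section, and the eval.sh arguments mirror the quick-start example above; the small sample count is kept only to make the sweep fast and is an assumption, not a requirement.

```python
# Minimal sketch: sweep eval.sh over several datasets.
# Dataset names are taken from the supported-dataset table; all other
# settings mirror the quick-start example above.
import subprocess

MODEL_NAME = "Meta-Llama-3-8B-Instruct"
GPU = "0"
BATCH_SIZE = "4"
EVAL_MODE = "zero_shot"
OVERWRITE = "True"
NUMBER_OF_SAMPLES = "50"  # small sample count just for a quick sweep

datasets = ["cross_mmlu", "cross_logiqa", "sg_eval", "mmlu", "flores_zho2eng"]

for dataset in datasets:
    subprocess.run(
        ["bash", "eval.sh", dataset, MODEL_NAME, BATCH_SIZE,
         EVAL_MODE, OVERWRITE, NUMBER_OF_SAMPLES, GPU],
        check=True,
    )
```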
Dataset | Metrics | Status |
---|---|---|
cross_xquad | AC3, Consistency, Accuracy | ✅ |
cross_mmlu | AC3, Consistency, Accuracy | ✅ |
cross_logiqa | AC3, Consistency, Accuracy | ✅ |
sg_eval | Accuracy | ✅ |
cn_eval | Accuracy | ✅ |
us_eval | Accuracy | ✅ |
ph_eval | Accuracy | ✅ |
flores_ind2eng | BLEU | ✅ |
flores_vie2eng | BLEU | ✅ |
flores_zho2eng | BLEU | ✅ |
flores_zsm2eng | BLEU | ✅ |
mmlu | Accuracy | ✅ |
c_eval | Accuracy | ✅ |
cmmlu | Accuracy | ✅ |
zbench | Accuracy | ✅ |
indommlu | Accuracy | ✅ |
ind_emotion | Accuracy | ✅ |
ocnli | Accuracy | ✅ |
c3 | Accuracy | ✅ |
dream | Accuracy | ✅ |
samsum | ROUGE | ✅ |
dialogsum | ROUGE | ✅ |
sst2 | Accuracy | ✅ |
cola | Accuracy | ✅ |
qqp | Accuracy | ✅ |
mnli | Accuracy | ✅ |
qnli | Accuracy | ✅ |
wnli | Accuracy | ✅ |
rte | Accuracy | ✅ |
mrpc | Accuracy | ✅ |
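For the three cross_* datasets, AC3 combines accuracy with cross-lingual consistency, so a model only scores well when it answers correctly and gives matching answers across languages. As a rough, hedged illustration only (the exact definition is in the SeaEval paper and the repository's own scripts are the reference implementation), a harmonic-mean-style combination of the two aggregate scores looks like this:

```python
# Hedged sketch of an AC3-style score for the cross_* datasets:
# a harmonic-mean combination of accuracy and cross-lingual consistency,
# so both must be high for the combined score to be high.
# (Simplified illustration; see the SeaEval paper and scripts for the
# official definition and the per-language-pair consistency computation.)
def ac3(accuracy: float, consistency: float) -> float:
    """Harmonic mean of accuracy and cross-lingual consistency."""
    if accuracy + consistency == 0:
        return 0.0
    return 2 * accuracy * consistency / (accuracy + consistency)

# Example: a model that is often correct but inconsistent across
# languages is penalised relative to plain accuracy.
print(ac3(0.70, 0.50))  # ~0.58
```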
Model | Size | Status |
---|---|---|
Llama-3-8B-Instruct | 8B | ✅ |
-- | 8B | TODO |
To evaluate your own model with SeaEval, add it to model.py and model_src accordingly.
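The exact interface is defined by model.py and the existing wrappers in model_src, so the most reliable route is to copy one of those and adapt it. As a rough, hedged sketch only (the class name, method names, and constructor arguments below are assumptions for illustration, not SeaEval's actual API), a new wrapper typically needs to load a checkpoint and map a batch of prompts to generated answers:

```python
# Hypothetical sketch of a model wrapper for SeaEval.
# NOTE: the class name, method names, and overall interface are
# assumptions for illustration only; mirror an existing wrapper in
# model_src for the interface that model.py actually expects.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class MyCustomModel:
    def __init__(self, model_path: str, device: str = "cuda"):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        # Left padding and an explicit pad token make batched generation
        # behave sensibly for decoder-only models.
        self.tokenizer.padding_side = "left"
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        ).to(device)
        self.model.eval()

    @torch.no_grad()
    def generate(self, prompts: list[str], max_new_tokens: int = 256) -> list[str]:
        """Map a batch of prompt strings to generated answer strings."""
        inputs = self.tokenizer(
            prompts, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)
        outputs = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the prompt tokens so only the generated continuation remains.
        generated = outputs[:, inputs["input_ids"].shape[1]:]
        return self.tokenizer.batch_decode(generated, skip_special_tokens=True)
```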
If you find our work useful, please consider citing our papers!
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
@article{SeaEval,
title={SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning},
author={Wang, Bin and Liu, Zhengyuan and Huang, Xin and Jiao, Fangkai and Ding, Yang and Aw, Ai Ti and Chen, Nancy F.},
journal={NAACL},
year={2024}
}
CRAFT: Extracting and Tuning Cultural Instructions from the Wild
@article{wang2024craft,
title={CRAFT: Extracting and Tuning Cultural Instructions from the Wild},
author={Wang, Bin and Lin, Geyu and Liu, Zhengyuan and Wei, Chengwei and Chen, Nancy F},
journal={ACL 2024 - C3NLP Workshop},
year={2024}
}
CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment
@article{lin2024crossin,
title={CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment},
author={Lin, Geyu and Wang, Bin and Liu, Zhengyuan and Chen, Nancy F},
journal={arXiv preprint arXiv:2404.11932},
year={2024}
}
Contact: seaeval_help@googlegroups.com