
NAACL 2024: SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning



🔥 SeaEval v2 🔥


⚡ A repository for evaluating Multilingual LLMs on various tasks 🚀 ⚡
⚡ SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning 🚀 ⚡

Change log

  • July 2024: We are building SeaEval v2 with mixed prompt templates and more diverse datasets. v1 has moved to the v1-branch.

🔧 Installation

Installation with pip:

pip install -r requirements.txt

⏩ Quick Start

This example evaluates the Llama-3-8B-Instruct model on the mmlu dataset.

# This example was run on a single A100 40G GPU.
# The setting below evaluates on only 50 samples.
MODEL_NAME=Meta-Llama-3-8B-Instruct
GPU=0
BATCH_SIZE=4
EVAL_MODE=zero_shot
OVERWRITE=True
NUMBER_OF_SAMPLES=50

DATASET=mmlu

bash eval.sh $DATASET $MODEL_NAME $BATCH_SIZE $EVAL_MODE $OVERWRITE $NUMBER_OF_SAMPLES $GPU 

# The results will look like:
# {
#     "accuracy": 0.507615302109403,
#     "category_acc": {
#         "high_school_european_history": 0.6585365853658537,
#         "business_ethics": 0.6161616161616161,
#         "clinical_knowledge": 0.5,
#         "medical_genetics": 0.5555555555555556,
#    ...

This example shows how to get started. To evaluate on the full datasets, please refer to Examples.

# Run the evaluation script for all datasets
bash demo.sh

📚 Supported Models and Datasets

Datasets

| Dataset | Metrics | Status |
| --- | --- | --- |
| cross_xquad | AC3, Consistency, Accuracy | ✅ |
| cross_mmlu | AC3, Consistency, Accuracy | ✅ |
| cross_logiqa | AC3, Consistency, Accuracy | ✅ |
| sg_eval | Accuracy | ✅ |
| cn_eval | Accuracy | ✅ |
| us_eval | Accuracy | ✅ |
| ph_eval | Accuracy | ✅ |
| flores_ind2eng | BLEU | ✅ |
| flores_vie2eng | BLEU | ✅ |
| flores_zho2eng | BLEU | ✅ |
| flores_zsm2eng | BLEU | ✅ |
| mmlu | Accuracy | ✅ |
| c_eval | Accuracy | ✅ |
| cmmlu | Accuracy | ✅ |
| zbench | Accuracy | ✅ |
| indommlu | Accuracy | ✅ |
| ind_emotion | Accuracy | ✅ |
| ocnli | Accuracy | ✅ |
| c3 | Accuracy | ✅ |
| dream | Accuracy | ✅ |
| samsum | ROUGE | ✅ |
| dialogsum | ROUGE | ✅ |
| sst2 | Accuracy | ✅ |
| cola | Accuracy | ✅ |
| qqp | Accuracy | ✅ |
| mnli | Accuracy | ✅ |
| qnli | Accuracy | ✅ |
| wnli | Accuracy | ✅ |
| rte | Accuracy | ✅ |
| mrpc | Accuracy | ✅ |
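
For the three cross-lingual datasets (cross_xquad, cross_mmlu, cross_logiqa), Consistency and AC3 are reported alongside Accuracy. The following is an illustrative sketch only, not the repository's evaluation code, and the paper's exact definitions may differ: consistency is read here as how often the model gives the same answer to parallel questions across languages, and AC3 as a single score combining accuracy and consistency.

# Illustrative sketch only - not SeaEval's metric implementation.
# Assumes `predictions` maps language -> list of answers to parallel questions,
# and `labels` is the shared list of gold answers.
from itertools import combinations

def accuracy(preds, labels):
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)

def cross_lingual_consistency(predictions):
    # Fraction of parallel samples answered identically, averaged over language pairs.
    langs = list(predictions)
    n = len(next(iter(predictions.values())))
    pair_scores = [
        sum(predictions[a][i] == predictions[b][i] for i in range(n)) / n
        for a, b in combinations(langs, 2)
    ]
    return sum(pair_scores) / len(pair_scores)

def ac3(acc, consistency):
    # One possible combination of the two scores (harmonic mean).
    return 2 * acc * consistency / (acc + consistency)

# Example usage:
# score = ac3(accuracy(predictions["english"], labels), cross_lingual_consistency(predictions))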

Models

| Model | Size | Status |
| --- | --- | --- |
| Llama-3-8B-Instruct | 8B | ✅ |
| -- | 8B | TODO |

How to evaluate your own model?

To evaluate your own model with SeaEval, add your model to model.py and implement it under model_src accordingly.
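
Below is a minimal sketch of what such a wrapper might look like, assuming a Hugging Face causal LM. The class name, method name, and arguments are illustrative assumptions only; check model.py and the existing classes under model_src for the exact interface SeaEval expects.

# Hypothetical wrapper - the interface shown is illustrative, not SeaEval's actual API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class MyCustomModel:
    def __init__(self, model_path, device="cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
        self.device = device

    def generate(self, prompt, max_new_tokens=256):
        # Tokenize the prompt, generate a continuation, and return only the newly generated text.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

Once the model is registered, its name should be usable as MODEL_NAME with eval.sh, as in the Quick Start example above.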

📚 Citation

If you find our work useful, please consider citing our paper!

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

@article{SeaEval,
  title={SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning},
  author={Wang, Bin and Liu, Zhengyuan and Huang, Xin and Jiao, Fangkai and Ding, Yang and Aw, Ai Ti and Chen, Nancy F.},
  journal={NAACL},
  year={2024}
}

CRAFT: Extracting and Tuning Cultural Instructions from the Wild

@article{wang2024craft,
  title={CRAFT: Extracting and Tuning Cultural Instructions from the Wild},
  author={Wang, Bin and Lin, Geyu and Liu, Zhengyuan and Wei, Chengwei and Chen, Nancy F},
  journal={ACL 2024 - C3NLP Workshop},
  year={2024}
}

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

@article{lin2024crossin,
  title={CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment},
  author={Lin, Geyu and Wang, Bin and Liu, Zhengyuan and Chen, Nancy F},
  journal={arXiv preprint arXiv:2404.11932},
  year={2024}
}

Contact: seaeval_help@googlegroups.com
