EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

ValueError occurs when trying to evaluate task "bigbench_multiple_choice"

abzb1 opened this issue · comments

Hello,

I'm trying to evaluate some HF 🤗 models with lm-eval. When I use the "bigbench_multiple_choice" task, I encounter a ValueError in certain subtasks. I'd appreciate help with resolving this.

Below is my script:

```
lm_eval --model hf \
    --model_args pretrained=allenai/OLMo-7B,trust_remote_code=true \
    --tasks bigbench_multiple_choice \
    --device cuda:0 \
    --batch_size auto \
    --log_samples \
    --output_path logit_result
```

I also tried some other models (Llama 3, Mistral), but they hit the same error:

```
[Task: bigbench_conlang_translation_multiple_choice] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
....
File "", line 1, in top-level template code
ValueError: 'The teacher carries taro here.' is not in list
```

However, when I select only one specific task, for example `bigbench_implicit_relations_multiple_choice`, it runs without any issues. I suspect the error occurs during the task configuration stage. Do you have any ideas?
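For what it's worth, the single-task run can also be done from Python; this is a rough sketch assuming the harness's 0.4.x `lm_eval.simple_evaluate` API (the arguments mirror the CLI flags above):

```python
import lm_eval

# Sketch, assuming the 0.4.x Python API: run one known-good subtask in
# isolation, which avoids the subsets whose targets break the template.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=allenai/OLMo-7B,trust_remote_code=True",
    tasks=["bigbench_implicit_relations_multiple_choice"],
    device="cuda:0",
    batch_size="auto",
)
print(results["results"])
```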

My environment (RTX A6000 GPU, Ubuntu 22.04) is as follows:

```
absl-py 2.1.0
accelerate 0.29.3
aiohttp 3.9.5
aiosignal 1.3.1
attrs 23.2.0
certifi 2024.2.2
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
DataProperty 1.0.1
datasets 2.18.0
dill 0.3.8
evaluate 0.4.1
filelock 3.13.4
frozenlist 1.4.1
fsspec 2024.2.0
huggingface-hub 0.22.2
idna 3.7
Jinja2 3.1.3
joblib 1.4.0
jsonlines 4.0.0
lm_eval 0.4.2 /home/ohs/eval/lm-evaluation-harness
lxml 5.2.1
MarkupSafe 2.1.5
mbstrdecoder 1.1.3
more-itertools 10.2.0
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.3
nltk 3.8.1
numexpr 2.10.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.2.2
pathvalidate 3.2.0
peft 0.10.0
pip 24.0
portalocker 2.8.2
psutil 5.9.8
pyarrow 15.0.2
pyarrow-hotfix 0.6
pybind11 2.12.0
pytablewriter 1.2.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.4.16
requests 2.31.0
responses 0.18.0
rouge-score 0.1.2
sacrebleu 2.4.2
safetensors 0.4.3
scikit-learn 1.4.2
scipy 1.13.0
setuptools 65.5.0
six 1.16.0
sqlitedict 2.1.0
sympy 1.12
tabledata 1.3.3
tabulate 0.9.0
tcolorpy 0.1.4
threadpoolctl 3.4.0
tokenizers 0.19.1
torch 2.2.2
tqdm 4.66.2
tqdm-multiprocess 0.0.11
transformers 4.41.0.dev0 /home/ohs/eval/transformers
triton 2.2.0
typepy 1.3.2
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.1
word2number 1.1
xxhash 3.4.1
yarl 1.9.4
zstandard 0.22.0
```

I think the error is raised at line 9 of the task YAML: `doc_to_target: "{{multiple_choice_targets.index(targets[0])}}"`. The `index` method raises the ValueError when `targets[0]` is not present in `multiple_choice_targets`.
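Here is a minimal sketch of that failure mode outside the harness (the field names follow the hails/bigbench schema; the row values are made up for illustration):

```python
from jinja2 import Environment

# The doc_to_target template calls Python's list.index() inside Jinja,
# so the ValueError propagates out of render() whenever the target
# string is absent from the row's multiple_choice_targets.
template = Environment().from_string(
    "{{ multiple_choice_targets.index(targets[0]) }}"
)

row = {
    "targets": ["The teacher carries taro here."],
    "multiple_choice_targets": ["Option A", "Option B"],  # target missing
}
template.render(**row)  # ValueError: '...' is not in list
```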

I'm looking at https://huggingface.co/datasets/hails/bigbench; maybe that dataset repo can shed some light?

I found that some subtasks don't fit the expected format. I haven't done a complete survey, but the subsets mentioned above and the ones I ran into problems with include the following:

strategyqa_zero_shot
ascii_word_recognition_zero_shot
auto_categorization_zero_shot
auto_debugging_zero_shot
bridging_anaphora_resolution_barqa_zero_shot
chess_state_tracking_zero_shot
chinese_remainder_theorem_zero_shot
codenames_zero_shot
conlang_translation_zero_shot
cryptonite_zero_shot
disfl_qa_zero_shot
few_shot_nlg_zero_shot
gem_zero_shot
hindi_question_answering_zero_shot
...

Anyone who wants to evaluate bigbench_multiple_choice should carefully check whether each subtask's data actually supports the multiple-choice format 😃
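If you want to do that check programmatically, here is a rough sketch (it assumes the hails/bigbench schema with `targets` and `multiple_choice_targets` columns; the subset name is just an example):

```python
from datasets import load_dataset

# Flag rows whose target string is missing from multiple_choice_targets,
# which is exactly the condition that makes doc_to_target raise ValueError.
def find_bad_rows(subset: str) -> dict:
    bad = {}
    for split, ds in load_dataset("hails/bigbench", subset).items():
        bad[split] = [
            i for i, row in enumerate(ds)
            if not row["targets"]
            or row["targets"][0] not in row["multiple_choice_targets"]
        ]
    return bad

print(find_bad_rows("conlang_translation_zero_shot"))
```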