Got stuck when evaluating MMLU
zhangliang-04 opened this issue · comments
Thanks for open-sourcing this! I'm trying to evaluate Llama-7b-hf
on mmlu-fr
, but a warning of Token indices sequence length is longer than the specified maximum sequence length for this model (5023 > 4096). Running this sequence through the model will result in indexing errors
appears and the process seems to be stuck. Here is the call stack after a keyboard interrupt:
Token indices sequence length is longer than the specified maximum sequence length for this model (5023 > 4096). Running this sequence through the model will result in indexing errors
^CTraceback (most recent call last):
File "/data2/zl/code/mlmm-evaluation/main.py", line 135, in <module>
main()
File "/data2/zl/code/mlmm-evaluation/main.py", line 108, in main
results = evaluator.open_llm_evaluate(
File "/data2/zl/code/mlmm-evaluation/lm_eval/utils.py", line 205, in _wrapper
return fn(*args, **kwargs)
File "/data2/zl/code/mlmm-evaluation/lm_eval/evaluator.py", line 79, in open_llm_evaluate
results = evaluate(
File "/data2/zl/code/mlmm-evaluation/lm_eval/utils.py", line 205, in _wrapper
return fn(*args, **kwargs)
File "/data2/zl/code/mlmm-evaluation/lm_eval/evaluator.py", line 262, in evaluate
resps = getattr(lm, reqtype)([req.args for req in reqs])
File "/data2/zl/code/mlmm-evaluation/lm_eval/base.py", line 181, in loglikelihood
context_enc = self.tok_encode(context)
File "/data2/zl/code/mlmm-evaluation/lm_eval/models/huggingface.py", line 361, in tok_encode
return self.tokenizer.encode(string, add_special_tokens=self.add_special_tokens)
File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2569, in encode
encoded_inputs = self.encode_plus(
File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2977, in encode_plus
return self._encode_plus(
File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
batched_output = self._batch_encode_plus(
File "/opt/conda/envs/lm_eval/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
KeyboardInterrupt
It seems the process is stuck in the batched tokenization. How should I deal with this?
Is your problem solved? Please let me know, since I am dealing with the same issue.
You can limit the number of tokens fed to the model so that it matches the model's maximum token length. This is really a problem with lm_eval_harness rather than with the additional tasks.
How do we do this exactly?
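One way (a minimal sketch, not necessarily how lm_eval_harness does it internally) is to truncate the encoded context from the left before it reaches the model, so that only the last max-length tokens are kept. The helper name below is hypothetical; you would apply this to the output of the tokenizer's encode step:

```python
def truncate_context(token_ids, max_length):
    """Keep only the last max_length token ids.

    Truncating from the left preserves the end of the context (the part
    closest to the continuation being scored), which is usually what you
    want for loglikelihood evaluation. This is an illustrative sketch,
    not lm_eval_harness's actual implementation.
    """
    if len(token_ids) > max_length:
        return token_ids[-max_length:]
    return token_ids
```

For the case in the warning above, a 5023-token sequence truncated to 4096 would keep the final 4096 tokens and drop the first 927.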
Did you fix this?