TypeError in stderr computation of accuracy
RenatoGeh opened this issue · comments
Hello,
I'm trying to evaluate on the `ai2_arc` task group, but I keep getting this exception:
```
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/evaluator.py", line 234, in simple_evaluate
    results = evaluate(
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/evaluator.py", line 514, in evaluate
    ] = lm_eval.api.metrics.pooled_sample_stderr(stderrs, sizes)
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/api/metrics.py", line 462, in pooled_sample_stderr
    sum([(size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs)])
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/api/metrics.py", line 462, in <listcomp>
    sum([(size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs)])
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
```
Some quick debugging shows `stderrs = [[], []]`, while `results` in `evaluator.py:514` has the value:

```
results = defaultdict(<class 'dict'>, {'arc_easy': {'acc,none': 0.7133838383838383, 'samples': 2376, 'acc_stderr,none': [], 'acc_norm,none': 0.6586700336700336, 'acc_norm_stderr,none': []}, 'arc_challenge': {'acc,none': 0.3890784982935154, 'samples': 1172, 'acc_stderr,none': [], 'acc_norm,none': 0.4087030716723549, 'acc_norm_stderr,none': []}, 'ai2_arc': {'acc,none': 0.6062570462232244}})
```
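The expression at `metrics.py:462` (copied from the traceback) assumes each entry of `stderrs` is a number, so the empty lists above make `stderr**2` raise the `TypeError`. A minimal sketch of just that summation, run outside the harness:

```python
# The summation from lm_eval/api/metrics.py line 462, reproduced in
# isolation (only this one expression is taken from the traceback).
def pooled_term_sum(stderrs, sizes):
    return sum((size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs))

# Numeric per-task stderrs work as intended:
print(pooled_term_sum([0.0093, 0.0142], [2376, 1172]))

# Empty lists (what the harness apparently produces here) fail:
try:
    pooled_term_sum([[], []], [2376, 1172])
except TypeError as e:
    print(e)  # unsupported operand type(s) for ** or pow(): 'list' and 'int'
```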
It seems the harness has trouble computing the standard error of the accuracy when aggregating across several tasks.
Is this a bug, or am I doing something wrong? Is there an option to disable these statistics and report only accuracy, or do I have to change the code manually so that it skips this function?
Thanks
What is the command you are running?
I can't seem to reproduce with `lm_eval --model hf --tasks ai2_arc`; is this issue persisting?
Are there any changes you've made to the code?
I have managed to pinpoint exactly which flag causes this error. Here's a minimal reproducible example:

```python
import lm_eval
import torch

lm = lm_eval.models.huggingface.HFLM(pretrained="google/gemma-2b", dtype=torch.float32)
lm_eval.simple_evaluate(
    model=lm,
    tasks=["ai2_arc"],
    device="cuda",
    limit=10,
    bootstrap_iters=0,
)
```

The error disappears when the `bootstrap_iters=0` option is omitted.
It now makes sense why the exception is raised. However, it is unclear from the documentation what `bootstrap_iters` is supposed to be without reading the code.
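One defensive workaround in the meantime (a sketch only, not lm-eval code; `safe_pooled_stderr` is a hypothetical wrapper name) is to pool only when every per-task stderr is actually a number, and otherwise report a placeholder instead of raising:

```python
import numbers

def safe_pooled_stderr(pool_fn, stderrs, sizes):
    # Hypothetical guard around the pooling function: with bootstrap
    # disabled, per-task stderrs come back as empty lists, so skip
    # pooling and return a placeholder instead of raising a TypeError.
    if all(isinstance(s, numbers.Number) for s in stderrs):
        return pool_fn(stderrs, sizes)
    return "N/A"

# With empty lists, the pooling function is never called:
print(safe_pooled_stderr(lambda st, sz: sum(st), [[], []], [2376, 1172]))  # → N/A
```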
#1789 should fix this; let me know if not!