EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

Type error on stderr computation of accuracy

RenatoGeh opened this issue

Hello,

I'm trying to evaluate on the ai2_arc task group, but I keep getting this exception.

File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/evaluator.py", line 234, in simple_evaluate
    results = evaluate(
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/evaluator.py", line 514, in evaluate
    ] = lm_eval.api.metrics.pooled_sample_stderr(stderrs, sizes)
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/api/metrics.py", line 462, in pooled_sample_stderr
    sum([(size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs)])
  File "/home/renatolg/.local/lib/python3.10/site-packages/lm_eval/api/metrics.py", line 462, in <listcomp>
    sum([(size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs)])
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

Some quick debugging shows stderrs = [[], []], while results in evaluator.py:514 has the value

results = defaultdict(<class 'dict'>, {
    'arc_easy': {'acc,none': 0.7133838383838383, 'samples': 2376,
                 'acc_stderr,none': [], 'acc_norm,none': 0.6586700336700336,
                 'acc_norm_stderr,none': []},
    'arc_challenge': {'acc,none': 0.3890784982935154, 'samples': 1172,
                      'acc_stderr,none': [], 'acc_norm,none': 0.4087030716723549,
                      'acc_norm_stderr,none': []},
    'ai2_arc': {'acc,none': 0.6062570462232244},
})

It seems the harness has trouble computing the standard error of accuracy when aggregating several tasks into a group.
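
To make the failure concrete: the line in the traceback pools per-task standard errors and assumes each entry of stderrs is a float. A small standalone sketch (the stderr values 0.009 and 0.014 are made up for illustration) reproduces the same TypeError once the entries are empty lists, which is exactly what the debug output above shows:

# One entry per subtask (arc_easy, arc_challenge); sizes taken from the results above.
sizes = [2376, 1172]

# With numeric stderrs, the sum from the traceback evaluates fine:
stderrs = [0.009, 0.014]
print(sum((size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs)))

# With the values seen while debugging (empty lists, because no stderr was computed),
# the same expression raises the reported error:
stderrs = [[], []]
sum((size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs))
# TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'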

Is this a bug, or am I doing something wrong here? Is there an option to disable these statistics and only report accuracy, or do I have to manually change the code so it skips this function?

Thanks

What is the command you are running?

I can't seem to reproduce this with lm_eval --model hf --tasks ai2_arc. Is the issue still persisting?

Are there any changes you've made to the code?

I have managed to pinpoint exactly which option causes this error. Here's a minimal reproducible example:

import lm_eval, torch

# Minimal reproduction: evaluate the ai2_arc task group with bootstrap_iters=0.
mgr = lm_eval.tasks.TaskManager()  # constructed but unused here
lm_eval.simple_evaluate(
    model=lm_eval.models.huggingface.HFLM(pretrained="google/gemma-2b", dtype=torch.float32),
    device="cuda", tasks=["ai2_arc"], limit=10, bootstrap_iters=0,
)

The error disappears when omitting the option bootstrap_iters=0.

It now makes sense why the exception is being raised. However, it is unclear from the documentation what bootstrap_iters is supposed to do without going through the code.
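
For anyone else who hits this: as far as I can tell, bootstrap_iters is the number of bootstrap resamples used to estimate each metric's standard error, and setting it to 0 skips that estimate entirely (which is presumably why the stderr entries end up as empty lists). A generic sketch of the idea, not the harness's actual implementation:

import random
import statistics

def bootstrap_stderr(per_sample_scores, bootstrap_iters=1000, seed=1234):
    # Estimate the standard error of mean accuracy by resampling the
    # per-sample 0/1 scores with replacement, bootstrap_iters times.
    rng = random.Random(seed)
    n = len(per_sample_scores)
    means = [statistics.mean(rng.choices(per_sample_scores, k=n))
             for _ in range(bootstrap_iters)]
    return statistics.stdev(means)

scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical per-sample accuracies
print(bootstrap_stderr(scores))          # a float standard-error estimate

With bootstrap_iters=0 there are no resampled means to take a standard deviation of, so no float stderr gets produced in the first place.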

#1789 should fix this; let me know if not!