EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home page: https://www.eleuther.ai

Need wide & small task for fast evaluation

UmerHA opened this issue

Hi all,

I want to do a large-ish study on quantization methods and their effect on model performance. For this, I need an evaluation that is (i) "wide" (i.e., covers a broad set of tasks/topics) and (ii) small (so it's quick and cheap to run).

IIUC, there is currently no task for that.

I suggest we add the dharma2 dataset (samples from 8 tasks, incl. MMLU; 300 examples in total), or alternatively BIG-bench Lite (samples from 24 tasks).

I'll fork this repo and add dharma2. If there's interest, I'd be happy to submit a PR.
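
Roughly, I'd expect the task config to look something like this (a sketch in the harness's YAML task format; the dataset path, split, and field names below are placeholders, not a final config):

```yaml
# Sketch of a dharma2 task config in the harness's YAML format.
# dataset_path, split, and field names are placeholders/assumptions.
task: dharma2
dataset_path: <hf-dataset-for-dharma2>  # hypothetical HF dataset id
output_type: multiple_choice
test_split: test
doc_to_text: "{{question}}"
doc_to_choice: "{{choices}}"
doc_to_target: "{{answer}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```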

We always welcome more task PRs! Additionally, if there's a broad task that meets your needs, you can use the --limit flag to cap how many examples are run per task.
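
For example, a minimal invocation (a sketch, assuming a recent harness version with the lm_eval entry point; the model and example count here are placeholders):

```bash
# Evaluate a Hugging Face model on MMLU, capping evaluation at
# 100 examples per task with --limit to keep the run cheap.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks mmlu \
    --limit 100
```

If I recall the CLI correctly, --limit also accepts a float in (0, 1] to sample a fraction of each task's examples instead of a fixed count.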