hendrycks / test

Measuring Massive Multitask Language Understanding | ICLR 2021

Home Page: https://arxiv.org/abs/2009.03300

Human-level performance?

rodrigonogueira4 opened this issue

Hi, first of all, thanks for releasing this great dataset!

In the abstract you wrote:
"on every one of the 57 tasks, the best models still need substantial improvements before they can reach human-level accuracy",
but I could not find human performance numbers in the paper. Do you plan to include them anytime soon?

Thanks!

We have changed the abstract to say "expert-level accuracy." For nearly all tasks this is >= 90%, so an average score of 90% should eventually be possible.
Human-level performance would vary substantially from human to human. I surmise that most high school graduates would get <= 40%. Colleges with a broad core curriculum (e.g., Columbia, UChicago) might have graduates who score <= 60%.
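For concreteness, here is a minimal sketch of how per-task scores roll up into the overall figure discussed above, assuming an unweighted mean over the 57 tasks. The task names and accuracy values are hypothetical placeholders, not numbers from the paper.

```python
# Sketch: macro-average over MMLU's 57 tasks (unweighted mean).
# The per-task accuracies below are hypothetical placeholders.
per_task_accuracy = {
    "abstract_algebra": 0.92,
    "anatomy": 0.95,
    "astronomy": 0.88,
    # ... one entry per task, 57 in total
}

# The overall score reaches ~90% only if accuracy on nearly
# every individual task is at or above that level.
average = sum(per_task_accuracy.values()) / len(per_task_accuracy)
print(f"Macro-averaged accuracy: {average:.1%}")
```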

Great, thanks for the prompt reply!

Hi @hendrycks and team, thanks for releasing such an awesome dataset! I was wondering whether any publicly releasable MTurk/crowdsourced worker performance data from the paper is available. Specifically, the paper mentions: "Unspecialized humans from Amazon Mechanical Turk obtain 34.5% accuracy on this test". Are there details on the experiment(s) used to get this number? How many participants were run? And did each participant answer questions for just one topic or many? No worries if this is not available, just wanted to check. Thank you!

Got it, thanks for the speedy response @hendrycks ! That's useful to know.