allenai / peS2o

Pretraining Efficiently on S2ORC!

README states S2ORC contains 11.3M papers but the original S2ORC paper claims 80M

jtalmi opened this issue

The original paper says there are 80M papers available: https://aclanthology.org/2020.acl-main.447.pdf

Curious why there's a difference here?

Hi @jtalmi,

There are two factors at play here:

  • The original release of S2ORC contained 80M papers, of which 10.8M were full-text papers. The remaining 68.2M were titles + abstracts only.
  • When the distribution method for S2ORC changed (from a Google Form to the Semantic Scholar API), the dataset was also renamed. S2ORC now refers to just the full-text subset (which contained 11.3M papers in March 2023), while the abstract subset is called S2AG.

Hope that helps!

Best,
Luca

thanks!