README states S2ORC contains 11.3M papers but the original S2ORC paper claims 80M
jtalmi opened this issue
The original paper says there are 80M papers available: https://aclanthology.org/2020.acl-main.447.pdf
Curious why there's a difference here?
Hi @jtalmi,
There are two factors at play here:
- The original release of S2ORC contained 80M papers, of which 10.8M were full-text papers. The remaining 68.2M were titles + abstracts only.
- When the distribution method for S2ORC changed (from a Google Form to the Semantic Scholar API), the dataset was also renamed. S2ORC now refers to just the full-text subset (which contained 11.3M papers in March 2023), while the abstract subset is called S2AG.
Hope that helps!
Best,
Luca
thanks!