EleutherAI / the-pile

RePEc

cfoster0 opened this issue

Language: predominantly English
Date ranges: 1997-2020
Size: the site claims 2M downloadable articles, 800K working papers, 26K books, and 59K chapters

Research Papers in Economics (RePEc) is a collaborative effort of hundreds of volunteers in many countries to enhance the dissemination of research in economics. The heart of the project is a decentralized database of working papers, preprints, journal articles, and software components.

We would extract the text components only. From what I've seen, the documents are PDFs.

http://www.repec.org/

For scraping, we can traverse the following open directories that index the various content forms:

The bottom-level links are pages on the RePEc IDEAS database. The download link on each page (if one exists) seems to be the value of an input button tagged as "url".
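
A rough sketch of that extraction step, using only the Python standard library. It assumes "tagged as url" means an `<input>` element whose `name` attribute is `url`; I haven't verified that against the live page markup. The helper names and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class UrlInputParser(HTMLParser):
    """Collect the value of every <input> whose name attribute is "url".

    Assumption: the download link lives in name="url"; adjust the check
    if the real pages use a different attribute.
    """

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("name") == "url" and attr.get("value"):
            self.urls.append(attr["value"])


def extract_download_urls(html):
    """Return all candidate download URLs found in one page's HTML."""
    parser = UrlInputParser()
    parser.feed(html)
    return parser.urls


# Hypothetical page fragment, not copied from the real site.
sample = """
<form>
  <input type="radio" name="url" value="https://example.org/papers/wp123.pdf">
  <input type="submit" value="Download">
</form>
"""
print(extract_download_urls(sample))
```

Pages with no such input just give back an empty list, which covers the "if one exists" case.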

This should probably be deferred to v2, since (1) it's huge and will take time to download, and (2) I've tried our PDF-to-text on some samples and I think it'll need at least a slight rework.

This would be a good addition.