RePEc
cfoster0 opened this issue · comments
Language: predominantly English
Date ranges: 1997-2020
Size: Claims 2M downloadable articles, 800K working papers, 26K books, and 59K chapters
Research Papers in Economics (RePEc) is a collaborative effort of hundreds of volunteers in many countries to enhance the dissemination of research in economics. The heart of the project is a decentralized database of working papers, preprints, journal articles, and software components.
We would extract only the text components. From what I've seen, the documents are PDFs.
For scraping, we can traverse the following open directories that index the various content forms:
- https://ideas.repec.org/p/ -> Working Papers
- https://ideas.repec.org/a/ -> Journal Articles
- https://ideas.repec.org/b/ -> Books
- https://ideas.repec.org/h/ -> Book Chapters
The bottom-level links are pages on the RePEc Ideas database. The download link on each page (if one exists) seems to be the value of an input button tagged as "url".
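A rough sketch of the two parsing steps above, using only the Python standard library. It assumes that "tagged as url" means an `<input>` element with `name="url"`, which I have not verified across the site. The helper names (`extract_links`, `extract_download_urls`) are made up for illustration, and fetching, rate limiting, and retries are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a directory listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class _UrlInputCollector(HTMLParser):
    """Collects the value of every <input> whose name is "url".

    Assumption: the download link is stored as <input name="url" value="...">.
    """

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == "url":
            value = attr_map.get("value")
            if value:
                self.values.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in a listing page."""
    parser = _LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def extract_download_urls(html):
    """Return candidate download URLs from an item page (may be empty)."""
    parser = _UrlInputCollector()
    parser.feed(html)
    return parser.values
```

Usage would be something like calling `extract_links(listing_html, "https://ideas.repec.org/p/")` on each directory page, walking down to the item pages, and then calling `extract_download_urls(item_html)` on each one. An empty list means the page has no downloadable link.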
This should probably be deferred to v2, since (1) it's huge and will take time to download, and (2) I've tried our PDF-to-text on some samples and I think it'll need at least a slight rework.
This would be a good addition.