View it here: https://roundup.streamlit.app/
The purpose of this project is to regularly track and present the most recent working papers in economics. ('Working papers', also known as 'pre-print' papers, present academic research that has not yet been peer-reviewed.) Run remotely via GitHub Actions once daily at 6:40 AM EST, the project scrapes data from each source website, compares the newly collected data to a historic database of working papers, and presents only the most recent ones on the project dashboard. As of March 2024, the project incorporates data from 21 different sources. The dashboard may be of use to anyone interested in the most active current areas of economics research, such as economists, policy-oriented researchers, and students.
The web scrapers in this project gather six pieces of information on each economics working paper:
- Title
- Authors
- Abstract
- Date published (If it was posted and re-posted, the most recent date of publication is used.)
- URL
- Paper ID number (according to each website's own numbering system)
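Concretely, one scraped record can be pictured as a small mapping plus a source-qualified unique ID. The field names and ID format below are illustrative assumptions, not the project's exact schema:

```python
# Illustrative shape of one scraped working-paper record.
# Field names are assumptions, not the project's exact schema.
paper = {
    "Title": "Inflation Expectations and Household Spending",
    "Authors": "A. Economist; B. Statistician",
    "Abstract": "We study ...",
    "Date": "2024-03-15",                      # most recent posting date
    "Link": "https://example.org/wp/2024-07",
    "Number": "2024-07",                       # the source's own paper ID
}

# Prefixing the paper number with its source (e.g. "BIS2024-07") lets
# records from different websites share one ID namespace.
unique_id = "BIS" + paper["Number"]
```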
Each website has a bespoke scraper module, located in `roundup_scripts/scrapers`. When the GitHub Actions workflow, `.github/workflows/main.yml`, is activated at 6:40 AM EST every morning, it runs the main Python script of the project, `runall.py`, which cycles through each of the web scraping modules. After the data are collected, it runs `roundup_scripts/compare.py`, which compares the recently gathered data to the set of working papers already seen. Papers that are truly novel are assigned an estimated publication date of the day they were first identified and appended to the `historic/papers-we-have-seen-metadata.csv` file.
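In outline, a daily run can be sketched as below. This is an illustrative sketch, not the project's actual code: the stand-in scraper modules and the exact error handling are assumptions, though each real module does expose a `scrape()` function and failures are recorded in `scraper_status.txt`.

```python
import pandas as pd

# Hypothetical stand-ins for the real scraper modules; per the text, each
# real module exposes a scrape() function returning a data frame.
def make_scraper(name, ok=True):
    def scrape():
        if not ok:
            raise RuntimeError(f"{name} page layout changed")
        return pd.DataFrame([{"Title": f"Paper from {name}", "Source": name}])
    return scrape

scrapers = {"BIS": make_scraper("BIS"), "NBER": make_scraper("NBER", ok=False)}

frames, status = [], {}
for name, scrape in scrapers.items():
    # A failing scraper is marked "off" rather than aborting the whole run.
    try:
        frames.append(scrape())
        status[name] = "on"
    except Exception:
        status[name] = "off"

# Aggregate everything that succeeded into one large data frame,
# ready to be handed to the comparison step.
df = pd.concat(frames, ignore_index=True)
```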
The `streamlit_app.py` script produces the project dashboard, a user-friendly aggregation of the most recent economics research. The app draws primarily from the `historic/papers-we-have-seen-metadata.csv` file to populate itself with information.
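At its core, the dashboard's query is simple: load the metadata file, sort by estimated publication date, and show the newest rows. A minimal sketch of that query with pandas (the column names are illustrative, and the Streamlit rendering calls are omitted):

```python
import io
import pandas as pd

# Tiny inline stand-in for historic/papers-we-have-seen-metadata.csv;
# column names here are assumptions, not the project's exact schema.
csv_text = """Title,Source,Date
Paper A,BIS,2024-03-01
Paper B,NBER,2024-03-12
Paper C,IMF,2024-02-20
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# The dashboard's core view: newest papers first.
recent = df.sort_values("Date", ascending=False).head(2)
```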
The purpose of this repository is to maintain and improve the project dashboard. The `runall.py` script and its underlying web scrapers are run remotely using a GitHub Actions workflow. However, the repo may also be run locally on your personal computer. Instructions for local setup may be found in the Getting Started section of this document.
Websites that are scraped for data, as of March 2024, are:

| Name of website | Name of script | Scraping method |
|---|---|---|
| Bureau of Economic Analysis | `roundup_scripts/scrapers/BEA.py` | Scrapes main landing page and each individual WP's landing page using Requests and BeautifulSoup |
| Becker Friedman Institute (at the University of Chicago) | `roundup_scripts/scrapers/BFI.py` | Scrapes main landing page and each individual WP's landing page using Requests and BeautifulSoup |
| Bank for International Settlements | `roundup_scripts/scrapers/BIS.py` | Scrapes RSS feed using Feedparser |
| Bank of England | `roundup_scripts/scrapers/BOE.py` | Scrapes RSS feed using Feedparser and each individual WP's landing page using Requests and BeautifulSoup |
| European Central Bank | `roundup_scripts/scrapers/ECB.py` | Reads main landing page using a headless instance of Chrome via Selenium and parses using BeautifulSoup |
| Federal Reserve Bank of Atlanta | `roundup_scripts/scrapers/Fed_Atlanta.py` | Scrapes RSS feed using Feedparser |
| Federal Reserve Board of Governors (of the United States): working papers | `roundup_scripts/scrapers/Fed_Board.py` | Scrapes main landing page using Requests and BeautifulSoup |
| Federal Reserve Board of Governors (of the United States): Fed Notes | `roundup_scripts/scrapers/Fed_Board_Notes.py` | Scrapes main landing page using Requests and BeautifulSoup |
| Federal Reserve Bank of Boston | `roundup_scripts/scrapers/Fed_Boston.py` | Reads main landing page using a headless instance of Chrome via Selenium and parses with BeautifulSoup. Extracts data from each individual WP's landing page using Requests and BeautifulSoup and reads PDF metadata using PyPDF2 and io |
| Federal Reserve Bank of Chicago | `roundup_scripts/scrapers/Fed_Chicago.py` | Scrapes main landing page and each individual WP's landing page using Requests and BeautifulSoup |
| Federal Reserve Bank of Cleveland | `roundup_scripts/scrapers/Fed_Cleveland.py` | Reads main landing page using a headless instance of Chrome via Selenium and parses using BeautifulSoup |
| Federal Reserve Bank of Dallas | `roundup_scripts/scrapers/Fed_Dallas.py` | Scrapes main landing page using Requests and BeautifulSoup and also reads data from PDFs using PyPDF and io |
| Federal Reserve Bank of Kansas City | `roundup_scripts/scrapers/Fed_KansasCity.py` | Reads main landing page using Requests and parses with BeautifulSoup. Extracts data from each individual WP's landing page using Requests and BeautifulSoup |
| Federal Reserve Bank of Minneapolis | `roundup_scripts/scrapers/Fed_Minneapolis.py` | Scrapes main landing page and each individual WP's landing page using Requests and BeautifulSoup |
| Federal Reserve Bank of New York | `roundup_scripts/scrapers/Fed_NewYork.py` | Uses Requests to access the New York Fed API for JSON-formatted data on recent publications. Scrapes each individual WP's landing page using Requests and BeautifulSoup |
| Federal Reserve Bank of Philadelphia | `roundup_scripts/scrapers/Fed_Philadelphia.py` | Reads main landing page using Selenium and parses with BeautifulSoup. Extracts data from each individual WP's landing page using Requests and BeautifulSoup and reads PDF metadata using PyPDF2 and io |
| Federal Reserve Bank of Richmond | `roundup_scripts/scrapers/Fed_Richmond.py` | Reads main landing page using Selenium and parses with BeautifulSoup. Extracts data from each individual WP's landing page using Requests and BeautifulSoup and reads PDF metadata using PyPDF2 and io |
| Federal Reserve Bank of San Francisco | `roundup_scripts/scrapers/Fed_SanFrancisco.py` | Uses Requests to access the San Francisco Fed's API for JSON-formatted data on recent publications |
| Federal Reserve Bank of St. Louis | `roundup_scripts/scrapers/Fed_StLouis.py` | Scrapes main landing page using Requests and BeautifulSoup |
| International Monetary Fund | `roundup_scripts/scrapers/IMF.py` | Scrapes RSS feed using Feedparser. Scrapes each individual WP's landing page using Requests and BeautifulSoup |
| National Bureau of Economic Research | `roundup_scripts/scrapers/NBER.py` | Interacts with the NBER API and uses Requests to parse the results |
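Several of the modules above share the same RSS pattern: fetch the feed, pull the title, link, and date from each entry, and return the rows as a data frame. The real scrapers use Feedparser; the sketch below uses only the standard library on an inline feed to illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for an RSS feed such as a central bank's working paper feed.
rss = """<rss version="2.0"><channel>
  <item>
    <title>Monetary policy and bank lending</title>
    <link>https://example.org/wp/1201</link>
    <pubDate>Mon, 11 Mar 2024 00:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

def scrape(feed_xml):
    """Return one dict per feed item (the real scrape() returns a DataFrame)."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "Title": item.findtext("title"),
            "Link": item.findtext("link"),
            "Date": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]

papers = scrape(rss)
```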
The web scrapers are run remotely at 6:40 AM EST daily via the project GitHub Actions workflow located in `.github/workflows/main.yml`. However, the web scrapers may also be run on your local machine. See below for instructions on how to run the project for the first time on your local machine, and on any subsequent run.
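GitHub Actions expresses such a schedule with a cron trigger in the workflow file. Cron expressions run in UTC, so 6:40 AM EST corresponds to 11:40 UTC; the fragment below is illustrative and may differ from the actual trigger in `.github/workflows/main.yml`.

```yaml
# Illustrative only; see .github/workflows/main.yml for the real trigger.
on:
  schedule:
    - cron: '40 11 * * *'  # 11:40 UTC == 6:40 AM EST (shifts under daylight saving)
  workflow_dispatch:       # also allows manual runs from the Actions tab
```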
To run the project for the first time:

1. Clone the repository: `git clone https://github.com/lorae/roundup`
2. Set your working directory to the repository: `cd ~/roundup`
3. Create a virtual environment: `python -m venv .venv`
4. Activate the virtual environment:
   - Using bash/zsh: `source .venv/bin/activate`
   - Using Windows PowerShell: `.venv/Scripts/activate`
5. Install dependencies: `python -m pip install -r requirements.txt`
6. Run the script: `python runall.py`
7. View results: newly seen results will be stored in `historic/weekly_data/YYYY-MM-DD-HHMM.html`, where "YYYY-MM-DD-HHMM" is the date, hour, and minute that you ran the code. The interactive project dashboard at https://roundup.streamlit.app will not be updated unless you commit your changes to the main branch of the project.
On any subsequent run:

1. Set your working directory to the repository: `cd ~/roundup`
2. Activate the virtual environment:
   - Using bash/zsh: `source .venv/bin/activate`
   - Using Windows PowerShell: `.venv/Scripts/activate`
3. Run the script: `python runall.py`
4. View results: "new" results, i.e. those that have not yet appeared in `papers-we-have-seen.txt`, will be stored in `historic/weekly_data/YYYY-MM-DD-HHMM.html`, where "YYYY-MM-DD-HHMM" is the date, hour, and minute that you ran the code. The interactive project dashboard at https://roundup.streamlit.app will not be updated unless you commit your changes to the main branch of the project.
The schematic below illustrates the basic file structure of the project.
roundup/
│
├── .gitignore
├── README.md
├── runall.py # Main project script
│
├── src/
│ ├── compare.py # Helper module for runall script
│ ├── streamlit_app.py # Script for producing Streamlit app
│ ├── scrapers_archive/ # Archival website-specific web scraping modules
│ └── scrapers/ # Website-specific web scraping modules
│ ├── BEA.py
│ ├── BFI.py
│ ├── BIS.py
│ └── ...
│
├── data/
│ ├── papers-we-have-seen.txt # Set of unique IDs for data previously encountered
│ └── papers-we-have-seen-metadata.csv # Processed data goes here
│
├── .github/workflows/
│ └── main.yml # Runs `Daily Run` GitHub Actions workflow
│
└── requirements.txt # Project dependencies
- `roundup`: The project directory.
- `runall.py`: The main script in this project. It loops through each of the modules in `roundup_scripts/scrapers/XXX.py`, running their respective `XXX.scrape()` functions. This collects data from each website, which is then gathered into a large data frame. If an individual `XXX.scrape()` function produces an error, `runall.py` updates that module's entry in `scraper_status.txt` to "off"; if it does not produce an error, `runall.py` updates that entry to "on". After all web scrapers have been run and their data aggregated into a large data frame, `runall.py` invokes the `compare_historic(df)` function from the `roundup_scripts/compare.py` module to isolate only those data which are truly novel. `compare_historic(df)` uses data from `papers_we_have_seen.txt` as the source of historical truth in order to make this determination. Once `compare_historic(df)` has been successfully executed, new date- and time-stamped files are saved as `historic/weekly_data/YYYY-MM-DD-HHMM.csv`, `historic/weekly_data/YYYY-MM-DD-HHMM.txt`, and `historic/weekly_data/YYYY-MM-DD-HHMM.html`, which contain metadata (title, authors, abstract, URL, date published, paper number, and unique paper ID number) on only the working papers that have not previously been scraped by `runall.py`. It also appends these new entries to `historic/papers-we-have-seen-metadata.csv`.
- `README.md`: The document you are currently reading.
- `requirements.txt`: A list of the libraries and modules required to run the project.
- `scraper_status.txt`: A file that lists whether each scraper is operational ("on") or not ("off"). This file is used as input for the Streamlit app (`src/streamlit_app.py`), which displays scraper statuses in the side panel. The scraper status helps visitors of the website determine which web scrapers are currently operational, and it allows project developers to target modules for maintenance. The changing nature of websites means that even the most well-coded web scrapers will fail eventually.
- `historic`: A folder containing data that has been previously scraped in this project.
  - `papers_we_have_seen.txt`: A file that can be considered the main historical record of the project. It tells `compare.py` which papers we have seen and which we haven't by storing the index numbers of all previously seen papers as a Python set. Note that no data is stored here aside from index numbers (a memory-saving feature of the repository).
  - `weekly_data`: A folder containing the data gathered in every scrape of the project. Files are stored in the formats `YYYY-MM-DD-HHMM.csv`, `YYYY-MM-DD-HHMM.txt`, and `YYYY-MM-DD-HHMM.html`, named for the time the code was run. The `.csv` and `.html` files contain the data newly seen in a given run of `runall.py`; this new data can easily be viewed in Microsoft Excel or a browser. The `.txt` files are intended more for reference: they contain only the ID numbers of the novel data.
- `src`: A folder containing all of the code used in the project, except for `runall.py`.
  - `compare.py`: A module used within `runall.py`. Its primary function of interest is `compare_historic(df)`, which takes the most recently scraped data as its only input argument and compares it to the data in `papers_we_have_seen.txt`. It then outputs the unique ID numbers and data corresponding to newly seen papers and saves the new data in the main database of the project, `historic/papers-we-have-seen-metadata.csv`, as well as in local copies, `historic/weekly_data/YYYY-MM-DD-HHMM.csv` and `historic/weekly_data/YYYY-MM-DD-HHMM.txt`.
  - `scrapers`: A folder that contains each of the individual web scraping modules. Each one is customized to a specific website such as BIS, Chicago Fed, NBER, etc., and the scripts are named accordingly. All modules in this folder have analogous functions called `scrape()`, which aggregate data from their respective websites and output data frames. For example, in `runall.py` we can import `BIS` and run `BIS.scrape()` to get the most recent data (formatted as a pandas data frame) from the Bank for International Settlements, or import `NBER` and run `NBER.scrape()` to get a data frame of the most recent data scraped from the National Bureau of Economic Research.
  - `scrapers_archive`: A folder that contains potentially useful archived scraper code.
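The novelty check that `compare_historic(df)` performs reduces to a set difference over unique paper IDs. A minimal sketch, with made-up IDs and without the data-frame plumbing of the real function:

```python
# IDs previously recorded in papers_we_have_seen.txt (stored as a set).
seen = {"BIS1199", "NBER32001"}

# Records from today's scrape, keyed by unique paper ID (illustrative).
scraped = {
    "BIS1199": {"Title": "Old paper"},
    "BIS1201": {"Title": "New paper"},
}

# Keep only papers whose ID is not already in the historical set...
new_papers = {pid: rec for pid, rec in scraped.items() if pid not in seen}

# ...then fold the new IDs back into the historical record.
seen.update(new_papers)
```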