This repo holds ETL/ELT code that scrapes news sites, applies some light transforms, and saves the results to disk (and eventually to a database).
Currently, you can just run the newspaper_pull.py script. To change where your files are saved, edit the saveDir line; to scrape different news sites, update sites.py. The default list is fairly long (~65 sites), so a full run takes 10-15 minutes.
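
As a rough illustration of how the two settings fit together, here is a minimal sketch. It is not the repo's actual code: the SITES list, the save_dir variable, and the output_path and plan_outputs helpers are all hypothetical stand-ins, and the real sites.py and saveDir line may be laid out differently. It only shows the idea of a site list plus a save directory determining where each site's results end up.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical stand-in for the list kept in sites.py; the real file's
# variable name and format are assumptions here.
SITES = [
    "https://www.bbc.com",
    "https://www.reuters.com",
]

# Hypothetical stand-in for the saveDir line in newspaper_pull.py.
save_dir = Path("news_output")


def output_path(site_url: str, base_dir: Path) -> Path:
    """Build a per-site output file path from the site's host name."""
    host = urlparse(site_url).netloc  # e.g. "www.bbc.com"
    return base_dir / f"{host}.json"


def plan_outputs(sites, base_dir):
    """Map each site URL to the file its results would be written to."""
    return {url: output_path(url, base_dir) for url in sites}


if __name__ == "__main__":
    # Print the planned output location for each configured site.
    for url, path in plan_outputs(SITES, save_dir).items():
        print(f"{url} -> {path}")
```

Trimming the site list to a handful of entries like this is also a quick way to do a short test run before committing to the full 10-15 minute pull.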