This is a quick and dirty script designed to scrape some data from allsides.com, which tracks media bias. The website does not keep a handy historical archive for study, but it is possible to track its bias ratings over time by looking at copies of the site archived on the Wayback Machine. This script accepts an Excel spreadsheet of Wayback Machine URLs, downloads each archived page, and parses out the names of the news sources and their bias ratings. It outputs the results to a CSV file.
- Python 3.x
- python-fire (install with `pip install fire`)
- python-tabulator (install with `pip install tabulator`)
- An Excel file with input data and the headers `url` and `date` in the first row. The `url` should be something like `https://web.archive.org/web/20120830002424/http://allsides.com`. Dates can be in any format. They will be converted to `YYYY-MM-DD` in the output.
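
Wayback Machine snapshot URLs like the one above embed a 14-digit capture timestamp, which can be handy for sanity-checking the date column of the input file. The following is a minimal sketch of that idea, not code taken from `bias_scraper.py`; the function name `snapshot_date` and the regular expression are assumptions made for illustration.

```python
import re
from datetime import datetime

# Wayback Machine snapshot URLs embed a 14-digit capture
# timestamp in the form YYYYMMDDhhmmss right after "/web/".
WAYBACK_TS = re.compile(r"web\.archive\.org/web/(\d{14})")


def snapshot_date(url):
    """Return the capture date of a Wayback Machine URL as YYYY-MM-DD.

    Hypothetical helper: returns None when the URL carries no timestamp.
    """
    match = WAYBACK_TS.search(url)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.strftime("%Y-%m-%d")


print(snapshot_date("https://web.archive.org/web/20120830002424/http://allsides.com"))
# prints 2012-08-30
```

Comparing this value with the date entered in the spreadsheet is one way to catch rows where the two have drifted apart.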
The script is run from the command line as follows:
python bias_scraper.py --source=path_to_excel_file --output=path_to_output_csv_file
If you are working in this repo's local folder, the command will be
python bias_scraper.py --source=sample.xlsx --output=test.csv
Other command flags are
- `--help`: A list of all arguments and their usage
- `--save`: If `False`, the script will output the CSV data to the terminal but will not save it as a file. Default is `True`.
- `--mode`: If `a`, the script will append the output to a previously existing output file instead of creating a new one. Default is `w`.
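
To make the save and mode behavior concrete, here is a rough sketch of how the two flags could interact when writing results. It is an illustration under stated assumptions, not the script's actual implementation: the function name `write_rows`, the column names, and the choice to write a header row only in `w` mode are all hypothetical.

```python
import csv
import sys


def write_rows(rows, output, save=True, mode="w"):
    """Write result rows to a CSV file, or to the terminal if save is False.

    Hypothetical helper; the column names below are assumed.
    """
    header = ["date", "source", "bias"]
    if not save:
        # save=False: print the CSV data instead of creating a file.
        writer = csv.writer(sys.stdout)
        writer.writerow(header)
        writer.writerows(rows)
        return
    # newline="" lets the csv module control line endings itself.
    with open(output, mode, newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if mode == "w":
            # Assumption: a fresh file gets a header row,
            # while appended output ("a") does not repeat it.
            writer.writerow(header)
        writer.writerows(rows)
```

Calling `write_rows(rows, "test.csv", mode="a")` on an existing file would then add the new rows below the old ones, which mirrors what appending is described as doing above.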