"Using Baselines for Algorithm Audits" submitted to European Data and Computational Journalism Conference, 2017

By Jennifer A Stark and Nick Diakopoulos

The Data

Collecting data

Data were collected automatically once per day by a web scraper, using code based on this project. Images were downloaded, and related information (the web link, the collection datetime, the search term such as Hillary Clinton or Donald Trump, and so on) was stored in a MySQL database hosted on our AWS space. The database was then filtered and downloaded as a CSV file.
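As a rough illustration of the filtering step, the sketch below selects the rows for one search term from a CSV export. The column names (`search_term`, `image_url`, `collected_at`) and the sample rows are hypothetical stand-ins for the fields described above; the real export may be laid out differently. It uses only the standard library, although the same thing could be done with pandas.

```python
import csv
import io

# Hypothetical sample mirroring the fields described above; the column names
# and rows are illustrative assumptions, not the project's real export.
SAMPLE_CSV = """search_term,image_url,collected_at
Hillary Clinton,https://example.com/a.jpg,2016-10-01 09:00:00
Donald Trump,https://example.com/b.jpg,2016-10-01 09:00:00
Hillary Clinton,https://example.com/c.jpg,2016-10-02 09:00:00
"""


def rows_for_term(csv_file, term):
    """Return every row whose search_term column equals `term`."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["search_term"] == term]


# An in-memory file stands in for the downloaded CSV.
clinton_rows = rows_for_term(io.StringIO(SAMPLE_CSV), "Hillary Clinton")
print(len(clinton_rows), "rows for Hillary Clinton")
```

To run this against a real export, pass an open file object instead of the `io.StringIO` wrapper and adjust the column name to match the export's header.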

Processed data

Baseline image processing can be found in the BASELINE directory. Data processing for the image box images, which appear on the main Google search results page, is in IMAGE_BOX.

The analysis is divided into two parts: analysing the images themselves for sentiment using the Microsoft APIs, and analysing the sources of the images (e.g. Business Insider, Breitbart, Salon). The main news source analysis can be found in Statistics.ipynb in the main directory.
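One simple way to summarise image sources is to count how often each host name appears among the collected web links. The sketch below is an assumption about how such a tally could look, not a description of what Statistics.ipynb does; the function name `source_counts` and the example links are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def source_counts(links):
    """Count how many links come from each host name (lower-cased)."""
    return Counter(urlparse(link).netloc.lower() for link in links)


# Made-up links standing in for the stored web links.
links = [
    "https://news-a.example/story-1",
    "https://news-b.example/story-2",
    "https://news-a.example/story-3",
]

counts = source_counts(links)
total = sum(counts.values())
# Share of all images contributed by each source.
shares = {host: n / total for host, n in counts.items()}

for host, n in counts.most_common():
    print(host, n, round(shares[host], 2))
```

Computing the same shares separately for the baseline results and the image box results would give two distributions that can be compared source by source.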

Requirements

Python 3
ipython notebook / Jupyter
pandas
numpy
matplotlib.pyplot
json
shelve
PIL
imagehash
argparse
GoogleScraper

Funding

This project was funded by a grant from the Tow Center for Digital Journalism to study computational and data journalism in the context of algorithmic accountability reporting.

Feedback

Email Jennifer A Stark at starkja@umd.edu

About

MIT License