all-seeing-eye

Fighting hate by scraping web advertisements and holding the advertisers accountable.

Short description: create a series of simple tools to scrape the ads on webpages and build a list of the advertisers.

MVP

Create a Selenium or similar tool that can go to a webpage, traverse its pages, pick out the advertisements, screen capture them (preferably cropped), and save them to disk or a database with an associated JSON file of metadata.
Create cloud infrastructure to run multiple instances of the Selenium tool and have them dump to the same bucket or db.
Scrape the target page for a set amount of time or until a certain number of images are collected.
Create a simple Python script to look at the ads and group similar ads (first pass using image stats or MD5, later feature descriptor approaches).
MVP: Once the advertisement corpus is reduced so that there is a single instance of each advertisement, expose the ads to a crowd platform for labeling (e.g. CrowdFlower).
Long term: use OCR and a neural net to recognize text and logos.
MVP: Create a Jupyter notebook to retrieve crowd-labeled data, merge it with the raw collection data, clean it up, and show results.
Publish results in a blog post.
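
The capture and first-pass grouping steps above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names `save_capture` and `group_by_md5`, the `.png` plus `.json` sidecar layout, and the metadata keys are all assumptions. The screenshot bytes would come from the Selenium tool; here they are just passed in. Note that MD5 only groups byte-identical images, so resized or re-encoded copies of the same ad would still need the later feature descriptor pass.

```python
import hashlib
import json
import uuid
from collections import defaultdict
from pathlib import Path


def save_capture(out_dir, image_bytes, metadata):
    """Write one ad screenshot plus a JSON sidecar; return the image path.

    Hypothetical helper: the file layout and metadata keys are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Random stem so repeated captures of the same ad do not overwrite each other.
    stem = uuid.uuid4().hex
    image_path = out_dir / f"{stem}.png"
    image_path.write_bytes(image_bytes)
    (out_dir / f"{stem}.json").write_text(json.dumps(metadata, indent=2))
    return image_path


def group_by_md5(out_dir):
    """Map each MD5 digest to the list of image paths with identical bytes."""
    groups = defaultdict(list)
    for image_path in sorted(Path(out_dir).glob("*.png")):
        digest = hashlib.md5(image_path.read_bytes()).hexdigest()
        groups[digest].append(image_path)
    return dict(groups)
```

Keeping one representative path per digest would give the reduced corpus to send for crowd labeling, while the sidecar files keep the link back to the page each capture came from.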

Pull requests gladly accepted.

See something you can do? Hit me up, write a ticket, go to town.

BSD 3-Clause "New" or "Revised" License