jg-bernard's starred repositories
memes_pipeline
A meme-processing pipeline that enables tracking memes across multiple Web communities.
WebScraping
SSRMC lectures in Web Scraping, HT 2014
fb_scrape_public
Scrapes posts and comments from public Facebook pages.
news_extract
Python module to extract articles from NexisUni and Factiva.
api-client
Public client for consuming content from the Media Cloud Online News Archive & Directory.
feed_seeker
Find RSS, Atom, XML, and RDF feeds on webpages
date_guesser
A library to extract a publication date from a web page, along with a measure of the accuracy.
nyt-news-labeler
Tag news stories based on models trained on the NYT corpus.
opencorpora
A web-based engine for creating and annotating textual corpora
odie_backend
The admin site and API data source for the Online Discourse Insight Explorer.
corpusbuilder
Corpus Build OCR platform
lumendatabase
The Lumen Database collects and analyzes legal complaints and requests for removal of online materials.
ultimate-sitemap-parser
Ultimate Website Sitemap Parser
sentence-splitter
Text to sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder.
test-lists
URL testing lists intended for discovering website censorship
internet_monitor
The Internet Monitor is a research project to evaluate, describe, and summarize the means, mechanisms, and extent of Internet content controls and Internet activity around the world.