Innoplexus's starred repositories
WebFundamentals
Former git repo for WebFundamentals on developers.google.com
openlibrary
One webpage for every book ever published!
conceptnet5
Code for building ConceptNet from raw data.
facebook-sdk
Python SDK for Facebook's Graph API
gitinspector
The statistical analysis tool for git repositories
dr-elephant
Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
linkedin-scraper
Scrapes the public profile of a LinkedIn page
ebot
Ebot, an open-source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), Webmachine and MochiWeb. Ebot is written in Erlang and is a very scalable, distributed and highly configurable web crawler. See the wiki pages for more details.
commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
cdx-index-client
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
analyze_ocr
Parse OCR result files for page numbers, tables of contents, etc.
webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.