Bin Wang's repositories
hadoop_raspberrypi
Setting up Hadoop on a Raspberry Pi
docker-selenium-hub
Docker image for a Selenium server with headless Firefox
Language: Shell
docker_scrapy
A minimal Scrapy template for fetching the HTML of a list of URLs
Language: Python
getout
A Python library to extract the outlinks of a given URL
Language: Python
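The idea behind outlink extraction can be sketched with the Python standard library alone. This is a minimal illustration, not getout's actual API: the names `extract_outlinks` and `_LinkCollector` are hypothetical, and it assumes "outlinks" means every http(s) hyperlink on the page, resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_outlinks(html, base_url):
    """Return the unique absolute http(s) links found in html, in page order.

    Assumption: relative links are resolved against base_url, and
    non-web schemes such as mailto: are dropped.
    """
    collector = _LinkCollector()
    collector.feed(html)
    outlinks = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https") and absolute not in outlinks:
            outlinks.append(absolute)
    return outlinks
```

The HTML string would typically come from an HTTP fetch of the URL, for example with `urllib.request.urlopen`, before being passed to `extract_outlinks` together with that URL.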
namemapping
A name-mapping library by Dan and Bin that clusters company names using the Yahoo BOSS API
Language: Python
nutch-selenium-grid-plugin
A Nutch 2.2.1 plugin that lets users hand off page retrieval to a Selenium hub/node spoke system. Nutch relies on Selenium/Firefox to fetch pages and load JavaScript content, while remaining in charge of what it does best: crawling and further parsing.
rgetout
An R package to get all the outlinks of a given URL
Language: R