GalacticExchange/scraper

Oracle Java 8
- sudo add-apt-repository ppa:webupd8team/java
- sudo apt-get update
- sudo apt-get install oracle-java8-installer
- sudo apt-get install oracle-java8-set-default
Maven
- sudo apt-get update
- sudo apt-get install maven
Node.js 6.x and npm
- curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
- sudo apt-get install -y nodejs

Build project
The application needs Consul, Elasticsearch and the Nutch REST API running.
- Consul (https://hub.docker.com/r/progrium/consul/):
  docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
- Elasticsearch (https://hub.docker.com/r/nshou/elasticsearch-kibana/):
  docker run -d -p 9200:9200 -p 9300:9300 -p 5601:5601 nshou/elasticsearch-kibana:kibana4
- Nutch API: run the scraper Docker container.
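Before starting the application it can help to confirm that the three services accept connections. The following is a minimal sketch, not part of the project: it only tests whether the TCP ports are open. The ports come from the docker commands above and from the config example in this README (8081 for Nutch); the host "localhost" and the names port_open and check_services are assumptions made for illustration.

```python
import socket

# Ports are taken from the docker commands and the config example in this README.
# "localhost" is an assumption about where the containers are running.
SERVICES = {
    "Consul HTTP API": ("localhost", 8500),
    "Elasticsearch HTTP": ("localhost", 9200),
    "Elasticsearch transport": ("localhost", 9300),
    "Nutch REST API": ("localhost", 8081),
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

An open port only shows that something is listening; it does not prove the service behind it is healthy.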
For logs, create the folder /usr/local/scraper and give all users write permission to it.
Run the project from the main class (io.gex.scraper.api.Main) with two parameters: path_to_config and -dev.
- Config file example:
    {
      "appId": "1234",
      "webServerPort": 4567,
      "consulHost": "localhost",
      "consulPort": 8500,
      "nutchHost": "http://0.0.0.0",
      "nutchPort": 8081,
      "defScrapArchJob": {
        "urls": null,
        "crawlIndexesHost": "http://index.commoncrawl.org",
        "warcFilesHost": "http://commoncrawl.s3.amazonaws.com/",
        "crawlLinksLimit": null,
        "fromYear": 2017,
        "toYear": 2017,
        "fetchThreadsNum": 32,
        "elastic": {
          "host": "0.0.0.0",
          "port": 9300,
          "clusterName": null,
          "indexName": "scraper",
          "type": "scrap_old_data"
        }
      },
      "defScrapJob": {
        "urls": null,
        "depth": 2,
        "interval": 7200,
        "extractArticle": false,
        "elasticIndexName": "scraper"
      }
    }
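A hand-edited config file is easy to break, so a quick check before launching can save a restart. The sketch below is an illustration rather than the application's own validation: it parses the file as JSON, looks for the top-level keys that appear in the example above, and checks that the port values are plausible. Whether the application actually requires every one of these keys is an assumption, and check_config is a hypothetical helper name.

```python
import json
import sys

# Top-level keys as they appear in the example config. Whether the application
# treats every one of them as mandatory is an assumption.
EXPECTED_KEYS = (
    "appId", "webServerPort", "consulHost", "consulPort",
    "nutchHost", "nutchPort", "defScrapArchJob", "defScrapJob",
)
PORT_KEYS = ("webServerPort", "consulPort", "nutchPort")


def is_valid_port(value):
    """True for an integer in 1..65535 (bool is excluded although it subclasses int)."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 < value < 65536


def check_config(text):
    """Parse config text as JSON and return a list of problems (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]
    for key in PORT_KEYS:
        if key in config and not is_valid_port(config[key]):
            problems.append(f"{key} is not a valid port: {config[key]!r}")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_config.py path_to_config
    with open(sys.argv[1], encoding="utf-8") as handle:
        issues = check_config(handle.read())
    print("\n".join(issues) if issues else "config looks OK")
```

Returning a list of problems instead of raising on the first one lets all issues be reported in a single pass.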
Go to http://0.0.0.0:3000. By default, debug mode starts two web servers: a Java web server on port 4567 and a Node.js web server on port 3000, which proxies the Java web server so that assets can be added dynamically.
