A web crawler, based on Storm-Crawler and News-Crawl, that gathers information linking patents to products.
- Install Apache Storm 1.0.3
- Install ElasticSearch 2.4.1
- Install Kibana 4.6.1
- Clone and compile [https://github.com/DigitalPebble/storm-crawler] with
mvn clean install
- Start ES and Storm
- Create the ES indices (this has to be done every time the topology is restarted):
curl -L "https://git.io/vaGkv" | bash
or ~/conf/ES_create_indices.sh
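The index creation above only works once Elasticsearch is reachable. A minimal sketch of a wait loop that could be run first is shown here; the function name `wait_for_es`, the default URL `http://localhost:9200` and the retry count are assumptions for illustration, not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch: poll Elasticsearch until it answers, or give up after N attempts.
# Assumed defaults: ES on http://localhost:9200, 30 attempts, 1 second apart.
wait_for_es() {
  local url="${1:-http://localhost:9200}"
  local retries="${2:-30}"
  local i
  for ((i = 0; i < retries; i++)); do
    # -s: silent, -f: treat HTTP errors as failures, -o: discard the body
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

For example, `wait_for_es && ~/conf/ES_create_indices.sh` would run the index script only after ES has responded.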
- Inject the seeds:
storm jar target/patent-crawler-1.0.jar com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector ~/patent-crawler/seeds/ feeds.txt -conf conf/es-conf.yaml -conf conf/crawler-conf.yaml -local
Check the injection: [http://localhost:9200/status/_search?pretty]
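The same check can be done from the command line. The sketch below asks Elasticsearch's `_count` endpoint how many documents the `status` index holds; the helper names `extract_count` and `status_count` are hypothetical, and the host and port are taken from the URL above.

```shell
#!/usr/bin/env bash
# Sketch: count the documents in the "status" index after seed injection.

# Pull the numeric "count" field out of a _count response read from stdin.
# Uses sed only, so no JSON tool has to be installed.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Query the status index on the local ES instance (assumed localhost:9200).
status_count() {
  curl -s "http://localhost:9200/status/_count" | extract_count
}
```

Calling `status_count` after the injection prints the number of documents in the index; a value of zero or an empty result suggests the seeds were not injected.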
- Launch the crawl topology:
storm jar target/patent-crawler-1.0.jar ch.epfl.scitas.patentcrawler.CrawlTopology -conf conf/es-conf.yaml -conf conf/crawler-conf.yaml -local
Alternatively, you can run the following script that does all the previous steps:
./build_and_run_patent_crawler.sh