eorliac / patent-crawler

A web crawler based on storm-crawler to gather information linking patents to products

patent-crawler

A web crawler based on Storm-Crawler and News-Crawl to gather information linking patents to products

Prerequisites

Build ES indices

This has to be done every time the topology is restarted.

Run either of the following commands:

curl -L "https://git.io/vaGkv" | bash

~/conf/ES_create_indices.sh

Inject the seeds

storm jar target/patent-crawler-1.0.jar com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector ~/patent-crawler/seeds/ feeds.txt -conf conf/es-conf.yaml -conf conf/crawler-conf.yaml -local

Check the injection: http://localhost:9200/status/_search?pretty
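The same check can be done from the command line. The following is a minimal sketch, assuming Elasticsearch listens on localhost:9200 and that the index is named "status", as in the URL above. The `ES_URL` variable and the helper names `status_url` and `count_seeds` are illustrative assumptions and are not part of the repository. `_count` is Elasticsearch's document count endpoint.

```shell
#!/usr/bin/env bash
# Sketch only: ES_URL and both helper names are assumptions, not part of the repository.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the URL of an endpoint on the "status" index (index name taken from the URL above).
# A trailing slash on ES_URL is dropped so the result never contains "//status".
status_url() {
  printf '%s/status/%s' "${ES_URL%/}" "$1"
}

# Print the number of documents in the status index.
# Requires a running Elasticsearch instance and curl.
count_seeds() {
  curl -s "$(status_url '_count?pretty')"
}
```

After sourcing the snippet, calling `count_seeds` prints a small JSON document. Its `count` field should be greater than zero once the seeds have been injected.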

Run the topology

storm jar target/patent-crawler-1.0.jar ch.epfl.scitas.patentcrawler.CrawlTopology -conf conf/es-conf.yaml -conf conf/crawler-conf.yaml -local

Build & run utility

Alternatively, you can run the following script, which performs all the previous steps:

./build_and_run_patent_crawler.sh
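The contents of build_and_run_patent_crawler.sh are not reproduced here. As a hypothetical sketch of how the three documented steps could be chained, the snippet below wraps each command from the sections above in a function. The `DRY_RUN` switch, the `run` helper, and all function names are assumptions made for illustration. The real script may also build the jar first, which this sketch does not attempt.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; this is NOT the content of build_and_run_patent_crawler.sh.
JAR="target/patent-crawler-1.0.jar"
CONF_ARGS="-conf conf/es-conf.yaml -conf conf/crawler-conf.yaml -local"

# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: (re)create the Elasticsearch indices with the local script.
create_indices() {
  run "$HOME/conf/ES_create_indices.sh"
}

# Step 2: inject the seeds.
# $CONF_ARGS is intentionally unquoted so it splits into separate arguments.
inject_seeds() {
  run storm jar "$JAR" com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector \
    "$HOME/patent-crawler/seeds/" feeds.txt $CONF_ARGS
}

# Step 3: start the crawl topology.
run_topology() {
  run storm jar "$JAR" ch.epfl.scitas.patentcrawler.CrawlTopology $CONF_ARGS
}

# Execute the three steps in order, stopping at the first failure.
run_all() {
  create_indices && inject_seeds && run_topology
}
```

Sourcing the snippet and calling `DRY_RUN=1 run_all` prints the three commands without executing them, which is a cheap way to confirm paths before starting a crawl.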

License: Apache License 2.0

Languages: Java 94.9%, Shell 4.0%, FLUX 1.1%