TransparencyToolkit / CrawlerManager

API for calling crawlers

CrawlerManager

API for running and managing crawlers and parsing results

CrawlerManager can be used together with the Harvester web interface to run queries and load results.

Installing

Install the system dependencies, get the code, and install the Ruby dependencies:

  • Install the system dependencies. On Debian:
    • sudo apt-get install sqlite3 libsqlite3-dev
  • Get the code: git clone https://github.com/TransparencyToolkit/CrawlerManager
  • Install Ruby dependencies: bundle install

Setup

  • Create the databases: rake db:create:all
  • Reset existing databases: rake db:reset

WARNING

Currently, for Harvester to save data, the directories /home/user/Data/KG/ and /home/user/Data/KG/All_Pics/ must exist. This is kludgy and will be configurable soon!
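Until the path is configurable, the directories can be created ahead of time. The following is a minimal Ruby sketch, assuming the paths from the warning are used verbatim. The helper name ensure_harvester_dirs and its base parameter are hypothetical and not part of CrawlerManager.

```ruby
require "fileutils"

# Hypothetical helper (not part of CrawlerManager): creates the directories
# that Harvester currently expects. `base` defaults to the path named in the
# warning; pass another path to try it out somewhere writable.
def ensure_harvester_dirs(base = "/home/user/Data/KG")
  pics = File.join(base, "All_Pics")
  # mkdir_p creates any missing parent directories and does nothing if the
  # directory already exists, so repeated calls are safe.
  FileUtils.mkdir_p(pics)
  [base, pics]
end
```

Calling ensure_harvester_dirs with no arguments creates both directories under /home/user/Data/KG, provided the current user may write there.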

Running CrawlerManager

Additional Configuration

To use proxies, set the environment variable PROXYLIST to the path of the proxy list you want to use.

To solve CAPTCHAs, set the environment variable SOLVERDETAILS to your 2Captcha key.
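As a rough illustration of these two settings, the Ruby sketch below checks whether they are usable before a run. It assumes nothing about how CrawlerManager reads the variables internally. The helper crawler_env_report is hypothetical, and the only check made on PROXYLIST is that it names an existing file, since the proxy list format is not described here.

```ruby
# Hypothetical pre-flight check (not part of CrawlerManager): reports whether
# the optional PROXYLIST and SOLVERDETAILS settings look usable.
# `env` defaults to the process environment; a plain Hash also works.
def crawler_env_report(env = ENV)
  proxylist = env["PROXYLIST"].to_s
  solver    = env["SOLVERDETAILS"].to_s
  {
    # Proxies are usable only if the variable names an existing file.
    proxies: !proxylist.empty? && File.file?(proxylist),
    # CAPTCHA solving needs a non-empty 2Captcha key.
    captcha_solver: !solver.strip.empty?
  }
end
```

Passing a Hash instead of ENV makes it easy to try the check without changing the real environment.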

About

License: GNU General Public License v3.0

