crawlspec

A declarative configuration format for defining web archive crawl jobs, plus a reference implementation consisting of a command-line tool and Java library for generating crawler-specific configuration and running crawls.

Status: early days; many things will change. A schema file and documentation will be created once the format is more fleshed out. For now the schema is defined by Job.java.

Crawlers currently supported by the reference implementation: Browsertrix, Heritrix, HTTrack, Wget. Each crawler supports only a subset of the defined options; see the @SupportedBy annotation on each option.
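
For illustration only, such an annotation could be declared roughly as below. This is a hypothetical sketch, not the actual definition in the crawlspec source: the Crawler enum, annotation target, and which crawlers are listed on which fields are all assumptions.

// Hypothetical sketch of a @SupportedBy annotation marking which crawlers
// understand a given option. Not the real crawlspec definitions.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SupportedBy {
    Crawler[] value();
}

enum Crawler { BROWSERTRIX, HERITRIX, HTTRACK, WGET }

class Job {
    // Assumed example: a user agent string can be passed to all four crawlers.
    @SupportedBy({Crawler.BROWSERTRIX, Crawler.HERITRIX, Crawler.HTTRACK, Crawler.WGET})
    String userAgent;

    // Assumed example: robots handling exposed only by some crawlers.
    @SupportedBy({Crawler.HERITRIX, Crawler.WGET})
    String robotsPolicy;
}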

Example job config:

{
  "id": "example-job",
  "schedule": "0 */2 * * 1-5",
  "userAgent": "crawlspec_bot",
  "timeLimit": 3600,
  "sizeLimit": 100000000,
  "robotsPolicy": "obey",
  "seeds": [
    {
      "url": "https://example.org/",
      "scope": "domain",
      "robotsPolicy": "ignore"
    },
    {
      "url": "https://www.example.com/foo/bar.html"
    },
    {
      "url": "https://department.example/publications/annual-report-2020",
      "scope": "page"
    }
  ],
  "cookies": [
    {
      "domain": "example.org",
      "path": "/",
      "name": "sessionid",
      "value": "9945453f7ca22",
      "secure": true,
      "httpOnly": false,
      "expirationDate": 1623766184
    }
  ]
}
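
A config like the one above is plain JSON, so it can be read with any JSON library. The following is a minimal sketch using Jackson to inspect the file; it makes no claim about the crawlspec library's actual API, which parses the file into Job.java.

// Sketch: load the example job config with Jackson and print a few fields.
// Illustrative only; the real library maps this onto Job.java.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class ReadJobConfig {
    public static void main(String[] args) throws Exception {
        JsonNode job = new ObjectMapper().readTree(new File("myjob.json"));
        System.out.println("Job id: " + job.get("id").asText());
        for (JsonNode seed : job.get("seeds")) {
            String scope = seed.has("scope") ? seed.get("scope").asText() : "default";
            System.out.println("  seed: " + seed.get("url").asText() + " (scope: " + scope + ")");
        }
    }
}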

Expected usage (not yet functional):

# Build a Heritrix job (crawler-beans.cxml, seeds.txt, etc.)
crawlspec build myjob.json --crawler heritrix --output /heritrix/jobs/myjob/

# Immediately run a crawl job
crawlspec run --crawler wget myjob.json

# Check job config is valid and options are supported by the given crawler
crawlspec validate --crawler wget myjob.json

License: Apache License 2.0
