govau / wofg-web-filters

Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service

Overview

These filters were originally developed with Funnelback for both in-crawl and post-crawl filtering of data gathered during a Whole-of-Australian-Government (WofG) web crawl.

Pre-gather workflow tasks generate mappings from domains to portfolios (drawn from the Australian Government Organisation Register) and augment those mappings with data from other external sources.
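As a rough illustration of the domain-to-portfolio mapping step, the following minimal sketch builds a lookup table from CSV-style rows and resolves a host by walking up its parent domains. The two-column `domain,portfolio` layout, the sample rows, and the helper names `buildPortfolioMap` and `portfolioFor` are assumptions made for this example; they are not taken from the repository or from the actual register export.

```groovy
// Hypothetical register extract: the "domain,portfolio" layout is assumed for illustration
def registerCsv = '''\
domain,portfolio
example.gov.au,Example Portfolio
data.example.gov.au,Example Portfolio
sample.gov.au,Sample Portfolio
'''

// Build a domain -> portfolio map, skipping the header row and blank lines
Map<String, String> buildPortfolioMap(String csv) {
    csv.readLines()
       .drop(1)
       .findAll { it.trim() }
       .collectEntries { line ->
           def (domain, portfolio) = line.split(',', 2)*.trim()
           [(domain.toLowerCase()): portfolio]
       }
}

// Resolve a host by trying it first, then each parent domain in turn
// (www.example.gov.au -> example.gov.au -> gov.au)
String portfolioFor(Map<String, String> mapping, String host) {
    def parts = host.toLowerCase().tokenize('.')
    while (parts.size() > 1) {
        def candidate = parts.join('.')
        if (mapping.containsKey(candidate)) return mapping[candidate]
        parts = parts.drop(1)
    }
    return null
}

def mapping = buildPortfolioMap(registerCsv)
```

With this sketch, `portfolioFor(mapping, 'www.example.gov.au')` falls back to the `example.gov.au` entry, and a host with no matching parent domain yields `null`.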

Post-gather, several content checks are run. These are written in Groovy, and are run with Funnelback's filter framework. Tools for splitting WARC files are also included at this stage.
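To give a sense of what a post-gather content check might look like, here is a minimal sketch in plain Groovy. It deliberately does not use Funnelback's filter API, and the specific checks (a non-empty `<title>` and a `lang` attribute on `<html>`) as well as the function name `checkHtml` are hypothetical examples rather than the checks shipped in this repository.

```groovy
// Hypothetical content check in plain Groovy; not tied to Funnelback's filter API
Map<String, Object> checkHtml(String html) {
    // Capture the text of the first <title> element, if any (case-insensitive, spans newlines)
    def titleMatch = (html =~ /(?is)<title[^>]*>(.*?)<\/title>/)
    def title = titleMatch.find() ? titleMatch.group(1).trim() : ''

    // Detect a lang attribute on the opening <html> tag
    def hasLang = (html =~ /(?i)<html[^>]*\slang\s*=/).find()

    [hasTitle: !title.isEmpty(), title: title, hasLang: hasLang]
}

def result = checkHtml('<html lang="en"><head><title> Home </title></head></html>')
```

Regex-based checks like this are only suitable for coarse signals; a real check on messy crawled HTML would more likely rely on an HTML parser.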

After filtering, metadata is written to JSON for ingestion into Elasticsearch.
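The JSON output step could be sketched as follows, using Groovy's built-in `groovy.json.JsonOutput`. This example assumes the metadata is loaded through Elasticsearch's bulk endpoint, which takes newline-delimited JSON with an action line before each document; the source does not say which ingestion method is used. The record fields, the index name `wofg-pages`, and the helper `toBulkNdjson` are all hypothetical.

```groovy
import groovy.json.JsonOutput

// Hypothetical metadata records; field names are illustrative only
def records = [
    [url: 'https://example.gov.au/', portfolio: 'Example Portfolio', hasTitle: true],
    [url: 'https://sample.gov.au/about', portfolio: 'Sample Portfolio', hasTitle: false]
]

// Render records as newline-delimited JSON: one action line, then one document line,
// with a trailing newline as the bulk format requires
String toBulkNdjson(List<Map> docs, String indexName) {
    docs.collect { doc ->
        def action = JsonOutput.toJson([index: [_index: indexName, _id: doc.url]])
        action + '\n' + JsonOutput.toJson(doc)
    }.join('\n') + '\n'
}

// The index name is an assumption made for this sketch
def ndjson = toBulkNdjson(records, 'wofg-pages')
```

Using the page URL as the document `_id` is one possible design choice: re-running the export would then overwrite existing documents instead of duplicating them.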

About

License: MIT License


Languages

Groovy 92.4%, Shell 7.6%