agravitis / chatnoir2-mapfile-generator

ChatNoir HDFS Map File Generator

ChatNoir Map File Generator

A Hadoop MapReduce tool that maps raw WARC files to HDFS map files. This is the very first step when indexing a new corpus. The map files serve as input to the actual indexer and are later used by the web frontend to retrieve the raw HTML contents of a document.

Compiling the Sources

To build the sources, first check out the webis-uuid repository into a folder called webis-uuid next to this source directory. Then run

gradle shadow

from this source directory to download other third-party dependencies and compile the sources.

The generated shadow (fat) JAR will be in build/libs. The JAR can be submitted to run on a Hadoop cluster. For ease of use, there is a helper script src/scripts/run_on_cluster.sh for starting the mapping process.
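The build steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names has_sibling_uuid_repo and build_shadow_jar are hypothetical and not part of this repository, and the sketch assumes gradle is on the PATH. It checks that webis-uuid sits next to the source directory before running the build, then lists whatever JAR files end up in build/libs.

```shell
#!/usr/bin/env bash
# Sketch only: helper names below are hypothetical, not part of the repository.

# Returns 0 if a folder called webis-uuid exists next to the given source directory.
has_sibling_uuid_repo() {
    local src_dir="$1"
    local abs_dir
    # Resolve to an absolute path so that "." and relative paths work too.
    abs_dir="$(cd "$src_dir" 2>/dev/null && pwd)" || return 1
    [ -d "$(dirname "$abs_dir")/webis-uuid" ]
}

# Runs "gradle shadow" in the source directory and lists the generated JARs.
# Refuses to build if the webis-uuid checkout is missing.
build_shadow_jar() {
    local src_dir="$1"
    if ! has_sibling_uuid_repo "$src_dir"; then
        echo "webis-uuid not found next to $src_dir" >&2
        return 1
    fi
    # Subshell keeps the caller's working directory unchanged.
    (cd "$src_dir" && gradle shadow && ls build/libs/*.jar)
}
```

From the source directory, calling build_shadow_jar "$PWD" would run the build. Checking the layout first turns a missing webis-uuid checkout into an explicit error message before Gradle is started.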

About

License: MIT License


Languages

Java 98.4%, Shell 1.6%