noiano / ARCInputFormat

Packages the ARCInputFormat used in Common Crawl in a small jar file that can be used in MapReduce jobs. Implements HdfsARCSource. See README for details

This project extracts only the ARCInputFormat class and its dependencies from the original commoncrawl project. It also implements a new ARCSource, HDFSSource, which allows ARC files to be read from HDFS.

Differences from the original project:

How to compile

To make sure the library compiles, edit the build.proprieties file and set the hadoop.path variable correctly. Then run:

ant

You'll find ARCInputFormat.jar ready for use.
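
As a rough illustration of the hadoop.path setting mentioned above, the entry might look like the fragment below. This is only a sketch: it assumes the file uses standard Java properties syntax and that hadoop.path should point at the root of a local Hadoop installation. The path shown is hypothetical, so check the build file in the repository for what the variable is actually used for.

```properties
# Hypothetical value: replace with the location of your own Hadoop installation.
# Assumption: the Ant build reads hadoop.path to locate the Hadoop jars.
hadoop.path=/usr/local/hadoop
```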

About

License:Apache License 2.0


Languages

Language:Java 100.0%