mpiingrams was created by the Databases and Information Systems Department at the Max Planck Institute for Informatics (http://www.mpi-inf.mpg.de). It provides implementations of different methods to compute n-gram statistics using Apache Hadoop. For a detailed description of the methods and an experimental comparison of them on different datasets, please refer to our paper:
Klaus Berberich and Srikanta Bedathur: Computing n-Gram Statistics in MapReduce, In Proceedings of 16th International Conference on Extending Database Technology (EDBT 2013) [PDF]
Our implementation expects its input in a specific format that consists of a dictionary, mapping words to identifiers and their counts, as well as documents compactly encoded as integer sequences.
We provide a tool (`de.mpii.ngrams.io.ConvertText2Input) to convert plain-text data into this format. This converter removes all non-alphanumeric characters from the input, treats white spaces as word separators, and newlines as sentence boundaries.
For other kinds of input data, you'll have to write your own tool to convert it into the format expected by our implementation.
We provide four different implementations (NGSuffixSigma
, NGNaive
, NGAprioriScan
,
NGAprioriIndex
) to compute n-gram statistics. We strongly recommend that you use
NGSuffixSigma
, which has been shown in our experiments to be the most robust
among the methods. All methods expect the same parameters and are invoked as follows:
hadoop jar de.mpii.ngrams.methods.{NGSuffixSigma, NGNaive, NGAprioriScan, NGAprioriIndex} <input> <output> <minimum_support> <maximum_length> <type> <number_of_reducers>
input
the HDFS path with your dataset converted into our integer-sequence formatoutput
the HDFS path where you would like to store the computed n-gram statisticsminimum_support
your choice of the minimum support \taumaximum_length
your choice of the maximum length \sigma (use 0 for non-restricted length)type
indicates whether you want to compute statistics for all (0), closed (1), or maximal (2) n-gramsnumber_of_reducers
the number of reducers that you want to use
Our implementation outputs n-grams as integer sequences. We provide a tool
(de.mpii.ngrams.io.ConvertOutput2Text
) to convert our output to plain text.
Let's assume you have the sample data provided with this code (the top-5 books from Project Gutenberg at the time of writing this documentation) stored in ./data in your HDFS home directory. To convert it using 4 reducers, use:
hadoop jar de.mpii.ngrams.io.ConvertText2Input data input 4
To compute statistics for all n-grams that occur at least five times (i.e., tau = 5) and have length at most ten (i.e., sigma = 10) using our suffix-based method with 4 reducers, invoke:
hadoop jar de.mpii.ngrams.methods.NGSuffixSigma input output 5 10 0 4
To convert the computed n-gram statistics to plain text, use:
hadoop jar de.mpii.ngrams.io.ConvertOutput2Text output input stats
Your statistics are then ready in the ./stats directory and can be viewed using:
hadoop fs -cat stats/part*
If you use our code for research, please cite:
@inproceedings{Berberich2013,
author = {Berberich, Klaus and Bedathur, Srikanta},
title = {{Computing n-Gram Statistics in MapReduce}},
booktitle = {16th International Conference on Extending Database Technology, EDBT '13, Genua, Italy},
year = {2013}
}
mpiingrams by Max-Planck-Institute for Informatics, Databases and Information Systems is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Our software uses the following libraries (included in the ./lib directory), which are both licensed under the Apache 2.0 license (included in the ./licenses directory):
FastUtil (6.5.2) http://fastutil.di.unimi.it
MapDB (0.9) http://www.mapdb.org
The code has been tested on Cloudera CDH3u0. While we expect it to also work with other distributions and/or newer versions of Apache Hadoop, we haven't tested it and cannot provide any support in this direction.