avaiyang / Hadoop-HDFS-MapReduce

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Hadoop-HDFS-MapReduce

Write a MapReduce code to count phrases (pair of words, taken sequentially) in lines of text as follows. The default input key for the TextInput format is the line number. In this program , the required output is the count of word pairs.

NOTE: punctuation will NOT count; so the words is ‘(1991)’ and ‘1991’ are the same. Therefore, you must parse your file: remove all characters not in this set: [a-z, A-Z, 0-9] ; ‘the-cat’ and ‘the cat’ should be counted as the same pair.

e.g. line:

Input: (k,v) = (1,“the cat in the hat is the best cat in the hat”)

Output: (k2,v2) = (“the cat”, 1), (“cat in”,2),(‘in the’,2),(‘the hat’,2’),(‘hat is’,1), etc

If the line contains a single word, then that becomes the ‘pair’. For this code,use the input file: adventures.txt

Note: The code MUST run in Hadoop.

About


Languages

Language:Python 100.0%