HBaseWordCount: Cloud Computing Project 4

Write an HBase WordCount program that counts the occurrences of every unique term in the ClueWeb09 dataset and loads the result into the HBase table WordCountTable. Each row in the column family "frequencies" is unique: the row key is the term, stored as bytes; the column name is "count"; and the value is the term's frequency across all documents. Figure 1 shows the schema of WordCountTable. You will compare the output of your finished run against a correct version that we will supply.
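To make the row layout concrete, here is a minimal sketch in plain Java that builds the pieces of one WordCountTable row without touching the HBase client. The class name WordCountRowSketch, the UTF-8 encoding of the term, and the 4-byte big-endian encoding of the count are assumptions made for illustration; the assignment only says that the row key is stored in byte format, so check the supplied source code for the encoding it actually expects.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical in-memory illustration of one WordCountTable row.
// No HBase classes are used; this only shows how the cell parts could be encoded.
final class WordCountRowSketch {
    static final String FAMILY = "frequencies"; // column family named in the assignment
    static final String QUALIFIER = "count";    // column name named in the assignment

    // Row key: the unique term as bytes (UTF-8 is an assumption).
    static byte[] rowKey(String term) {
        return term.getBytes(StandardCharsets.UTF_8);
    }

    // Cell value: the term frequency as a 4-byte big-endian int (an assumption).
    static byte[] encodeCount(int count) {
        return ByteBuffer.allocate(4).putInt(count).array();
    }

    // Inverse of encodeCount, useful when checking results read back from the table.
    static int decodeCount(byte[] value) {
        return ByteBuffer.wrap(value).getInt();
    }

    public static void main(String[] args) {
        byte[] key = rowKey("hbase");
        byte[] value = encodeCount(42);
        System.out.println(new String(key, StandardCharsets.UTF_8)
                + " " + FAMILY + ":" + QUALIFIER + " = " + decodeCount(value));
    }
}
```

Running main prints one line of the form term, then family:qualifier, then the decoded count, which mirrors how a single cell of the table is addressed.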

Introduction

WordCount is a simple program that counts the number of occurrences of each word in a text input dataset. It fits the map/reduce programming model well, which makes it a good example for learning the Hadoop MapReduce programming style. Instead of loading the data from HDFS, we will read it directly from existing HBase records; the content is stored in a similar structure on both HBase and HDFS.
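The map and reduce steps of WordCount can be sketched in plain Java, with no Hadoop or HBase dependency, to show the data flow. The class name WordCountFlowSketch, the lowercasing, and the tokenization on non-word characters are assumptions for illustration only. In the actual job the input would come from HBase rows rather than an in-memory list, typically through HBase's TableMapper and TableReducer classes, and the homework source code may tokenize differently.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone illustration of the WordCount map/reduce data flow.
final class WordCountFlowSketch {
    // Map step: emit a (term, 1) pair for every token in one document's text.
    static List<Map.Entry<String, Integer>> map(String documentText) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // Tokenization rule (lowercase, split on non-word characters) is an assumption.
        for (String token : documentText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the emitted counts for each term.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the map step over every document, then one reduce over all emitted pairs.
    static Map<String, Integer> countWords(List<String> documents) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String document : documents) {
            allPairs.addAll(map(document));
        }
        return reduce(allPairs);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("HBase stores rows", "Hadoop reads HBase rows");
        // prints {hadoop=1, hbase=2, reads=1, rows=2, stores=1}
        System.out.println(countWords(docs));
    }
}
```

Each entry of the final map corresponds to one row of the output table: the key becomes the row key and the total becomes the value stored under the "count" column.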

In this homework and the next one (Building an Inverted Index), we use the same source code, which can be found in /root/MoocHomeworks/HBaseWordCount.

References

Clueweb09 dataset. http://lemurproject.org/clueweb09/.
Hadoop WordCount. http://salsahpc.indiana.edu/csci-b649-spring-2014/projects/project1.html.
HBase MapReduce Examples. http://hbase.apache.org/book/mapreduce.example.html.

