geosmart/HadoopSample

Hadoop MapReduce samples based on Hadoop version 1.0.4.

This is a plain Java project, so you only need to create a Java Project in Eclipse and add the required JAR files to your classpath. Use Eclipse's Export function to build a JAR, then follow the steps for each sample below.

WordCount is a simple application that counts the number of occurrences of each word in a given input set.
"data:" the test data is in the testdata/WorldCount directory. Use the following commands to put the test data into the specified HDFS directory and run the program:
1: create directory: hadoop dfs -mkdir /test/input
2: move local files to HDFS: hadoop dfs -moveFromLocal file01 /test/input; hadoop dfs -moveFromLocal file02 /test/input
3: run WordCount MapReduce: hadoop jar YourExportJarFileName.jar WordCount /test/input /test/output
"note:" the output directory (/test/output) must not already exist in HDFS, otherwise the job fails.

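To make the counting idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is not the WordCount class from this project: the class name WordCountSketch, its method names, and the sample lines are assumptions made up for illustration. It only imitates the two phases, a map step that emits (word, 1) pairs and a reduce step that sums them per word.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; it does not use the Hadoop API.
public class WordCountSketch {

    // "Map" step: emit a (word, 1) pair for every whitespace-separated token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" step: sum the counts emitted for each word (TreeMap keeps words sorted).
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs the map step on every line, then reduces all emitted pairs together.
    static Map<String, Integer> count(String... lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // Sample lines are invented for this sketch, not taken from testdata.
        System.out.println(count("Hello World Bye World", "Hello Hadoop Goodbye Hadoop"));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In the real job the framework groups the emitted pairs by key between the two phases and spreads the work over the cluster; the sketch does everything in one process.
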
WordCountV2 is a more complex application that counts the number of occurrences of each word in a given input set.
"data:" the test data is in the testdata/WorldCountV2 directory. Use the following commands to put the test data into the specified HDFS directory and run the program:
1: create directory: hadoop dfs -mkdir /test/input
2: move local files to HDFS: hadoop dfs -moveFromLocal file01 /test/input; hadoop dfs -moveFromLocal file02 /test/input
3: run WordCountV2 MapReduce: hadoop jar YourExportJarFileName.jar WordCountV2 /test/input /test/output
"note:" the output directory (/test/output) must not already exist in HDFS, otherwise the job fails.

WordCountV3 gets its map input from a String array. This demo shows how to write a custom input source for the map phase.

"data:" the data comes directly from a self-defined String array; see MyRecordReader.java for details.
1: run WordCountV3 MapReduce: hadoop jar YourExportJarFileName.jar self.define.input.WordCountV3 /test/input /test/output
"note:" the output directory (/test/output) must not already exist in HDFS, otherwise the job fails.
summary
InputFormat: the Hadoop MapReduce framework uses the InputFormat implementation to obtain the InputSplit and RecordReader objects.
InputSplit: an InputSplit implementation stores the metadata of the data that is handed to the RecordReader.
RecordReader: a RecordReader implementation reads the data for the map task, based on the InputSplit passed to it.
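The relationship between the three roles can be sketched in plain Java as follows. This is an illustration only, under stated assumptions: ArraySplit, ArrayRecordReader, and ArrayInputFormat are hypothetical, simplified stand-ins invented here, not the Hadoop interfaces and not the classes in this project. The data source is a String array, in the spirit of WordCountV3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; the types below only imitate the Hadoop roles.
public class InputPipelineSketch {

    // Stand-in for an InputSplit: holds only metadata (an index range), never the data itself.
    static final class ArraySplit {
        final int start;
        final int end; // exclusive
        ArraySplit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in for a RecordReader: reads exactly the records that one split describes.
    static final class ArrayRecordReader {
        private final String[] data;
        private final ArraySplit split;
        private int pos;
        ArrayRecordReader(String[] data, ArraySplit split) {
            this.data = data;
            this.split = split;
            this.pos = split.start;
        }
        boolean hasNext() {
            return pos < split.end;
        }
        String next() {
            return data[pos++];
        }
    }

    // Stand-in for an InputFormat: produces the splits and a reader for each split.
    static final class ArrayInputFormat {
        private final String[] data;
        ArrayInputFormat(String[] data) {
            this.data = data;
        }
        List<ArraySplit> getSplits(int splitSize) {
            List<ArraySplit> splits = new ArrayList<>();
            for (int start = 0; start < data.length; start += splitSize) {
                splits.add(new ArraySplit(start, Math.min(start + splitSize, data.length)));
            }
            return splits;
        }
        ArrayRecordReader getRecordReader(ArraySplit split) {
            return new ArrayRecordReader(data, split);
        }
    }

    // Plays the framework's part: ask for splits, then read each split with its own reader.
    static List<String> readAll(String[] data, int splitSize) {
        ArrayInputFormat format = new ArrayInputFormat(data);
        List<String> records = new ArrayList<>();
        for (ArraySplit split : format.getSplits(splitSize)) {
            ArrayRecordReader reader = format.getRecordReader(split);
            while (reader.hasNext()) {
                records.add(reader.next());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new String[] {"one", "two", "three"}, 2));
        // prints [one, two, three]
    }
}
```

With a split size of 2, the three records fall into two splits, so the loop in readAll creates two readers; in Hadoop each split would be processed by its own map task.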
note
The package "copy.from.textinputformat" is only a copy of TextInputFormat and its related classes. Use these classes to understand the logical flow of how data is split and how it is read from the cluster.
MyFileInputFormat --> FileInputFormat
MyFileSplit --> FileSplit (number of FileSplits == number of map tasks)
MyLineRecordReader --> LineRecordReader
MyTextInputFormat --> TextInputFormat
You may have noticed that the types of the two parameters of the next function in RecordReader are the same as the types of the first two parameters of the map function in the Map class:
next(LongWritable putKeyIn, Text putValueIn) <===> map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
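The types match because the framework fills a key object and a value object through next and then hands the same two objects to map. The loop can be sketched in plain Java as below. LongHolder, TextHolder, LineReader, and LineMapper are hypothetical stand-ins invented for this sketch (for LongWritable, Text, a line-oriented RecordReader, and the Mapper), and the offset calculation assumes one-byte characters with a single newline per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use the Hadoop API.
public class NextMapSketch {

    static final class LongHolder { long value; }   // stand-in for LongWritable
    static final class TextHolder { String value; } // stand-in for Text

    // Stand-in for a line-oriented RecordReader: key = offset of the line, value = the line.
    static final class LineReader {
        private final String[] lines;
        private int index;
        private long offset;
        LineReader(String[] lines) {
            this.lines = lines;
        }
        // Fills the caller's key and value objects; returns false when no record is left.
        boolean next(LongHolder key, TextHolder value) {
            if (index >= lines.length) {
                return false;
            }
            key.value = offset;
            value.value = lines[index];
            offset += lines[index].length() + 1; // +1 for the assumed newline
            index++;
            return true;
        }
    }

    // Stand-in for the map function: its first two parameter types match next().
    interface LineMapper {
        void map(LongHolder key, TextHolder value, List<String> output);
    }

    // Plays the framework's part: one key and one value object are reused for every record.
    static List<String> run(String[] lines, LineMapper mapper) {
        LineReader reader = new LineReader(lines);
        LongHolder key = new LongHolder();
        TextHolder value = new TextHolder();
        List<String> output = new ArrayList<>();
        while (reader.next(key, value)) {
            mapper.map(key, value, output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> out = run(new String[] {"ab", "cde"},
                (key, value, output) -> output.add(key.value + ":" + value.value));
        System.out.println(out); // prints [0:ab, 3:cde]
    }
}
```

Because the key and value objects are reused, a map function that wants to keep a record beyond the current call has to copy its contents, as the sketch does by building a new String.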
