hadoop

Hadoop 2.2.0 for Ubuntu 12.04 LTS x86_64 - This setup was executed on an EC2 m1.small.

Start Hadoop

/usr/local/hadoop/sbin/start-all.sh

Installation

Install Java

sudo add-apt-repository ppa:webupd8team/java  
sudo apt-get update && sudo apt-get upgrade  
sudo apt-get install oracle-java7-installer

Hadoop Installation

Hadoop User

sudo addgroup hadoop
su - hduser
ssh-keygen -t rsa -P ""
sudo adduser --ingroup hadoop hduser
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Hadoop Download

cd /usr/local
wget http://apache.petsads.us/hadoop/core/stable/hadoop-2.2.0.tar.gz
tar -xvf hadoop-2.2.0.tar.gz
mv hadoop-2.2.0 hadoop
sudo chown -R hduser:hadoop hadoop

Setup .bashrc

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$HADOOP_HOME/bin

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

Hadoop Configuration

sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp

Update XML Files

In Hadoop 2.2.0 the XML files are located in: /usr/local/hadoop/etc/hadoop See the contents of the following files in this repository:

core-site.xml
mapred-site.xml
hdfs-site.xml
/usr/local/hadoop/etc/hadoop (log4j.properties)

Prepare Hadoop Input

For this example, I'm performing the "Hello World" of Map Reduce, which is "Word Count". I've downloaded three Project Gutenberg UTF-8 books and placed them in the "/tmp/gutenberg" which I will copy to HDFS "/gutenberg_input".

hadoop dfs -copyFromLocal /tmp/gutenberg /gutenberg_input

Execute Hadoop Job

javac -classpath /home/ubuntu/.m2/repository/org/apache/hadoop/hadoop-core/0.20.2-cdh3u2/hadoop-core-0.20.2-cdh3u2.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar org.myorg.WordCount /gutenberg_input /gutenberg_output

After the job completes, view the output and copy from HDFS:

hls /gutenberg_output

Found 2 items
-rw-r--r--   1 hduser supergroup          0 2013-11-25 23:44 /gutenberg_output/_SUCCESS
-rw-r--r--   1 hduser supergroup     880838 2013-11-25 23:44 /gutenberg_output/part-r-00000

hadoop dfs -getmerge /gutenberg_output /tmp/gutenberg_output
tail -n 400 /tmp/gutenberg_output
wristbangles	1
wristlet,	1
wrists	1
wrists,	1
writ	4
writ.	1
write	62
...

MapReduce I: Word Count in Python

Using the mapper and reducer written in python included in the repository:

mapper.py
reducer.py

We can test they are running properly:

echo "foo foo quuz labs foo bar quuz" | ./mapper.py  | sort -k1,1 | ./reducer.py 
bar	1
foo	3
labs	1
quuz	2

Now we'll execute them with Hadoop. (I've included the streaming 2.2 .jar in the repo as well)

hadoop jar hadoop-streaming-2.2.0.jar -file mapper.py -mapper mapper.py -file reducer.py  -reducer reducer.py -input /gutenberg_input -output /gutenburg_output_python
...
...
hadoop dfs -cat /gutenburg_output_python/part-00000 | more
"(Lo)cra"	1
"1490	1
"1498,"	1
"35"	1
"40,"	1
"A	2
"AS-IS".	1
...

MapReduce II: Inverted Index in Python

Based on the Java Iverted Index (but ghetto simplified) see the python files included in the repo:

inverted_index/mapper.py
inverted_index/reducer.ph
inverted_index/a.txt, b.txt

Test they are running:

cat a.txt b.txt | ./mapper.py | ./reducer.py 
bat b.txt
sat a.txt
mat a.txt
fat b.txt
cat a.txt,b.txt

Copy the files into HDFS

hdfs dfs -mkdir /inverted_index_input
hdfs dfs -put a.txt /inverted_index_input/a.txt
hdfs dfs -put b.txt /inverted_index_input/b.txt
hdfs dfs -ls /inverted_index_input/
-rw-r--r--   1 hduser supergroup         18 2013-11-27 20:01 /inverted_index_input/a.txt
-rw-r--r--   1 hduser supergroup         18 2013-11-27 20:02 /inverted_index_input/b.txt

Now execute the hadoop job:

hadoop jar ../hadoop-streaming-2.2.0.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /inverted_index_input/ -output /inverted_index_output_python

Check the output:

hdfs dfs -ls /inverted_index_input/
-rw-r--r--   1 hduser supergroup          0 2013-11-27 20:03 /inverted_index_output_python/_SUCCESS
-rw-r--r--   1 hduser supergroup         61 2013-11-27 20:03 /inverted_index_output_python/part-00000

hdfs dfs -cat /inverted_index_output_python/part-00000
bat b.txt	
sat a.txt	
mat a.txt	
fat b.txt	
cat a.txt,b.txt

mr-sk / hadoop