maniraniyal / BigData

Step By Step guide for Hadoop installation on Ubuntu 16.04.3 with MapReduce example using Streaming

  1. Download VirtualBox from: https://www.virtualbox.org/wiki/Downloads

  2. Download Ubuntu 16.04.3 (desktop version, amd64) from: https://www.ubuntu.com/download/desktop or download directly from: http://mirror.pnl.gov/releases/xenial/ubuntu-16.04.3-desktop-amd64.iso

  3. Create a VM with the Ubuntu 16.04.3 image.

  4. After installing Ubuntu, log in to the VM and follow the instructions given at https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html . The steps below walk through the installation in detail.

  5. First we will update the system's local package index and then install Java (the default JDK). Run the commands below in the terminal.

    sudo apt-get update

    sudo apt-get install default-jdk -y

  6. Now we will install the ssh and rsync packages by running the following commands.

    sudo apt-get install ssh -y

    sudo apt-get install rsync -y

  7. Now download Hadoop 2.7.4 from http://www.apache.org/dyn/closer.cgi/hadoop/common/

  8. Change directory to Downloads, or wherever you downloaded the Hadoop tar file. In my case it is in Downloads, and all further instructions assume the Hadoop tar file is in ~/Downloads. Extract it as shown below.
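
For example, assuming the downloaded archive is named hadoop-2.7.4.tar.gz (adjust the filename to match your download):

    cd ~/Downloads
    tar -xzf hadoop-2.7.4.tar.gz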

  9. Change directory to the extracted folder, as shown below.
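
Assuming the archive extracted to the default hadoop-2.7.4 directory:

    cd ~/Downloads/hadoop-2.7.4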

  10. Update the JAVA_HOME variable in the etc/hadoop/hadoop-env.sh file as shown below.
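
For example, to open the file (gedit ships with the Ubuntu desktop; any editor works):

    gedit etc/hadoop/hadoop-env.sh

Then set JAVA_HOME to the system Java location: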

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

  11. Now you should be able to run Hadoop; check it by running the command below.

bin/hadoop

  12. Now we will update some configuration files for pseudo-distributed operation. First we will edit the etc/hadoop/core-site.xml file as below.

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

  13. Similarly, we will update the etc/hadoop/hdfs-site.xml file as below.

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

  14. Now we will set up passwordless ssh for Hadoop. First check whether passwordless ssh authentication is already set up; on a fresh Ubuntu installation it most likely is not. If passwordless ssh authentication is not set up, please follow the next step; otherwise skip it.

ssh localhost

  15. Run the commands below:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 0600 ~/.ssh/authorized_keys

  16. Now we will start the NameNode and DataNode, but before that we will format the HDFS filesystem, as shown below.
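
From the Hadoop directory, the format and start commands given in the Apache single-cluster guide are:

    bin/hdfs namenode -format
    sbin/start-dfs.sh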

  17. Now we can access the web interface for the NameNode at http://localhost:50070/

  18. Now let's create some directories in the HDFS filesystem, as shown below.
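
For example, to create the per-user HDFS home directory that the later examples assume (replace <username> with your Ubuntu user name):

    bin/hdfs dfs -mkdir /user
    bin/hdfs dfs -mkdir /user/<username>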

  19. Let's download one HTML page, http://hadoop.apache.org, and upload it into the HDFS filesystem.

wget http://hadoop.apache.org -O hadoop_home_page.html
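
To upload the downloaded page into HDFS, something like the following works (the target path assumes the /user/<username> directory created above):

    bin/hdfs dfs -put hadoop_home_page.html /user/<username>/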

Please note that the HDFS filesystem is not the same as the local (root) filesystem.

Grep example:

  1. For this example we are using the hadoop-mapreduce-examples-2.7.4.jar file, which comes along with Hadoop. In this example we count the total number of occurrences of the word 'https' in the given files. First we run the Hadoop job, then copy the results from HDFS to the local filesystem; see the sketch below. We can see that there are 2 occurrences of 'https' in the given file, which we can validate locally against the file downloaded with wget.
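
A sketch of the commands, assuming the uploaded page sits in an HDFS directory named grep_input and the results go to grep_output (both names are illustrative):

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep grep_input grep_output 'https'
    bin/hdfs dfs -get grep_output grep_output
    cat grep_output/*

The result can be cross-checked locally with grep -o 'https' hadoop_home_page.html | wc -l.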

Wordcount example:

  1. For the wordcount example we are also using the hadoop-mapreduce-examples-2.7.4.jar file; a sketch of the invocation follows below. The wordcount example returns the count of each word in the given documents.
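
A sketch of the commands, with wc_input and wc_output as illustrative HDFS directory names:

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount wc_input wc_output
    bin/hdfs dfs -cat wc_output/*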

Wordcount using Hadoop streaming (Python)

  1. Here are the mapper and reducer programs for wordcount; a minimal sketch of each is shown below.
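
The repository contains the actual scripts; the following is a minimal sketch of what a streaming mapper and reducer for wordcount look like (the filenames mapper.py and reducer.py are assumptions):

    #!/usr/bin/env python
    # mapper.py: emit each word with a count of 1, tab-separated.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum the counts for each word. Hadoop streaming sorts
    # mapper output by key, so all lines for a word arrive consecutively.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.strip().split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word = word
            current_count = int(count)

    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))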

  2. We run the program as below and then copy the result to the local filesystem.
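
A sketch of the streaming invocation (the jar path is the standard location in the Hadoop 2.7.4 distribution; the input and output directory names are illustrative):

    bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar \
        -files mapper.py,reducer.py \
        -mapper "python mapper.py" \
        -reducer "python reducer.py" \
        -input wc_input \
        -output streaming_output
    bin/hdfs dfs -get streaming_output streaming_output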
