Apache Hadoop Components Installation Guide on Windows


Apache Hadoop Installation

  • Download Java JDK 8 (v1.8.0_291).
  • Download the Hadoop binary (v3.3.0, the latest version at the time of writing).
  • Create a new folder named Hadoop in the directory where you want to keep everything related to Hadoop & extract the Hadoop binary into it.
  • Setting Environment Path Variables:
    • Set Variable as JAVA_HOME & Value as <Java Root Path>.
    • Set Variable as HADOOP_HOME & Value as <Hadoop Root Path>.
    • Add following paths to Path Variable:
      • <Java Bin Path>
      • <Hadoop Bin Path>
      • <Hadoop Sbin Path>
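These variables can also be set from an elevated Command Prompt using setx, as in the sketch below; the paths are examples & must be changed to match your own system:

:: Example paths only; substitute your own JDK & Hadoop locations. Avoid paths
:: containing spaces (e.g. Program Files) for JAVA_HOME, since they can break
:: Hadoop's .cmd scripts.
setx /M JAVA_HOME "C:\Java\jdk1.8.0_291"
setx /M HADOOP_HOME "E:\Rohit\Hadoop\hadoop"
:: setx may truncate long Path values, so add %JAVA_HOME%\bin, %HADOOP_HOME%\bin
:: & %HADOOP_HOME%\sbin to Path through the Environment Variables dialog instead.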
  • Check if Java is installed properly by running the following commands:
    • javac
    • java -version
  • Make a new folder named data in the root directory of Hadoop, then:
    • Make a new folder named datanode inside the data folder.
    • Make a new folder named namenode inside the data folder.
  • Edit the following 4 Hadoop configuration files located in etc/hadoop/:
    • core-site.xml
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
            <!-- <value>hdfs://0.0.0.0:19000</value> -->
        </property>
    
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:///E:/Rohit/Hadoop/hadoop/tmp/hadoop-${user.name}</value>
        </property>
    </configuration>
    
    • mapred-site.xml
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
    
    • yarn-site.xml
    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
    </configuration>
    
    • hdfs-site.xml
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:///E:/Rohit/Hadoop/hadoop/data/namenode</value>
        </property>
    
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:///E:/Rohit/Hadoop/hadoop/data/datanode</value>
        </property>
    </configuration>
    
  • Download the Windows support files (winutils.exe & hadoop.dll, built for your Hadoop version) & add them to the Hadoop bin folder.
  • Format the NameNode (required once, before the first start) by running: hdfs namenode -format
  • Start Hadoop by opening a Terminal as Administrator & running the following command:
    • start-all.cmd (or start-dfs.cmd & start-yarn.cmd)
  • Check that all the Hadoop daemons (DataNode, NameNode, NodeManager & ResourceManager) are running with the following command (illustrative output below):
    • jps (Java Virtual Machine Process Status Tool)
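If all the daemons came up cleanly, jps should list each of them; the output below is only illustrative (process IDs will differ on every machine):

4732 NameNode
5120 DataNode
6244 ResourceManager
6518 NodeManager
7075 Jps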
  • To access the Web UIs, open a browser & go to:
    • localhost:9870: NameNode Information
    • localhost:9864: DataNode Information
    • localhost:8088: Resource Manager (YARN)
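With the daemons running, HDFS can be smoke-tested from the same terminal; E:\input.txt below is a placeholder for any small local file:

:: Create a directory tree in HDFS
hadoop fs -mkdir -p /user/test
:: Copy a local file into HDFS (E:\input.txt is a placeholder)
hadoop fs -put E:\input.txt /user/test/
:: List the directory to confirm the upload
hadoop fs -ls /user/test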
  • Stop Hadoop by running the following command:
    • stop-all.cmd (or stop-dfs.cmd & stop-yarn.cmd)

Apache HBase Installation

  • Download the HBase binary (v2.3.5, the latest version at the time of writing).
  • Preferably extract HBase in the same directory where Hadoop resides.
  • Make new folders named hbase & zookeeper in the root directory of HBase.
  • Open the hbase.cmd file placed in the <HBase Bin> folder & then:
    • Search for the java_arguments variable.
    • Remove %HEAP_SETTINGS% from the right-hand side of its assignment.
  • Open the hbase-env.cmd file placed in the <HBase Conf> folder & add the following lines:
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true

set HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false"

set HBASE_MASTER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10101"
set HBASE_REGIONSERVER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10102"
set HBASE_THRIFT_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10103"
set HBASE_ZOOKEEPER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10104"
set HBASE_REGIONSERVERS=%HBASE_HOME%\conf\regionservers
set HBASE_LOG_DIR=%HBASE_HOME%\logs
set HBASE_IDENT_STRING=%USERNAME%
set HBASE_MANAGES_ZK=true
  • Open the hbase-site.xml file placed in the <HBase Conf> folder & add the following lines inside the <configuration> tag:
<property>
    <name>hbase.rootdir</name>
    <value>file:///E:/Rohit/Hadoop/HBase/hbase</value>
</property>
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/E:/Rohit/Hadoop/HBase/zookeeper</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
</property>
  • Setting Environment Path Variables:
    • Set Variable as HBASE_HOME & Value as <HBase Root Path>.
    • Set Variable as HBASE_BIN_PATH & Value as <HBase Bin Path>.
    • Add <HBase Bin Path> path to Path Variable.
  • Start HBase by opening a Terminal as Administrator & running the following commands:
    • start-all.cmd (Hadoop)
    • start-hbase.cmd (HBase)
  • To interact with HBase, run the following command: hbase shell (a sample session is sketched below).
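A minimal shell session to sanity-check the setup, assuming a throwaway table named demo (all names & values below are placeholders):

create 'demo', 'cf'                          # create table 'demo' with one column family 'cf'
put 'demo', 'row1', 'cf:greeting', 'hello'   # write a single cell
scan 'demo'                                  # read the table back
disable 'demo'                               # a table must be disabled before it can be dropped
drop 'demo'                                  # remove the throwaway table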
  • Stop HBase by running the following command: stop-hbase.cmd.

Apache Hive Installation

  • Download the Apache Derby binary (v10.14.2.0, the latest version at the time of writing); this relational database hosts the Hive Metastore (where all Hive metadata is stored).
  • Preferably extract Derby in the same directory where Hadoop resides.
  • Download Cygwin (v3.2.0, the latest version at the time of writing) & install it.
  • Download the Hive binary (v3.1.2, the latest version at the time of writing).
  • Preferably extract Hive in the same directory where Hadoop resides.
  • Setting Environment Path Variables:
    • Set Variable as HIVE_HOME & Value as <Hive Root Path>.
    • Set Variable as DERBY_HOME & Value as <Derby Root Path>.
    • Set Variable as HIVE_LIB & Value as <Hive Lib Path>.
    • Set Variable as HIVE_BIN & Value as <Hive Bin Path>.
    • Set Variable as HADOOP_USER_CLASSPATH_FIRST & Value as true.
    • Add following paths to Path Variable:
      • <Derby Bin Path>
      • <Hive Bin Path>
  • Copy all the files from the Derby Lib folder to the Hive Lib folder.
  • Create a new file named hive-site.xml in the <Hive Conf> folder & add the following lines:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
	<property>
		<name>javax.jdo.option.ConnectionURL</name> 
		<value>jdbc:derby://localhost:1527/metastore_db;create=true</value> 
		<description>JDBC connect string for a JDBC metastore</description>
	</property>
	<property> 
		<name>javax.jdo.option.ConnectionDriverName</name> 
		<value>org.apache.derby.jdbc.ClientDriver</value> 
		<description>Driver class name for a JDBC metastore</description>
	</property>
	<property> 
		<name>hive.server2.enable.doAs</name> 
		<description>Enable user impersonation for HiveServer2</description>
		<value>true</value>
	</property>
	<property>
		<name>hive.server2.authentication</name> 
		<value>NONE</value>
		<description>Client authentication types. NONE: no authentication check; LDAP: LDAP/AD based authentication; KERBEROS: Kerberos/GSSAPI authentication; CUSTOM: custom authentication provider (use with property hive.server2.custom.authentication.class).</description>
	</property>
	<property>
		<name>datanucleus.autoCreateTables</name>
		<value>true</value>
	</property>
</configuration>
  • Download the extra .cmd files for Windows support & replace them in the Hive bin directory along with its sub-directories.
  • Replace the Hive guava-19.0.jar stored in Hive Lib with Hadoop’s guava-27.0-jre.jar found in hadoop\share\hadoop\hdfs\lib.
  • Make new directories at the following locations:
    • E:\cygdrive
    • C:\cygdrive
  • Open the Terminal as Administrator and execute the following commands to create the symbolic links (directory junctions):
    • mklink /J E:\cygdrive\e\ E:\
    • mklink /J C:\cygdrive\c\ C:\
  • Start Derby by opening a Terminal as Administrator & running the following command: StartNetworkServer -h 0.0.0.0
  • Open the Cygwin utility, execute the following command: cygstart ~/.bashrc, & add the following lines:
export HADOOP_HOME='/cygdrive/e/Rohit/Hadoop/hadoop'
export PATH=$PATH:$HADOOP_HOME/bin
export HIVE_HOME='/cygdrive/e/Rohit/Hadoop/hive'
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*.jar
  • Comment out the 2 lines in the hive-schema-3.1.0.derby.sql file (in the hive\scripts\metastore\upgrade\derby folder) containing:
    • Line 1: CREATE FUNCTION "APP"."NUCLEUS_ASCII" ...
    • Line 2: CREATE FUNCTION "APP"."NUCLEUS_MATCHES" ...
  • Inside the Cygwin utility, go to the Hive bin folder with cd $HIVE_HOME/bin & run the command schematool -dbType derby -initSchema to initialize the Hive Metastore.
  • Start Hive by opening a Terminal as Administrator & running the following commands (a sample HiveQL session follows this list):
    • start-all.cmd (Hadoop)
    • hdfs dfsadmin -safemode leave (disabling Hadoop's Safe Mode)
    • hive --service hiveserver2 start (HiveServer2 service)
    • hive (Apache Hive)
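Once the Hive prompt appears, a short HiveQL session like the sketch below can confirm that the Metastore & MapReduce-on-YARN are working; the table name demo & the file E:/demo.csv are placeholders:

-- Create a simple comma-delimited managed table
CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load a local CSV file into the table (path is a placeholder)
LOAD DATA LOCAL INPATH 'E:/demo.csv' INTO TABLE demo;

-- An aggregate query; this should launch a MapReduce job on YARN
SELECT COUNT(*) FROM demo;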

Apache Pig Installation

  • Download the Pig binary (v0.17.0, the latest version at the time of writing).
  • Note: Apache Pig v0.17.0 targets Hadoop 2.x & has some compatibility issues with Hadoop 3.x.
  • Preferably extract Pig in the same directory where Hadoop is residing.
  • Setting Environment Path Variables:
    • Set Variable as PIG_HOME & Value as <Pig Root Path>.
    • Add following path to Path Variable: <Pig Bin Path>
  • Change HADOOP_BIN_PATH from %HADOOP_HOME%\bin to %HADOOP_HOME%\libexec in the pig.cmd file located in the Pig Bin folder.
  • To check if Pig is installed properly, run the command: pig -version.
  • Pig Latin statements can be run in two modes (a sample script follows this list):
    • Local: all scripts are executed on a single machine without requiring Hadoop (command: pig -x local).
    • MapReduce: scripts are executed on a Hadoop cluster (command: pig -x mapreduce).
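As a quick check in local mode (pig -x local), a Pig Latin script along the lines of the sketch below can be run; the input file E:/demo.csv & its two-column schema are placeholders:

-- Load a local comma-separated file with an assumed (id, name) schema
emp = LOAD 'E:/demo.csv' USING PigStorage(',') AS (id:int, name:chararray);
-- Keep only the rows whose id exceeds 10
filtered = FILTER emp BY id > 10;
-- Print the surviving rows to the console
DUMP filtered;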

Note:

  • All the custom paths mentioned above have to be configured according to your own system.
  • All the above installation steps were collected from various sources available on the internet; I have just compiled them together here.
  • This guide may not be updated for later versions or for other Apache components.
  • If there are any issues, please contact me through Email, or create a pull request if you want to contribute.