The HDFS-related services are `NameNode`, `SecondaryNameNode` and `DataNode`.
The YARN-related services are `ResourceManager`, `NodeManager` and `WebAppProxy`.
If MapReduce is to be used, then the MapReduce Job History Server will also be running.
You should run the following services on a central server:

- `NameNode` (running under system user `hdfs`)
- `ResourceManager` (running under system user `yarn`)
- Job History Server (running under system user `mapred`)
You should run the following services on one or more node servers:

- `DataNode` (running under system user `hdfs`)
- `NodeManager` (running under system user `yarn`)
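As a quick reference, the service-to-user mapping from the lists above can be sketched as a small shell helper (the `service_user` function is our own, for illustration only):

```shell
# Sketch: map each Hadoop service to the system user it runs under
# (summarizes the central/node server lists above; not part of Hadoop itself)
service_user() {
  case "$1" in
    NameNode|SecondaryNameNode|DataNode) echo hdfs ;;
    ResourceManager|NodeManager|WebAppProxy) echo yarn ;;
    JobHistoryServer) echo mapred ;;
    *) echo unknown ;;
  esac
}

service_user NameNode         # -> hdfs
service_user NodeManager      # -> yarn
service_user JobHistoryServer # -> mapred
```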
Hadoop requires a number of system users to operate more safely. In particular, you need three system users (`hdfs`, `yarn` and `mapred`) and one system group (`hadoop`). You can set up all these users and groups on a Debian-like machine using the following commands:
```shell
addgroup --system hadoop
adduser --system --no-create-home --disabled-password --disabled-login --home /opt/hadoop hdfs
adduser --system --no-create-home --disabled-password --disabled-login --home /opt/hadoop yarn
adduser --system --no-create-home --disabled-password --disabled-login --home /opt/hadoop mapred
adduser hdfs hadoop
adduser yarn hadoop
adduser mapred hadoop
```
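After running these commands, you can sanity-check that all entries were created. A minimal sketch using `getent` (the `check_entry` helper is our own, for illustration):

```shell
# Sketch: verify the users and group exist after the commands above
# (check_entry is an illustrative helper, not part of Hadoop or Debian)
check_entry() {  # usage: check_entry <database> <name>
  if getent "$1" "$2" >/dev/null; then
    echo "OK: $1 $2"
  else
    echo "MISSING: $1 $2"
  fi
}

for u in hdfs yarn mapred; do check_entry passwd "$u"; done
check_entry group hadoop
```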
Please note that `/opt/hadoop` is the directory where you decided to place the local Hadoop instance (see below). Adjust this path accordingly during installation of the Hadoop instance.
To make use of native system libraries for compression in Hadoop, you need the following packages installed on a Debian-like machine:

```shell
apt-get install zlib1g
apt-get install libsnappy1
apt-get install libssl-dev
```
Note: the default Hadoop binary distribution is not compiled with support for native BZip2 compression.
The steps below assume you are using the `root` user. The installation will be done in the `/opt` system directory.
Inside the `/opt` directory we will have the following directory layout:
```
/opt/apache-hadoop-VERSION
/opt/apache-hadoop -> apache-hadoop-VERSION
/opt/hadoop
```
The `/opt/apache-hadoop-VERSION` directory is an unpacked Hadoop binary distribution of a certain VERSION (e.g., `2.7.1`). The `/opt/apache-hadoop` symlink points to the Hadoop binary distribution you wish to use (normally the latest version). Finally, the `/opt/hadoop` directory contains the local instance with custom configuration and data files (e.g., HDFS objects). This directory contains only data and configuration files, not binaries (see later).
This directory structure is designed to make Hadoop version upgrades much easier. See the Upgrade section for details.
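The upgrade-friendly property of this layout can be demonstrated in isolation. The sketch below uses a scratch directory instead of `/opt`, and the version numbers are illustrative:

```shell
# Sketch: upgrading means repointing a single symlink
# (uses a scratch directory so it does not touch /opt)
set -e
base=$(mktemp -d)
mkdir "$base/apache-hadoop-2.7.1" "$base/apache-hadoop-2.7.2"

ln -sfn apache-hadoop-2.7.1 "$base/apache-hadoop"   # install 2.7.1
readlink "$base/apache-hadoop"                      # -> apache-hadoop-2.7.1

ln -sfn apache-hadoop-2.7.2 "$base/apache-hadoop"   # upgrade: repoint the symlink
readlink "$base/apache-hadoop"                      # -> apache-hadoop-2.7.2

rm -rf "$base"
```

Because `ln -sfn` replaces the link atomically from the caller's point of view, rolling back is just repointing the symlink to the previous version directory.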
- Download the latest Hadoop binary distribution from here.

- Unpack the binary distribution in the `/opt` system directory:

  ```shell
  tar zxvf /path/to/downloaded/hadoop-VERSION.tar.gz
  chown -R root:root hadoop-VERSION
  ```

- Create the proposed directory layout for easy versioning:

  ```shell
  mv hadoop-VERSION apache-hadoop-VERSION
  ln -sfn apache-hadoop-VERSION apache-hadoop
  ```

- Set up the local instance directory:

  ```shell
  git clone https://github.com/hhromic/easy-hadoop-deployment hadoop
  cd hadoop
  scripts/setup.sh
  ```

- The `setup.sh` script will ask questions to customize your Hadoop instance. It will do its best to guess the values you should use.
Once you have a working central Hadoop installation, all you have to do is copy the installed directories to all your designated node machines. You don't need to change any configuration settings; just launch the node-related services (see the Starting/Stopping Services section).
Before transferring the installation to the node servers, please continue with the next section (Configuration).
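When copying, use an archive-preserving copy so the symlink-based layout survives. The sketch below demonstrates this locally with `cp -a` into a scratch directory; in practice you would use `rsync -a` or `scp` to each node host (the scratch paths and version number are illustrative):

```shell
# Sketch: an archive copy preserves the symlink-based /opt layout
# (scratch directories stand in for /opt on the central and node machines)
set -e
central=$(mktemp -d)
node=$(mktemp -d)

mkdir "$central/apache-hadoop-2.7.1" "$central/hadoop"
ln -sfn apache-hadoop-2.7.1 "$central/apache-hadoop"

cp -a "$central/." "$node/"        # -a keeps symlinks, modes and timestamps
readlink "$node/apache-hadoop"     # -> apache-hadoop-2.7.1

rm -rf "$central" "$node"
```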
[TODO]
[TODO]
[TODO]