This sample app will get you up and running quickly with a Hadoop cluster on Google Compute Engine. For more information on running Hadoop on GCE, read the papers at https://cloud.google.com/resources/.


Google Compute Engine Cluster for Hadoop

Copyright

Copyright 2013 Google Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Disclaimer

The application is not an official Google product.

Summary

The application sets up Google Compute Engine instances as a Hadoop cluster and executes MapReduce tasks.

The purpose of the application is to demonstrate how to leverage Google Compute Engine for parallel processing with MapReduce on Hadoop.

The application is not meant to maintain a persistent Hadoop cluster. Because the application is designed to delete all disks at the deletion of the cluster, all data on the Hadoop cluster instances, including data in HDFS (Hadoop Distributed FileSystem), will be lost when the Hadoop cluster is torn down. For persistent storage beyond the lifespan of the cluster, use external persistent storage, such as Google Cloud Storage.

The application takes advantage of MapReduce tasks to parallelize copying MapReduce input and output between Google Cloud Storage and HDFS.

Prerequisites

The application assumes Google Cloud Storage and Google Compute Engine services are enabled on the project. The application requires sufficient Google Compute Engine quota to run a Hadoop cluster.

gcutil and gsutil

The application uses gcutil and gsutil, the command line tools for Google Compute Engine and Google Cloud Storage respectively. The tools are distributed as part of the Google Cloud SDK. Follow the instructions at the link to install them.

Authentication and default project

After the installation of Cloud SDK, run the following command to authenticate, as instructed on the page.

gcloud auth login

The above command authenticates the user, and sets the default project. The default project must be set to the project where the Hadoop cluster will be deployed. The project ID is shown at the top of the project's "Overview" page of the Developers Console.

The default project can be changed with the command:

gcloud config set project <project ID>
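
The active configuration, including the default project, can be checked at any time with the following command (a quick sanity check; the exact output format may vary between Cloud SDK versions):

gcloud config list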
SSH key

The application uses SSH to start MapReduce tasks on the Hadoop cluster. In order for the application to execute remote commands automatically, gcutil must have an SSH key with an empty passphrase for Google Compute Engine instances.

gcutil uses its own SSH key, which is stored in the .ssh directory of the user's home directory as $HOME/.ssh/google_compute_engine (private key) and $HOME/.ssh/google_compute_engine.pub (public key).

If the SSH key for gcutil doesn't already exist, it can be created with the command:

ssh-keygen -t rsa -q -f $HOME/.ssh/google_compute_engine

Alternatively, the SSH key for gcutil can be generated upon the creation of an instance. The following command creates an instance and, if an SSH key does not already exist on the computer, generates one.

gcutil addinstance --zone us-central1-a --machine_type f1-micro  \
    --image debian-7-wheezy <instance-name>

The instance is created only to create an SSH key, and is safe to delete.

gcutil deleteinstance -f --delete_boot_pd <instance-name>

If an SSH key with a passphrase already exists, the key files must be renamed or deleted before creating the SSH key with an empty passphrase.

Environment

The application runs with Python 2.7. It's tested on Mac OS X and Linux.

Alternatively, the application can be run on a Google Compute Engine instance, which then acts as a controller for the Hadoop cluster, as sketched below.
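
For example, such a controller instance could be created with gcutil along the lines of the command below (a sketch only; the machine type and zone are arbitrary choices, and the prerequisites above must still be installed on that instance):

gcutil addinstance --zone us-central1-a --machine_type n1-standard-1  \
    --image debian-7-wheezy <controller-instance-name>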

Known Issues

  • On the HDFS NameNode Web console (http://<master external IP address>:50070), the "Browse the filesystem" link does not work. To browse HDFS from the Web UI, open the live node list via the "Live Nodes" link and click the IP address of any DataNode.
  • The script does not work properly if the path to the application directory contains whitespace. Please download the application to a path that does not contain whitespace.
  • The application creates a hidden file named .hadoop_on_compute.credentials under the user's home directory to cache OAuth2 information. It is created automatically the first time access is authorized by the user. If incorrect information is cached and Google Compute Engine API access fails, remove the file and authenticate again (see the command after this list).
  • Without additional security measures, which fall outside the scope of the application, Hadoop's Web UI is open to the public. Resources on securing Hadoop are available on the Web.
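
If the cached OAuth2 credentials mentioned above need to be cleared, the file can simply be removed; the next run of the application then repeats the authorization flow:

rm $HOME/.hadoop_on_compute.credentials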

Set up Instructions

Setting up the environment should be done in the directory of the application (the "root directory of the application"). In the following instructions, commands are expected to be run from the root directory.

Prepare Hadoop package

Download Hadoop package

Download the Hadoop 1.2.1 package (hadoop-1.2.1.tar.gz) from one of the Apache Hadoop mirror sites or from the Apache Hadoop archives.

Put the downloaded package in the root directory of the application, where hadoop-1.2.1.patch exists. The file can be downloaded with a Web browser and copied to the working directory, or fetched with a command line tool such as curl or wget.

curl -O http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Customize Hadoop configuration and re-package

Apply hadoop-1.2.1.patch to customize Hadoop configuration. From the root directory of the application, execute the following commands.

tar zxf hadoop-1.2.1.tar.gz
patch -p0 < hadoop-1.2.1.patch
tar zcf hadoop-1.2.1.tar.gz hadoop-1.2.1

Custom Hadoop configuration changes can be made after the patch is applied in the steps above, so that they are included in the re-created package.
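
For example, a configuration file under hadoop-1.2.1/conf can be edited before re-creating the archive (a sketch only; which file and properties to change depends on the cluster, and any editor may be used):

# edit configuration files as needed, for example:
vi hadoop-1.2.1/conf/mapred-site.xml
# re-create the archive so the changes are included in the upload
tar zcf hadoop-1.2.1.tar.gz hadoop-1.2.1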

Download Open JDK and Dependent Packages

The application uses Open JDK as its Java runtime environment. The Open JDK Java Runtime Environment is distributed under the GNU General Public License version 2. Users must agree to the license in order to use Open JDK.

Create a directory called "deb_packages" under the root directory of this application.

Download the amd64 package of openjdk-6-jre-headless and the architecture-common package of openjdk-6-jre-lib, together with their dependent packages, and put them into the deb_packages directory. The following commands download all of the required packages.

mkdir deb_packages
cd deb_packages
curl -O http://security.debian.org/debian-security/pool/updates/main/o/openjdk-6/openjdk-6-jre-headless_6b27-1.12.6-1~deb7u1_amd64.deb
curl -O http://security.debian.org/debian-security/pool/updates/main/o/openjdk-6/openjdk-6-jre-lib_6b27-1.12.6-1~deb7u1_all.deb
curl -O http://http.us.debian.org/debian/pool/main/n/nss/libnss3-1d_3.14.3-1_amd64.deb
curl -O http://http.us.debian.org/debian/pool/main/n/nss/libnss3_3.14.3-1_amd64.deb
curl -O http://http.us.debian.org/debian/pool/main/c/ca-certificates-java/ca-certificates-java_20121112+nmu2_all.deb
curl -O http://http.us.debian.org/debian/pool/main/n/nspr/libnspr4_4.9.2-1_amd64.deb
cd ..

Prepare Google Cloud Storage bucket

Create a Google Cloud Storage bucket, from which the Google Compute Engine instances download Hadoop and other packages.

This can be done with the gsutil command line tool or from the Developers Console.

Note that this bucket can be different from the bucket where the MapReduce input and output are located. Make sure to create the bucket in the same Google Cloud project as the default project configured above.
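
For example, the bucket can be created with gsutil (substitute the project ID and a globally unique bucket name):

gsutil mb -p <project ID> gs://<bucket name>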

Create client ID and client secret

A client ID and client secret are required by OAuth2 authorization to identify the application. They are required in order for the application to access Google APIs (in this case, the Google Compute Engine API) on behalf of the user.

The client ID and client secret can be set up from the "APIs & auth" menu of the project's Developers Console. Choose the "Credentials" submenu, and click the red button at the top labeled "CREATE NEW CLIENT ID" to create a new pair of client ID and client secret for the application.

Choose "Installed application" as "Application type", choose "Other" as "Installed application type" and click "Create Client ID" button.

Replace CLIENT_ID and CLIENT_SECRET values in GceCluster class in gce_cluster.py with the values created in the above step.

CLIENT_ID = '12345....................com'
CLIENT_SECRET = 'abcd..........'

Download and set up Python libraries

The following instructions explain how to set up the additional libraries for the application.

Google Client API

The Google Client API is a library for accessing various Google services via their APIs.

Download google-api-python-client-1.2.tar.gz from the download page or with the following command.

curl -O http://google-api-python-client.googlecode.com/files/google-api-python-client-1.2.tar.gz

Set up the library in the root directory of the application.

tar zxf google-api-python-client-1.2.tar.gz
ln -s google-api-python-client-1.2/apiclient .
ln -s google-api-python-client-1.2/oauth2client .
ln -s google-api-python-client-1.2/uritemplate .
Httplib2

Httplib2 is used internally by the Google Client API. Download httplib2-0.8.tar.gz from the download page or with the following command.

curl -O https://httplib2.googlecode.com/files/httplib2-0.8.tar.gz

Set up the library in the root directory of the application.

tar zxf httplib2-0.8.tar.gz
ln -s httplib2-0.8/python2/httplib2 .
Python gflags

gflags is used internally by the Google Client API. Download python-gflags-2.0.tar.gz from the download page or with the following command.

curl -O http://python-gflags.googlecode.com/files/python-gflags-2.0.tar.gz

Set up the library in the root directory of the application.

tar zxf python-gflags-2.0.tar.gz
ln -s python-gflags-2.0/gflags.py .
ln -s python-gflags-2.0/gflags_validators.py .
Python mock (required only for unit tests)

mock is a mocking library for Python. It is included in the Python standard library starting with Python 3.3. However, since the application uses Python 2.7, it needs to be set up separately.

Download mock-1.0.1.tar.gz from the download page or with the following command.

curl -O https://pypi.python.org/packages/source/m/mock/mock-1.0.1.tar.gz

Set up the library in the root directory of the application.

tar zxf mock-1.0.1.tar.gz
ln -s mock-1.0.1/mock.py .

Usage of the Application

compute_cluster_for_hadoop.py

compute_cluster_for_hadoop.py is the main script of the application. It sets up the Google Compute Engine and Google Cloud Storage environment for the Hadoop cluster, starts up the cluster, initiates MapReduce jobs, and tears down the cluster.

Show usage

./compute_cluster_for_hadoop.py --help

compute_cluster_for_hadoop.py has 4 subcommands: setup, start, mapreduce and shutdown. Refer to the following usage messages for the available options.

./compute_cluster_for_hadoop.py setup --help
./compute_cluster_for_hadoop.py start --help
./compute_cluster_for_hadoop.py mapreduce --help
./compute_cluster_for_hadoop.py shutdown --help

Set up environment

The 'setup' subcommand sets up the environment. It performs the following steps:

  • Creates the SSH key used for communication between the Hadoop instances.
  • Uploads the Hadoop and Open JDK packages to Google Cloud Storage, so that the Google Compute Engine instances can download them to set up Hadoop.
  • Sets up the firewall in the Google Compute Engine network to allow users to access the Hadoop Web consoles.

'setup' must be performed at least once per combination of Google Compute Engine project and Google Cloud Storage bucket, and may safely be run repeatedly for the same project and/or bucket. The project ID can be found in the "PROJECT ID" column of the project list in the Developers Console or at the top of the project's page in the Developers Console. The bucket name is the name of the Google Cloud Storage bucket without the "gs://" prefix.

Execute the following command to set up the environment.

./compute_cluster_for_hadoop.py setup <project ID> <bucket name>

Start cluster

The 'start' subcommand starts the Hadoop cluster. By default, it starts 6 instances: one master and 5 workers (that is, the default number of workers is 5).

./compute_cluster_for_hadoop.py start <project ID> <bucket name>  \
    [number of workers] [--prefix <prefix>]  \
    [--data-disk-gb <size>]

Each instance is started with a Persistent Disk used for HDFS. The size of the disk can be set with the --data-disk-gb parameter. The default size is 500GB. Review the documentation for help in determining the disk size that provides the right performance for the cluster.
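
For example, the following command (with illustrative values) starts a cluster with 10 workers, each instance carrying a 1024GB data disk:

./compute_cluster_for_hadoop.py start <project ID> <bucket name> 10  \
    --data-disk-gb 1024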

The first time the script is run, it requires the user to log in and asks for authorization to access Google Compute Engine. By default, the command opens a Web browser for the authorization. If the script is run on a remote host from a terminal (over SSH, for example), it cannot open a Web browser on the local machine. In this case, the --noauth_local_webserver option can be specified, as instructed by the message, as follows.

./compute_cluster_for_hadoop.py --noauth_local_webserver start  \
    <project ID> <bucket name> [number of workers]  \
    [--prefix <prefix>] [--data-disk-gb <size>]

This avoids the attempt to open a local Web browser and instead shows a URL for authentication and authorization. When authorization succeeds in the Web browser, the page shows a code to paste into the terminal. Pasting the correct code completes the authorization process, and the script can then access Google Compute Engine through the API.

As shown in the console log, the HDFS and MapReduce Web consoles are available at http://<master external IP address>:50070 and http://<master external IP address>:50030, respectively.

The 'start' subcommand accepts a custom command to be executed on each instance via the --command option. For example, it can be used to install required software and/or download necessary files onto each instance. The custom command is executed with the permissions of the user who started the instance; if superuser privileges are required, use sudo in the command.
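
For example, extra software could be installed on every instance at start-up along these lines (a sketch; the package name is a placeholder):

./compute_cluster_for_hadoop.py start <project ID> <bucket name>  \
    --command 'sudo apt-get install -y <package name>'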

External IP Addresses on Worker Instances

By default, all Google Compute Engine instances created by the application are equipped with external IP addresses.

A Hadoop cluster with no external IP addresses on the worker instances can be started by passing the --external-ip=master option to the 'start' subcommand.

./compute_cluster_for_hadoop.py start <project ID> <bucket name> ... --external-ip=master

Quota and security are some of the reasons to set up a cluster without external IP addresses. Note that the master instance always has an external IP address, so that MapReduce tasks can be started.

If a Google Compute Engine instance is created without an external IP address, it cannot access the Internet directly, including accessing Google Cloud Storage. When --external-ip option is set to "master", however, the application sets up routing and NAT via the master instance, so that workers can still access the Internet transparently. Note that all traffic to the Internet from the workers, then, goes through the single master instance. When the workers are created with external IP addresses (the default), their traffic goes directly to the Internet.

Start MapReduce

The 'mapreduce' subcommand starts a MapReduce task on the Hadoop cluster. It requires the project ID and a bucket for temporary use. The temporary bucket is used to transfer the mapper and reducer programs, and may or may not be the same as the bucket used to set up the cluster.

Input files must be located in a single directory on Google Cloud Storage, from which they are copied to the Hadoop cluster as the mapper's input. Either that directory or a single file on Google Cloud Storage can be specified as the input.

The --input and --output parameters are required and must point to a directory on Google Cloud Storage, starting with "gs://". Alternatively, --input can be a single file on Google Cloud Storage, again starting with "gs://". --input, --output and the temporary bucket can belong to different buckets.

The output of the MapReduce task is copied to the specified output directory on Google Cloud Storage. Existing files in the directory may be overwritten. The output directory does not need to exist in advance.

The command uses Hadoop streaming for MapReduce processing. The mapper and the reducer must read their input from standard input and write their output to standard output. If a local file is specified as the mapper and/or the reducer, it is copied to the Hadoop cluster through Google Cloud Storage. Alternatively, files on Google Cloud Storage may be used as the mapper or reducer.
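
Because the mapper and reducer only read standard input and write standard output, they can be tested locally before submitting a job. A rough local approximation of the streaming pipeline using the bundled samples looks like this (assuming the sample scripts are executable; sort stands in for Hadoop's shuffle phase):

cat <local input file> | sample/shortest-to-longest-mapper.pl  \
    | sort | sample/shortest-to-longest-reducer.pl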

If the mapper or reducer requires additional files, such as data files or libraries, they can be distributed with the --command option of the 'start' subcommand.

If the mapper or reducer is not specified, that step simply copies its input to its output. Specifying 0 as --reducer-count skips the shuffle and reduce phases, making the mapper output the final output of the MapReduce job.

The sample directory in the application includes a sample mapper and reducer that count word occurrences in the input files, ordering the words from shortest to longest and alphabetically among words of the same length.

Example:

./compute_cluster_for_hadoop.py mapreduce <project ID> <bucket name> [--prefix <prefix>]  \
    --input gs://<input directory on Google Cloud Storage>  \
    --output gs://<output directory on Google Cloud Storage>  \
    --mapper sample/shortest-to-longest-mapper.pl  \
    --reducer sample/shortest-to-longest-reducer.pl  \
    --mapper-count 5  \
    --reducer-count 1
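
As a variant (illustrative only), passing 0 as --reducer-count skips the shuffle and reduce phases, so the mapper output becomes the final output on Google Cloud Storage:

./compute_cluster_for_hadoop.py mapreduce <project ID> <bucket name> [--prefix <prefix>]  \
    --input gs://<input directory on Google Cloud Storage>  \
    --output gs://<output directory on Google Cloud Storage>  \
    --mapper sample/shortest-to-longest-mapper.pl  \
    --mapper-count 5  \
    --reducer-count 0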

Shut down cluster

The 'shutdown' subcommand deletes all instances in the Hadoop cluster.

./compute_cluster_for_hadoop.py shutdown <project ID> [--prefix <prefix>]

The application is designed to delete Google Compute Engine instances and disks when the cluster is shut down. Therefore, all files on the Hadoop cluster, including HDFS, are deleted when the cluster is shut down. For persistent storage beyond the lifespan of the cluster, use external persistent storage, such as Google Cloud Storage.

Prefix and zone

The start, mapreduce and shutdown subcommands take a string value as the "--prefix" parameter. The specified prefix is prepended to the instance names of the cluster. In order to start multiple clusters in the same project, specify a different prefix for each cluster, and run the mapreduce or shutdown commands with the appropriate prefix.

There are restrictions on the prefix string:

  • The prefix must be 15 characters or fewer.
  • Only lowercase alphanumeric characters and hyphens can be used.
  • The first character must be a lowercase letter.

Similarly, if a zone is specified with the --zone parameter of the start subcommand, the same zone must be specified for the mapreduce and shutdown subcommands, so that the MapReduce task and the shutdown are performed correctly.
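
For example, if a cluster is started with a non-default prefix and zone, the same values must be passed to the later subcommands (placeholders shown; the remaining mapreduce options are omitted):

./compute_cluster_for_hadoop.py start <project ID> <bucket name> --prefix <prefix> --zone <zone>
./compute_cluster_for_hadoop.py mapreduce <project ID> <bucket name> --prefix <prefix> --zone <zone> ...
./compute_cluster_for_hadoop.py shutdown <project ID> --prefix <prefix> --zone <zone>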

Cleaning up the environment

After the cluster is shut down, and if the Hadoop cluster environment is no longer needed, the following commands clean up the environment created by the setup command.

gcutil deletefirewall -f hdfs-datanode hdfs-namenode  \
    hadoop-mr-jobtracker hadoop-mr-tasktracker
gsutil rm -f -r gs://<bucket name>/mapreduce/

The same environment can be set up again with the setup command, allowing subsequent deployments of Hadoop clusters.

Log in to Hadoop master instance

hadoop-shell.sh provides a convenient way to log on to the master instance of the Hadoop cluster.

./hadoop-shell.sh [prefix]

When logged on to the master instance, the user is automatically switched to the Hadoop cluster's superuser ('hadoop'). The hadoop command can then be used to work with HDFS and Hadoop.

Example:

hadoop dfs -ls /

Unit tests

The application has 3 Python files: compute_cluster_for_hadoop.py, gce_cluster.py and gce_api.py. They have corresponding unit tests: compute_cluster_for_hadoop_test.py, gce_cluster_test.py and gce_api_test.py, respectively.

The unit tests can be executed directly.

./compute_cluster_for_hadoop_test.py
./gce_cluster_test.py
./gce_api_test.py

Note that some unit tests simulate error conditions, so those tests show error messages.
