
HBacker

Ruby Tool to Export / Import HBase to S3, HDFS or local file system.

THIS HAS BARELY BEEN TESTED

It seems to work. Still testing it. Haven't exercised the incremental export / import yet.

NOTE: The incremental export/import will not handle deletes to HBase; there is no way to know that a delete occurred. Our use case does not have deletes (we never delete rows), so it's fine for that.

Please feel free to fork, but do send pull requests or file issue reports if you find problems.

Dependencies

  • hbase 0.20.3 – hbase 0.90.x
  • hadoop 0.20.2
  • Stargate installed on hbase master
  • beanstalkd
  • beanstalkd ruby gem
  • stalker ruby gem
  • AWS SimpleDB
  • right_aws ruby gem
  • hbase-stargate ruby gem
  • yard ruby gem
  • bundler ruby gem
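
Since the project is driven with bundler, the gem dependencies above map roughly onto a Gemfile like the sketch below. The exact gem names (in particular beanstalk-client for the "beanstalkd ruby gem") and the lack of version pins are assumptions, not copied from hbacker's actual Gemfile.

# Sketch only; the real hbacker Gemfile may use different names/versions
source "http://rubygems.org"

gem "right_aws"        # AWS SimpleDB (and S3) access
gem "hbase-stargate"   # Stargate REST client for HBase
gem "beanstalk-client" # assumed name for the "beanstalkd ruby gem"
gem "stalker"          # simple job DSL on top of beanstalkd

group :development do
  gem "yard"           # documentation
end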

Setting up Prerequisites

Stargate

From the hbase-stargate docs:

  1. Download and unpack the most recent release of HBase from http://hadoop.apache.org/hbase/releases.html#Download
  2. Edit ${HBASE_HOME}/conf/hbase-env.sh and uncomment/modify the following line to correspond to your Java home path: export JAVA_HOME=/usr/lib/jvm/java-6-sun
  3. Copy ${HBASE_HOME}/contrib/stargate/hbase-stargate.jar into ${HBASE_HOME}/lib
  4. Copy all the files in the ${HBASE_HOME}/contrib/stargate/lib folder into ${HBASE_HOME}/lib
  5. Start up HBase: $ ${HBASE_HOME}/bin/start-hbase.sh
  6. Start up Stargate (append "-p 1234" at the end if you want to change the port from the 8080 default): $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.stargate.Main
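
Once Stargate is up, you can sanity-check it from Ruby with the hbase-stargate gem (which hbacker uses). A minimal check, assuming Stargate is running on the default port 8080:

require 'stargate'

# Point at the Stargate REST endpoint on the HBase master
client = Stargate::Client.new("http://localhost:8080")

# List the tables Stargate can see; this raises if Stargate is unreachable
puts client.list_tables.inspect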

beanstalkd

If you are lucky (Ubuntu 10 onward)

sudo apt-get install beanstalkd

From source (only tested on Ubuntu 9.10)


sudo apt-get update
sudo apt-get install libevent-1.4-2 libevent-dev 
wget --no-check-certificate https://github.com/downloads/kr/beanstalkd/beanstalkd-1.4.6.tar.gz
tar xzf beanstalkd-1.4.6.tar.gz
cd beanstalkd-1.4.6
./configure
make

Configure HDFS to have S3 credentials

Assuming you will be using S3 for storing the backups, you'll need to give Hadoop HDFS proper access to S3. You will need to update ${HADOOP_HOME}/conf/hdfs-site.xml with something like:


  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YourAwsAccessKeyId</value>
  </property>

  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YourAwsSecretAccessKey</value>
  </property>

  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YourAwsAccessKeyId</value>
  </property>

  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YourAwsSecretAccessKey</value>
  </property>

You can use AWS IAM to create a user and policy to just allow Hadoop to access specific buckets and so on. That’s a tutorial all on its own.

Configure AWS credentials for Hbacker to access AWS SimpleDB

You will also need to create a yaml file with suitable AWS credentials to access SimpleDB. By default Hbacker will look for the file in ~/.aws/aws.config. The format of that file is:


	access_key_id: YourAwsAccessKeyId
	secret_access_key: YourAwsSecretAccessKey

They don’t have to be the same keys as Hadoop HDFS uses.
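
For reference, here is a minimal Ruby sketch of how a file in that format can be read and used with the right_aws gem. It only illustrates the file format and the gem; it is not necessarily how hbacker itself wires things up.

require 'yaml'
require 'right_aws'

# Load the credentials file described above (hbacker's default location)
creds = YAML.load_file(File.expand_path("~/.aws/aws.config"))

# Open a SimpleDB connection with right_aws and list the visible domains
sdb = RightAws::SdbInterface.new(creds["access_key_id"], creds["secret_access_key"])
puts sdb.list_domains[:domains].inspect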

Starting things

Have beanstalkd running

beanstalkd -d will run beanstalkd in daemon mode on the default port, 11300

A nice tool to monitor what's going on with your beanstalkd is https://github.com/dustin/beanstalk-tools. From within the beanstalk-tools directory you can run:

./bin/beanstalk-queue-stats.rb localhost:11300
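
You can also get similar numbers from Ruby. A quick check, assuming the beanstalk-client gem (which the stalker gem is built on); the stats keys shown are standard beanstalkd statistics:

require 'beanstalk-client'

# Connect to the local beanstalkd and dump a couple of queue statistics
conn = Beanstalk::Connection.new('localhost:11300')
stats = conn.stats
puts "jobs ready: #{stats['current-jobs-ready']}, workers: #{stats['current-workers']}"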

Start the workers

YOU MUST START SOME WORKERS BEFORE RUNNING HBACKER

A worker job is based on a single HBase table. Normally it will record the start, status, and table schema to SimpleDB and then run the Hadoop job to export or import the table. It waits around to handle errors and record the completion.

The number of workers you spawn should be around the number of tables your Hadoop cluster can operate on in parallel. A big table (having a lot of rows and/or non-nil columns) will use multiple Map/Reduce jobs at once, so it's a bit hard to figure it all out in advance.

The Hbacker master process will stop making more job requests if the Hadoop cluster gets too busy or if the Hbacker worker queue starts to back up, but it's not very scientific yet. There are a number of hbacker command-line parameters with which you can override some of this behavior.
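
As a rough sketch of the worker pattern described above, this is what a Stalker-based worker looks like. The job name and arguments here are hypothetical, not hbacker's actual definitions:

require 'stalker'
include Stalker

# Hypothetical job; hbacker's real jobs record state to SimpleDB and
# shell out to the Hadoop export/import for the given table
job 'table.export' do |args|
  puts "exporting #{args['table']}"
end

# Block and process jobs from the local beanstalkd
Stalker.work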

Right now the easiest way to start the workers is to run something like the following on the machine that will be making the Hadoop requests (the HBase master works fine for this). Run it as a user that has the proper permissions to run the Hadoop HBase jobs.

This will run 4 workers:
for i in {1..4}; do bundle exec bin/hbacker_worker 1> LOGS/worker_$i.log 2>&1 & done

Start the main process

Best to run this as the user who has hadoop access to run the hadoop map reduce jobs.

hbacker help will tell you the commands available

hbacker help export will tell you the options for the export (backup) command

Sample export command:
This will export all tables from the cluster whose master is hbase-master.example.com to the S3 bucket example-hbase-backups. It will create a "sub-directory" in the bucket whose name is a timestamp derived from Time.now.utc, and then put a "sub-directory" per HBase table under example-hbase-backups/<timestamp>/

hbacker export -H hbase-master.example.com -D s3n://example-hbase-backups/ --mapred-max-jobs=18 -a &> main.log &
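
The session "sub-directory" name appears to be a compact UTC timestamp (compare the --session-name=20110802_060004 value in the import examples below). Something like the following would produce it, though the exact strftime pattern is an assumption:

# Assumed format, inferred from session names like 20110802_060004
session_name = Time.now.utc.strftime("%Y%m%d_%H%M%S")
puts session_name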

Issues and Caveats

  • Export / Import to/from HDFS should work, but is untested.
  • When doing Export / Import to/from HDFS, the command history is, for now, stored in a directory under the current directory where the hbacker program is run. Eventually it will be stored in HDFS as well.

Example Usage

You only need the bundle exec prefix if you are running this from within the hbacker source directory. If it's installed as a gem, you would not prefix the command with bundle exec bin/.

Startup 4 workers

for i in {1..4}; do bundle exec bin/hbacker_worker > /tmp/dump/worker_${i}.log 2>&1 & done

or

for i in {1..4}; do hbacker_worker > /tmp/dump/worker_${i}.log 2>&1 & done

Running export to a local filesystem

bundle exec bin/hbacker export --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase \
  -H localhost -D file:///tmp/dump/ -t <table-name>

or

hbacker export --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase -H localhost \
  -D file:///tmp/dump/ -t <table-name>

Another example:

hbacker export --hadoop_home=/apps/hadoop --hbase_home=/apps/hbase -H localhost -D file:///home/rberger/work/dump/ \
  -t furtive_staging_consumer_events_3413a157-4bec-4c9c-b080-626ce883202d --hbase_port 8090

Running import from a local filesystem

bundle exec bin/hbacker import --session-name=<timestamp-dir> --source-root=file:///tmp/dump/ \
  --import-hbase-host=localhost -H localhost --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase \
  -t <table-name>

or

hbacker import --session-name=<timestamp-dir> --source-root=file:///tmp/dump/ --import-hbase-host=localhost \
  -H localhost --hadoop_home=/absolute/path/to/hadoop --hbase_home=/absolute/path/to/hbase -t <table-name>

or

hbacker import --session-name=20110802_060004 --hadoop_home=/apps/hadoop --hbase_home=/apps/hbase -H localhost \
  -I localhost --source-root=file:///home/rberger/work/dump/ \
  -t rob_test_furtive_staging_consumer_events_3413a157-4bec-4c9c-b080-626ce883202d --hbase_port 8090

Todos and other notes for the future

  • Have an easy way to continue a backup that was aborted with minimal redundant copying on recovery.
  • An easy way to do an incremental export based on the last export (right now you have to manually figure out the parameters for the next increment).
  • Make sure Export/Import to/from HDFS work and the command logs are stored there too.
  • Replace/Update the standard HBase export/import module to get some metrics out
    * Number of rows exported/imported
    * Some kind of hash or other mechanism to use for data integrity checking
  • Propose a mechanism to be added to HBase to robustly log deletes. Backup schemes like this one could then use the delete log to allow restores of tables that have had rows deleted.

Copyright

Copyright © 2011 Robert J. Berger and Runa, Inc.
