This module contains an example Sqoop import script for any number of tables.
If you are new to this area, it could be useful as a starting point. Feel free to use it in your own projects.
- The list of tables to be imported is defined in tables.txt. The file's columns are self-explanatory. Commented and blank lines are ignored.
- The script assumes that you have a HiveServer2 available to connect to.
- It attempts to delete the previous copy of the data from HDFS.
- It lets Sqoop take care of the import into Hive.
- Tables are processed one at a time and any errors are printed to the console (a rough sketch of this flow follows below).
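As an illustration only (this is not the actual script), the per-table flow described above might look roughly like the following. The tables.txt layout, variable names, and connection options are assumptions:

    # Minimal sketch, assuming tables.txt lists one table name per line
    # and that JDBC_URL and TARGET_DIR are exported by sqoop-import.sh
    # (credentials omitted for brevity).
    grep -Ev '^[[:space:]]*(#|$)' tables.txt | while read -r TABLE; do
      # Delete the previous copy of the data from HDFS (-f: ignore if absent)
      hdfs dfs -rm -r -f "${TARGET_DIR}/${TABLE}"
      # Let Sqoop take care of the import into Hive
      sqoop import \
        --connect "$JDBC_URL" \
        --table "$TABLE" \
        --hive-import \
        || echo "Import of ${TABLE} failed" >&2
    done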
Use this section if you need to map a specific DB column which is not recognized by default.
You can add multiple DB column names as a comma-separated list, in the form columnName=DataType (see the example after the type list below).
The possible datatypes you can use are:
- Integer
- Long
- Float
- Double
- Boolean
- String
- java.sql.Date
- java.sql.Time
- java.sql.Timestamp
- java.math.BigDecimal
- com.cloudera.sqoop.lib.ClobRef (for CLOB DB columns; will become a Hive BINARY column)
- com.cloudera.sqoop.lib.BlobRef (for BLOB DB columns; will become a Hive BINARY column)
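For example, assuming the mapping ends up on Sqoop's standard --map-column-java option (the column names below are made up for illustration):

    # Hypothetical columns; --map-column-java itself is a standard Sqoop flag
    sqoop import \
      --connect "$JDBC_URL" \
      --table ORDERS \
      --map-column-java CREATED_AT=java.sql.Timestamp,NOTES=String \
      --hive-import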
Important: Only use this functionality if a column is causing issues or you want to optimize some downstream process. In most cases, Hive's default translation is optimal.
Large objects are always imported as non-materialized objects, i.e. the data is stored in a separate sub-folder.
See
- Hive's Large Objects documentation
- Detailed explanation in Hadoop: The Definitive Guide
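If you need to tune this behaviour yourself, Sqoop's --inline-lob-limit option controls the size threshold above which a LOB is stored externally (in a _lobs subdirectory of the import target directory). Whether and how this script sets it is an assumption here; forcing all LOBs external would look like:

    # Force all LOBs to be stored externally rather than inline
    sqoop import --connect "$JDBC_URL" --table DOCS --inline-lob-limit 0 --hive-import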
Before you can use the script in your environment, make sure that
- you review and set the right environment values in sqoop-import.sh
- you use the right JDBC URL in import.sh (the example one is for SQL Server; see the sketch below)
- you modify the Hive table paths and types in import.sh
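For reference, a SQL Server JDBC URL generally follows this shape (host, port, and database name below are placeholders):

    # Placeholder values; adjust to your own database
    JDBC_URL="jdbc:sqlserver://dbhost.example.com:1433;databaseName=sales"
    # For comparison, a MySQL URL would instead look like:
    # JDBC_URL="jdbc:mysql://dbhost.example.com:3306/sales"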
- Copy or check out the project onto your Sqoop client machine
cd /path/to/sqoop-import/scripts
- Run ./sqoop-import.sh; there should be no errors in the output
When working with the scripts on Windows or Mac, make sure that your editor does NOT change the line endings of the Bash scripts; they must stay Unix-style ("Linux line feed", LF).
For example, in IntelliJ this is controlled by the Line Separators setting, which should be set to LF. A quick way to check and fix the endings from the command line is sketched below.
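    # 'file' reports 'with CRLF line terminators' for Windows-style scripts
    file *.sh
    # Convert to Unix line endings in place (dos2unix must be installed)
    dos2unix *.sh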
You may want to consider the following as possible improvements (or as exercises), depending on your needs:
- adding support for multiple mapped columns
- nohup-ing the execution of the actual import, to parallelize the imports when there are many tables (a rough sketch follows below)
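A rough sketch of the nohup idea, assuming a hypothetical helper script sqoop-import-one.sh that imports a single table:

    # sqoop-import-one.sh is hypothetical; one background job per table
    while read -r TABLE; do
      [[ -z "$TABLE" || "$TABLE" == \#* ]] && continue   # skip blanks/comments
      nohup ./sqoop-import-one.sh "$TABLE" > "logs/${TABLE}.log" 2>&1 &
    done < tables.txt
    wait   # block until all background imports have finished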