This module contains an example Sqoop import script for any number of tables.
If you are new to this area, it could be useful as a starting point. Feel free to use it in your own projects.
- The list of tables to be imported is defined in tables.txt. The file's columns are self-explanatory. Commented and blank lines are ignored.
- The script assumes that you have a HiveServer2 available to connect to.
- It attempts to delete the previous copy of the data from HDFS.
- It lets Sqoop take care of the import into Hive.
- Tables are processed one at a time and any errors are printed to the console (a rough sketch of this flow follows below).
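As an illustration only (this is not the actual script), the per-table flow described above might look roughly like the following. The tables.txt layout, variable names, and connection options are assumptions:

    # Minimal sketch, assuming tables.txt lists one table name per line
    # and that JDBC_URL and TARGET_DIR are exported by sqoop-import.sh
    # (credentials omitted for brevity).
    grep -Ev '^[[:space:]]*(#|$)' tables.txt | while read -r TABLE; do
      # Delete the previous copy of the data from HDFS (-f: ignore if absent)
      hdfs dfs -rm -r -f "${TARGET_DIR}/${TABLE}"
      # Let Sqoop take care of the import into Hive
      sqoop import \
        --connect "$JDBC_URL" \
        --table "$TABLE" \
        --hive-import \
        || echo "Import of ${TABLE} failed" >&2
    done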
Use this section if you need to map a specific DB column which is not recognized by default.
You can add multiple DB column names as a comma-separated list, in the form columnName=DataType (see the example after the type list below).
The possible datatypes you can use are:
- Integer
- Long
- Float
- Double
- Boolean
- String
- java.sql.Date
- java.sql.Time
- java.sql.Timestamp
- java.math.BigDecimal
- com.cloudera.sqoop.lib.ClobRef (for CLOB DB columns; will become a Hive BINARY column)
- com.cloudera.sqoop.lib.BlobRef (for BLOB DB columns; will become a Hive BINARY column)
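For example, assuming the mapping ends up on Sqoop's standard --map-column-java option (the column names below are made up for illustration):

    # Hypothetical columns; --map-column-java itself is a standard Sqoop flag
    sqoop import \
      --connect "$JDBC_URL" \
      --table ORDERS \
      --map-column-java CREATED_AT=java.sql.Timestamp,NOTES=String \
      --hive-import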
Important: Only use this functionality if a column is causing issues or you want to optimize some downstream process. In most cases, Hive's default translation is optimal.
Large objects are always imported as non-materialized objects, i.e. the data is stored in a separate sub-folder.
See
- Hive's Large Objects documentation
- Detailed explanation in Hadoop: The Definitive Guide
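If you need to tune this behaviour yourself, Sqoop's --inline-lob-limit option controls the size threshold above which a LOB is stored externally (in a _lobs subdirectory of the import target directory). Whether and how this script sets it is an assumption here; forcing all LOBs external would look like:

    # Force all LOBs to be stored externally rather than inline
    sqoop import --connect "$JDBC_URL" --table DOCS --inline-lob-limit 0 --hive-import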
Before you can use the script in your environment, make sure that
- you review and set the right environment values in sqoop-import.sh
- you use the right JDBC URL in import.sh (the example one is for SQL Server; see the sketch below)
- you modify the Hive table paths and types in import.sh
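For reference, a SQL Server JDBC URL generally follows this shape (host, port, and database name below are placeholders):

    # Placeholder values; adjust to your own database
    JDBC_URL="jdbc:sqlserver://dbhost.example.com:1433;databaseName=sales"
    # For comparison, a MySQL URL would instead look like:
    # JDBC_URL="jdbc:mysql://dbhost.example.com:3306/sales"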
- Copy or check out the project onto your Sqoop client machine
cd /path/to/sqoop-import/scripts
- Run ./sqoop-import.sh; there should be no errors in the output
When working with the scripts on Windows or Mac, make sure that your editor does NOT change the line endings of the Bash scripts; they must stay Unix-style ("Linux line feed", LF).
For example, in IntelliJ this is controlled by the Line Separators setting, which should be set to LF. A quick way to check and fix the endings from the command line is sketched below.
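    # 'file' reports 'with CRLF line terminators' for Windows-style scripts
    file *.sh
    # Convert to Unix line endings in place (dos2unix must be installed)
    dos2unix *.sh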
You may want to consider the following as possible improvements (or as exercises), depending on your needs:
- adding support for multiple mapped columns
- nohup-ing the execution of the actual import, to parallelize the imports when there are many tables (a rough sketch follows below)
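A rough sketch of the nohup idea, assuming a hypothetical helper script sqoop-import-one.sh that imports a single table:

    # sqoop-import-one.sh is hypothetical; one background job per table
    while read -r TABLE; do
      [[ -z "$TABLE" || "$TABLE" == \#* ]] && continue   # skip blanks/comments
      nohup ./sqoop-import-one.sh "$TABLE" > "logs/${TABLE}.log" 2>&1 &
    done < tables.txt
    wait   # block until all background imports have finished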