Hive is a good tool for analysis of data. Hive runs on hadoop and therefore utilises the capabilities of hadoop i.e distributed file system and parallel processing of data, mapreduce among others. This implementation provides a simple guideline for implementing a data warehouse in hive. Here we are going to convert the sun models into star models. That is to say we provide fields with data types and then transalate the the star schemas into hive code to build the warehouse. Prior to this we need to validate some rules.
- An interlink matrix(containing the facts and dimensions) should be clearly set. Dimensions on the left side and facts on the top row.Remove any duplicates.
- Dimensions. The core selectors of the data warehouse. Tables containing information related to the business needs like customers, sales, stores.
- Facts are measures of the analytic model created from the sun model. Example of measure would be customer count.
- Check that the dimensions have the correct type of business needs(does it address the SCDS:type 0;type 1,type 2,type 3,outriggers,fast-changing,etc).
A sun model is an abstract representation of our business for which we'll implement an analytic model
- A retrieve database for collecting and storing data from the business. All the data from different sources is stored here.
- Assess database to validate the data used in for building the warehouse. Here we apply some basic transformations to our data ensure correctness, accuracy among other rules.
- process database is our data vault as it holds the data for the analytic model. An implementation of a process database will include hubs (business key: That rarely change) links (asscociation and transactions between the business keys satellites ( isolated tables for storing the detail attributes)
- Transform database will hold the analytic model(consisting of facts and dimensions)
hive - donwload a stable version from http://hive.apache.org/releases.html preferably verion 2 and above for hadoop-2.7+
NOTE: some hive realeases do not contain the windows executables, I recommend hive-2.1.1
2. unzip the file and place the contents in your preffered location. I used C:\hadoop-2.9.2\hive
3. set up user variables:
HIVE_HOME
as C:\hadoop-2.9.2\hive
HIVE_BIN
as C:\hadoop-2.9.2\hive\bin
HIVE_LIB
as C:\hadoop-2.9.2\hive\lib
add to the path variable C:\hadoop-2.9.2\hive\bin
,C:\hadoop-2.9.2\hive\lib
edit hive-site.mxl depending on the metastore database(I used mysql)
<configuration>
<!-- database connections -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepwd</value>
</property>
<property>
<name>datanucleus.schema.autoCreateTables</name>
<value>true</value>
</property>
property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<!-- Transactional support:ACID -->
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>10</value>
</property>
</configuration>
- Create directory for hive warehouse in hadoop. Start hadoop daemons and run the following . leave hadoop safemode using
hadoop dfsadmin -safemode leave
hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive
hadoop fs -mkdir /user/hive/warehouse
give read/write priviliges to these directories
hadoop fs -chmod 777 /tmp
hadoop fs -chmod 777 /user/hive
hadoop fs -chmod 777 /user/hive/warehouse
- Finally run hive using
hive
command If you run get into errors (hive cannot connect to the metastore database-due to schema problems), Create the schema manually before running hive for the first time. Inhive\scripts\metastore\
there are schemas for the metastore database.Using mysql for the metastore I created a metastore database and run the schema forhive-schema-2.1.0.mysql
against the database. Also the TXN tables are required to hold transaction information(To support database transactions in hive ). Run thehive-txn-schema-2.1.0.mysql
against the database.
csv files for experimentation with the warehouse.These files hold the data for building the warehouse.you can use custom files or agents such as sqoop or flume to get some data into your databases.Included are all the scripts for implementing the complete data warehouse.
run hive -f example.hql
for executing hive scripts containing hive code (easier for debugging). Or using hive prompt hive
command.Two modes of execution can not exist at the same time.
create some data marts(aggregations of the analytic model) for easy reporting and analysis.
Hive by default uses mapreduce as its execution engine, using Tez will improve execution time
using Tez as execution engine for hive set the following properties:
<property> <name>hive.execution.engine</name> <value>tez</value>
hive.prewarm.enabled=true
hive.prewarm.numcontainers=10
using file formats optimized for hive eg ORC, parquet set the following properties for ORC:
orc.compress = SNAPPY
orc.compress.size = 262144
orc.stripe.size= 64 MB
orc.row.index.stride =10000
orc.create.index= true
usage:
CREATE TABLE person_orc STORED AS ORC tblproperties("orc.compress"="SNAPPY") AS SELECT * FROM person;
hive by default executes by processing one row at a time which requires mutliple layers of virtual method calls within the nested loop(inefficient for CPU). Vectorized query execution eliminates this problem by reading rows in batch and applying operations to the entire collection instead of one.
hive.vectorized.execution.enabled=true
Cost based optimization engine This method use stats from tables to optimize queries.First we need to enable stats collection by setting the following property
hive.stats.autogather = true
Then we compute stats for tables we want to query from, using:
ANALYZE TABLE person_orc COMPUTE STATISTICS;
ANALYZE TABLE person_orc COMPUTE STATISTICS FOR COLUMNS;
Now we need to set properties for hive to use the CBO engine
hive.cbo.enable=true
hive.compute.query.using.stats = true
hive.stats.fetch.column.stats = true
hive.stats.fetch.partition.stats = true