The idea of this repo is to provide some simple step by step guide to set up an isolated test/dev version Hive & PrestoDB locally and start playing around with it :-). This set up is totally NOT recommended for production workloads.
Note: As I mentioned in References this work is based in the containers build by Big Data Europe
Because our purpose is just to experiment, to accelerate the set up we could use docker images (to be accurate, docker-compose). First of all we need to clone this repo:
git@github.com:jordicenzano/hive-presto-tutorial.git
cd hive-presto-tutorial
The first step is to pull the docker images and start the cluster based on those images. To do that we'll execute the following command from the root of this project(*):
docker-compose up
(*) We are assuming that you have docker installed and configured, if not take a look to this guide
In our example we'll be analyzing the following file temp-data.csv, here you can see a sample:
2018-01-01T01:00:00,52
2018-01-01T02:00:00,50
2018-01-01T03:00:00,48
2018-01-01T04:00:00,48
2018-01-01T05:00:00,45
...
That file contains the temperature measured every hour from a weather station for all 2018.
The first step to use is to create the table with the data to analyze, we are going to use Hive for that, execute the following command to open shell into a cluster machine:
docker-compose exec hive-server bash
Copy the file to HDFS:
# hdfs dfs -mkdir /input
# hdfs dfs -put /host_shared/data/temp-data.csv /input/temp-data.csv
Connect the JDBC driver:
# /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
Finally create the table from the data in the file:
> CREATE EXTERNAL TABLE horlytemp(time STRING, temp INT) COMMENT 'temperature from csv file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE location '/input';
We can check if the data is accessible doing:
> select * from horlytemp;
We should see the list of temperatures.
Now lets execute a presto client:
cd presto_client
./presto.jar --server localhost:8080 --catalog hive --schema default
Finally from that shell we can query that table, for instance:
# select * from horlytemp where temp > 80;
We should see something like:
time | temp
---------------------+------
2018-05-15T17:00:00 | 81
2018-06-17T10:00:00 | 81
2018-06-17T11:00:00 | 84
2018-06-17T12:00:00 | 86
2018-06-17T13:00:00 | 86
Stop the docker containers by doing:
docker-compose down
- Hive-presto docker compose: https://github.com/big-data-europe/docker-hive
- Hadoop docker compose: https://github.com/big-data-europe/docker-hadoop