vijaysuryaw / HiveDockerSetup

Docker multi-container environment with Hadoop, Spark and Hive

This is it: a Docker multi-container environment with Hadoop (HDFS), Spark and Hive, but without the large memory requirements of a Cloudera sandbox. (On my Windows 10 laptop with WSL2 it seems to consume a mere 3 GB.)

The only thing lacking is that the Hive server doesn't start automatically. To be added when I understand how to do that in docker-compose.

Quick Start

To deploy the HDFS-Spark-Hive cluster, run:

  docker-compose up
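The later steps address two containers by name, namenode and hive-server, so it can help to confirm both are up before going on. What follows is a minimal sketch, assuming the compose file assigns exactly those container names; require_containers is a hypothetical helper, not part of this repo.

```shell
# Check that every named container appears in the list of running containers.
# Returns 0 when all are running, 1 otherwise (missing names go to stderr).
require_containers() {
  local running name missing=0
  # One container name per line, running containers only.
  running=$(docker ps --format '{{.Names}}')
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string.
    if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
      echo "not running: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run it from the host as `require_containers namenode hive-server` once `docker-compose up` has settled.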

Important note regarding Docker Desktop

Since Docker Desktop turned “Expose daemon on tcp://localhost:2375 without TLS” off by default there have been all kinds of connection problems running the complete docker-compose. Turning this option on again (Settings > General > Expose daemon on tcp://localhost:2375 without TLS) makes it all work. I’m still looking for a more secure solution to this.

Quick Start HDFS

Copy breweries.csv to the namenode.

docker cp breweries.csv namenode:breweries.csv

Open a bash shell on the namenode container.

docker exec -it namenode bash

Create an HDFS directory /data/openbeer/breweries.

hdfs dfs -mkdir -p /data/openbeer/breweries

Copy breweries.csv to HDFS:

hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
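A quick way to confirm the file landed in HDFS is to test for the path and print its first few lines. This is a sketch to run inside the namenode container, where the hdfs command is available; verify_breweries_upload is a hypothetical helper name.

```shell
# Confirm breweries.csv exists in HDFS and show its first lines.
verify_breweries_upload() {
  local path=/data/openbeer/breweries/breweries.csv
  # "hdfs dfs -test -e" exits 0 when the path exists.
  if ! hdfs dfs -test -e "$path"; then
    echo "missing in HDFS: $path" >&2
    return 1
  fi
  # Print a few rows to eyeball the delimiter and the column order.
  hdfs dfs -cat "$path" | head -n 5
}
```

Calling `verify_breweries_upload` should print the first rows of the CSV, or report the missing path on stderr.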

Quick Start Hive

Open another PowerShell terminal and run the command below:

  docker exec -it hive-server bash

Start the hiveserver2:

  hiveserver2
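Running hiveserver2 like this keeps the terminal occupied. As a possible alternative, the server can be started from the host in a detached exec session. This is only a sketch: it assumes the container is named hive-server as above and that hiveserver2 is on the PATH for non-interactive sessions, which I haven't verified for this image. start_hiveserver2 is a hypothetical helper name.

```shell
# Start HiveServer2 inside the hive-server container without attaching to it.
# "docker exec -d" runs the command in the background of the container.
start_hiveserver2() {
  docker exec -d hive-server hiveserver2
}
```

Call `start_hiveserver2` from the host after the containers are up. The server still needs some time before it accepts connections.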

Maybe do a little check that something is listening on port 10000 now:

netstat -anp | grep 10000
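HiveServer2 can take a while to come up, so the netstat check may show nothing at first. Here is a small polling sketch that waits until the port accepts connections. It assumes bash is installed where it runs (the probe relies on bash's /dev/tcp redirection); wait_for_port is a hypothetical helper name.

```shell
# Poll until host:port accepts a TCP connection, or give up after N tries.
# Usage: wait_for_port HOST PORT [TRIES]   (one try per second, default 30)
wait_for_port() {
  local host=$1 port=$2 tries=${3:-30} i=0
  while [ "$i" -lt "$tries" ]; do
    # bash opens a socket for /dev/tcp/HOST/PORT; a refused connection
    # makes the inner bash exit non-zero.
    if bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Inside the hive-server container, `wait_for_port localhost 10000 60 && echo ready` would wait up to about a minute.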

Okay. Beeline is the command-line interface for Hive. Let's connect to hiveserver2 now.

beeline -u jdbc:hive2://localhost:10000 -n root

Didn't expect to encounter scott/tiger again after my Oracle days. But there you have it. Definitely not a good idea to keep that user on production.

Not a lot of databases here yet.

show databases;

Let's change that.

create database openbeer;
use openbeer;

And let's create a table.

CREATE EXTERNAL TABLE IF NOT EXISTS breweries(
    NUM INT,
    NAME CHAR(100),
    CITY CHAR(100),
    STATE CHAR(100),
    ID INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/openbeer/breweries';

And have a little select statement going.

  select * from breweries;
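The same kind of check can be run without an interactive session by passing a statement to Beeline with -e. This is a sketch that reuses the connection settings from above (localhost:10000, user root); hive_query is a hypothetical helper name.

```shell
# Run a single HiveQL statement through Beeline and exit.
# Assumes HiveServer2 is listening on localhost:10000 and accepts user root.
hive_query() {
  beeline -u jdbc:hive2://localhost:10000 -n root -e "$1"
}
```

For example, `hive_query 'SELECT COUNT(*) FROM openbeer.breweries;'` from a shell in the hive-server container prints the row count of the table.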

There you go: your private Hive server to play with.
