bigdata-docker

Run a Hadoop cluster within Docker containers.

Install Docker (Ubuntu):

$ sudo apt-get remove docker docker-engine docker.io
$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install docker-ce
$ sudo docker run hello-world

Install Docker Compose

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
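
As a quick sanity check, print the installed version:

$ docker-compose --version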

To use a GPU

Install NVIDIA Docker 2 (the NVIDIA package repository must already be configured)

$ sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit

Update /etc/docker/daemon.json

From:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

To:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Restart Docker

$ sudo service docker restart
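
As a quick check that the NVIDIA runtime is active, run nvidia-smi inside a CUDA container. This is only a sketch: the nvidia/cuda:11.0-base image tag is an example, so pick one that matches your driver.

$ sudo docker run --rm nvidia/cuda:11.0-base nvidia-smi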

Build images

$ docker-compose build --parallel

Start the containers via Compose

$ docker-compose up -d
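
To verify that everything came up, list the services and tail the logs of one of them. The hadoop-master name is taken from the exec commands used later in this README; adjust it to match the repository's docker-compose.yml.

$ docker-compose ps
$ docker-compose logs -f hadoop-master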

Applications

Application         URL
Hadoop              http://localhost:9870
Hadoop Cluster      http://localhost:8088
Hadoop HDFS         hdfs://localhost:9000
Hadoop WEBHDFS      http://localhost:14000/webhdfs/v1
Hive Server2        http://localhost:10000
Hue                 http://localhost:8888 (username: hue, password: secret)
Spark Master UI     http://localhost:4080
Spark Jobs          http://localhost:4040
Livy                http://localhost:8998
Jupyter notebook    http://localhost:8899
AirFlow             http://localhost:8080 (username: airflow, password: airflow)
Flower              http://localhost:8555
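
The WEBHDFS endpoint can be exercised directly with curl. The LISTSTATUS operation below is part of the standard WebHDFS REST API; passing user.name=hue is an assumption for this setup.

$ curl -s "http://localhost:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hue"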

Tutorials

HDFS

Access the Hadoop Namenode container

docker exec -it hadoop-master bash

List root content

hadoop fs -ls /

Create a directory structure

hadoop fs -mkdir /dados
hadoop fs -ls /
hadoop fs -ls /dados
hadoop fs -mkdir /dados/bigdata
hadoop fs -ls /dados

Test the deletion of a directory

hadoop fs -rm -r /dados/bigdata
hadoop fs -ls /dados

Add an external file to the cluster

cd /root
ls
hadoop fs -mkdir /dados/bigdata
hadoop fs -put /var/log/alternatives.log /dados/bigdata
hadoop fs -ls /dados/bigdata

Copy files

hadoop fs -ls /dados/bigdata
hadoop fs -cp /dados/bigdata/alternatives.log /dados/bigdata/alternatives2.log
hadoop fs -ls /dados/bigdata

Display the contents of a file

hadoop fs -ls /dados/bigdata
hadoop fs -cat /dados/bigdata/alternatives.log
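
To inspect sizes rather than contents, the standard -du option reports per-file usage:

hadoop fs -du -h /dados/bigdata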

Create a home directory for the HUE user

hadoop fs -mkdir /user/hue
hadoop fs -ls /user/hue
hadoop fs -chmod 777 /user/hue
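
A stricter alternative to mode 777 is to give ownership of the directory to the hue user, assuming hue is the user Hue authenticates as in this setup:

hadoop fs -chown hue:hue /user/hue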

Hive

Access the Hadoop Namenode container

docker exec -it hadoop-master bash

Run Hive Shell

hive

List databases

> show databases;

Access 'default' Database

> use default;

List database tables

> show tables;
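
You can also connect to the Hive Server2 endpoint listed above with beeline instead of the legacy hive shell. A minimal sketch, assuming the hue user; adjust the user and host to your configuration:

beeline -u jdbc:hive2://localhost:10000 -n hue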

Spark

Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv

Data ingestion into HDFS

# Access the Hadoop Namenode container
docker exec -it hadoop-master bash

# Download ENEM datasets: http://inep.gov.br/microdados

# Create a spark folder in HDFS
hadoop fs -mkdir /user/spark/

# Ingest the data into HDFS
hadoop fs -put MICRODADOS_ENEM_2018.csv /user/spark/
hadoop fs -put MICRODADOS_ENEM_2017.csv /user/spark/
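
# Verify that the files landed in HDFS
hadoop fs -ls /user/spark/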

Access the Spark master node container

docker exec -it spark-master bash

Access Spark shell

spark-shell

Load ENEM 2018 data from HDFS

val df = spark.read.format("csv").option("sep", ";").option("inferSchema", "true").option("header", "true").load("hdfs://hadoop-master:9000/user/spark/MICRODADOS_ENEM_2018.csv")

Show dataframe schema

df.printSchema()

Show how many visually impaired students participated in the ENEM test in 2018.

df.groupBy("IN_CEGUEIRA").count().show()

Show how many students participated in the ENEM test in 2018 grouped by age.

df.groupBy("NU_IDADE").count().sort(asc("NU_IDADE")).show(100, false)

Kafka

Connect to Kafka Broker 1

docker exec -it kafka-broker1 bash

Create topic

kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic test

List topics

kafka-topics.sh --zookeeper zookeeper:2181 --list

Run Producer on Kafka Broker 1

kafka-console-producer.sh --bootstrap-server kafka-broker1:9091 --topic test

Enter data

>Hello

Connect to Kafka Broker 2

docker exec -it kafka-broker2 bash

Run Consumer on Kafka Broker 2

kafka-console-consumer.sh --bootstrap-server kafka-broker1:9091 --from-beginning --topic test
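
To inspect the partition and replica assignments of the topic, use the standard --describe option:

kafka-topics.sh --zookeeper zookeeper:2181 --describe --topic test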

Delete topic

kafka-topics.sh --zookeeper zookeeper:2181 --delete --topic test
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
