aanno / hadoop-docker


Learn Hadoop Using Docker

Architecture

This is intended for learning only and is not recommended for production. For production, use Hortonworks or Cloudera instead, or a managed cloud service: GCP offers Dataproc and AWS offers EMR (Elastic MapReduce).

Download binary

Kick-off Cluster

  1. Clone this repo into your project directory, then cd hadoop-docker
  2. On Linux/macOS, simply run ./run_cluster.sh
  3. On Windows, run docker build -t hadoop-base:3.3.6 . && docker-compose up

How to run a MapReduce job

  1. There is a ratings_breakdown.py Python file in the map_reduce directory; it can run either locally in plain Python or on Hadoop (a sketch of what such a job looks like follows this list)
  2. To run it locally, use python3 map_reduce/ratings_breakdown.py input/u.data
  3. To run it on Hadoop, use python3 map_reduce/ratings_breakdown.py -r hadoop --hadoop-streaming-jar /opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar input/u.data
  4. The path /opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar may differ on your machine or with another Hadoop version; locate it with find / -name "hadoop-streaming-${HADOOP_VERSION}.jar"
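
For orientation, here is a minimal sketch of what an mrjob job like ratings_breakdown.py typically looks like; the actual file in this repo may differ. It assumes u.data is the MovieLens 100k ratings file, tab-separated as user_id, movie_id, rating, timestamp:

```python
# Sketch of a ratings-breakdown mrjob job; the repo's actual
# map_reduce/ratings_breakdown.py may differ.
from mrjob.job import MRJob

class RatingsBreakdown(MRJob):
    def mapper(self, _, line):
        # u.data (MovieLens 100k) is tab-separated:
        # user_id, movie_id, rating, timestamp.
        user_id, movie_id, rating, timestamp = line.split('\t')
        yield rating, 1  # emit each rating value with a count of 1

    def reducer(self, rating, counts):
        # Sum how often each rating value occurs.
        yield rating, sum(counts)

if __name__ == '__main__':
    RatingsBreakdown.run()
```

The same class runs in both modes; the -r hadoop flag in step 3 only switches the mrjob runner.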

Running with the hadoop command

  1. Make sure the input directory exists in HDFS; if it does not, create it with hadoop fs -mkdir -p input
  2. Put the data from your local project into the HDFS input directory: hdfs dfs -put ./input/* input
  3. Then run the streaming job (a sketch of the mapper and reducer follows below):

hadoop jar /opt/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -file /hadoop-data/map_reduce/word_count/mapper.py -mapper "python3 mapper.py" \
  -file /hadoop-data/map_reduce/word_count/reducer.py -reducer "python3 reducer.py" \
  -input input/words.txt -output output_word_count
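
The mapper.py and reducer.py referenced in the command follow the Hadoop Streaming contract: read lines from stdin, write tab-separated key/value pairs to stdout. A typical word-count pair looks roughly like this (a sketch; the actual files in map_reduce/word_count/ may differ):

```python
#!/usr/bin/env python3
# mapper.py -- sketch of a Hadoop Streaming word-count mapper.
# Emits "word<TAB>1" for every word on every input line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sketch of the matching reducer. Hadoop Streaming sorts
# mapper output by key, so identical words arrive contiguously and a
# running total per word is sufficient.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pair can be tested without Hadoop at all: cat input/words.txt | python3 mapper.py | sort | python3 reducer.py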

Hadoop Configurations

  1. core-site.xml defaults and descriptions here
  2. hdfs-site.xml defaults and descriptions here
  3. mapred-site.xml defaults and descriptions here
  4. yarn-site.xml defaults and descriptions here
  5. A guide to calculating the YARN and MapReduce memory configuration is here (a rough sketch of the arithmetic follows this list)
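
As a rough illustration of the arithmetic such a guide performs, here is a sketch following the classic Hortonworks-style formula; the exact numbers depend on your hardware and workload, and the defaults below (2 GB reserved for the OS, 1 GB minimum container size) are assumptions, not settings from this repo:

```python
# Sketch of a YARN/MapReduce memory calculation (Hortonworks-style
# formula); tune the inputs and defaults for your own cluster.
def yarn_memory_settings(total_ram_gb, cores, disks,
                         reserved_gb=2, min_container_mb=1024):
    available_mb = (total_ram_gb - reserved_gb) * 1024
    # Container count is bounded by cores, disks, and available RAM.
    containers = int(min(2 * cores, 1.8 * disks,
                         available_mb / min_container_mb))
    container_mb = max(min_container_mb, available_mb // containers)
    return {
        "yarn.nodemanager.resource.memory-mb": containers * container_mb,
        "yarn.scheduler.minimum-allocation-mb": container_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * container_mb,
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.reduce.memory.mb": 2 * container_mb,
    }

# Example: a 16 GB node with 8 cores and 2 disks.
print(yarn_memory_settings(total_ram_gb=16, cores=8, disks=2))
```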

End-to-end steps to kick off Airflow, Spark Cluster and Hadoop Cluster

Start everything from run_cluster.sh, because the base image is created in that step.

To Do

  • Add Hive configuration files
    • hive-site.xml
    • beeline-log4j2.properties
    • hive-exec-log4j2.properties
    • hive-log4j2.properties
    • llap-daemon-log4j2.properties
    • ivysettings.xml
    • hive-env.sh
