adisve / hadoop-spark-cluster

A Spark/Hadoop-Docker Cluster template for working with Big Data

Hadoop Spark Cluster - Analyzing big data

Project Description

This project aims to develop a versatile data pipeline capable of processing large datasets, using Docker containers together with Hadoop and Spark.

About

For our practical implementation, we selected the May 2015 Reddit Comments dataset available on Kaggle. However, the pipeline is flexible enough to accommodate other datasets: adjust the NAMENODE_DATA_DIR variable in ./hadoop-spark-cluster/Makefile and set the namenode HDFS URL in scripts/spark/config.json.
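
As a point of reference, a minimal scripts/spark/config.json could look like the snippet below. The key name, host, and port are assumptions for illustration only, not necessarily what this repository's scripts actually read; check the file in the repo for the real keys.

    {
        "namenode_url": "hdfs://namenode:9000"
    }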

The pipeline leverages Apache Spark for data processing and HDFS on a Hadoop cluster for data storage; each node runs in its own Docker container, keeping data handling efficient and isolated.

The pipeline is built around an output.csv file, located in the /data directory at the project's root, which is uploaded to the virtual HDFS container as Parquet parts. Should you opt to use the SQLite database from the provided link, a conversion script, scripts/utils/csv_converter.py, is available to convert the data from SQLite to CSV format before running the initialization script.
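
For orientation, a SQLite-to-CSV conversion of this kind can be done with Python's standard library alone. The sketch below is illustrative only; the file paths and table name are assumptions, so refer to scripts/utils/csv_converter.py for the actual behavior.

    # Illustrative SQLite -> CSV dump; paths and table name are hypothetical.
    import csv
    import sqlite3

    conn = sqlite3.connect("data/database.sqlite")    # assumed SQLite input file
    cursor = conn.execute("SELECT * FROM May2015")    # assumed table name

    with open("data/output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)                                  # data rows

    conn.close()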

Prerequisites

Important

  • The output.csv file must be present in the /data directory before running the initialization script; it is not included in the repository due to its size.
  • A schema.json file under the /scripts/spark directory is required for the Spark job to run. It should describe the schema of the output.csv file in JSON, using the types defined in the PySpark StructField documentation (see the sketch after this list).
  • Pipenv (for installing dependencies)
  • Docker
  • Docker Compose
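
The exact schema.json layout the Spark job expects is not spelled out here; one common convention is PySpark's own JSON representation of a StructType, which can be loaded with StructType.fromJson as sketched below. The field names are assumptions based on the Reddit comments dataset, not the project's actual schema.

    # Illustrative only: load a schema.json written in StructType's JSON format.
    import json
    from pyspark.sql.types import StructType

    # Example file contents (field names are hypothetical):
    # {"type": "struct", "fields": [
    #   {"name": "id",    "type": "string",  "nullable": true, "metadata": {}},
    #   {"name": "body",  "type": "string",  "nullable": true, "metadata": {}},
    #   {"name": "score", "type": "integer", "nullable": true, "metadata": {}}]}
    with open("scripts/spark/schema.json") as f:
        schema = StructType.fromJson(json.load(f))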

How to run

  1. Create a Python virtual environment

    pipenv install
  2. Activate the virtual environment

    pipenv shell
  3. Run the 'init.sh' script, which uploads the output.csv file to HDFS as Parquet parts (a conceptual sketch follows these steps)

    chmod +x init.sh
    ./init.sh
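
Conceptually, the CSV-to-Parquet upload performed by the pipeline boils down to a small Spark job like the sketch below. The paths, application name, and HDFS URL are assumptions (the real namenode URL lives in scripts/spark/config.json), so treat this as an outline rather than the repository's actual job.

    # Illustrative CSV -> Parquet upload; not the repository's actual Spark job.
    import json
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # schema.json is required (see Prerequisites) and describes output.csv's columns.
    with open("scripts/spark/schema.json") as f:
        schema = StructType.fromJson(json.load(f))

    df = spark.read.csv("data/output.csv", header=True, schema=schema)

    # Writing to an HDFS URL yields the "Parquet parts" mentioned above.
    df.write.mode("overwrite").parquet("hdfs://namenode:9000/data/reddit_comments.parquet")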

Authors
