Stefen-Taime / ETL-Data-Pipeline-RDBMS-TO-HDFS-using-Airflow-Apache-Sqoop-Spark-Postgres-and-Hive

This project moves data from a relational database management system (RDBMS) to the Hadoop Distributed File System (HDFS) using Airflow, Apache Sqoop, Spark, Postgres, and Hive.


To spin up the whole stack, run:

docker-compose up -d

Then you can access the Airflow webserver UI through port 8080 (typically http://localhost:8080).

Feel free to switch on the DAG named hands_on_test. Its start_date is set to days_ago(1) and it is scheduled to run daily.
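
For reference, here is a minimal sketch of what a DAG configured that way typically looks like. The operators, task names, and commands below are illustrative assumptions only; the actual hands_on_test definition lives in the project's ./airflow folder.

# Illustrative-only sketch of a DAG configured like hands_on_test:
# start_date=days_ago(1) and a daily schedule. Operators, task names,
# and commands are assumptions, not the project's actual code.
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="hands_on_test",
    start_date=days_ago(1),
    schedule_interval="@daily",
) as dag:
    # placeholder tasks for the sqoop import -> spark transform -> hive load flow
    sqoop_import = BashOperator(task_id="sqoop_import", bash_command="echo 'sqoop import ...'")
    spark_transform = BashOperator(task_id="spark_transform", bash_command="echo 'spark-submit ...'")
    hive_load = BashOperator(task_id="hive_load", bash_command="echo 'beeline -f ...'")

    sqoop_import >> spark_transform >> hive_load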

Once the pipeline has run to completion, you can verify the results on the following components.

HIVE

# show tables
docker exec hive-server beeline -u jdbc:hive2://localhost:10000/default -e "SHOW TABLES;"

# describe a table
# change <<TARGET TABLE>> to your table name, e.g. 'order_detail' or 'restaurant_detail'
docker exec hive-server beeline -u jdbc:hive2://localhost:10000/default -e "SHOW CREATE TABLE <<TARGET TABLE>>;"

# sample data
# change <<TARGET TABLE>> to your table name, e.g. 'order_detail' or 'restaurant_detail'
docker exec hive-server beeline -u jdbc:hive2://localhost:10000/default -e "SELECT * FROM <<TARGET TABLE>> LIMIT 5;"

# check the partitioned parquet files
docker exec hive-server hdfs dfs -ls /user/spark/transformed_order_detail
docker exec hive-server hdfs dfs -ls /user/spark/transformed_restaurant_detail

# the external table definitions (DDL) are in the ./airflow/scripts/hql scripts
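
The two HDFS paths above are written by the Spark transformation step as partitioned parquet. As a rough illustration only (the input path, format, and the partition column are assumptions, not taken from the project's actual Spark job), writing such a dataset with PySpark looks roughly like this:

# Rough PySpark sketch of producing a partitioned parquet output under
# /user/spark/. Input path, input format, and the partition column "dt"
# are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform_order_detail").getOrCreate()

# read the raw data landed by Sqoop (assumed path and format)
order_detail = spark.read.parquet("/user/spark/order_detail")

# ... transformation logic would go here ...
transformed = order_detail

(transformed.write
    .mode("overwrite")
    .partitionBy("dt")   # hypothetical partition column
    .parquet("/user/spark/transformed_order_detail"))

The Hive external tables queried above are then defined over these parquet locations (see the hql scripts).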

For the SQL requirement files, the resulting CSV files are written to ./sql_result once the DAG has completed.
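
If you want to eyeball those results quickly from Python, a small sketch (pandas is just one convenient option; any CSV reader works):

# Preview every CSV the DAG wrote into ./sql_result
import glob
import pandas as pd

for path in sorted(glob.glob("./sql_result/*.csv")):
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.head())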

After you finish testing, you can shut down the whole application with:

docker-compose down -v

Medium post: https://medium.com/@stefentaime/etl-data-pipeline-rdbms-to-hdfs-using-airflow-apache-sqoop-spark-postgres-and-hive-773f0e745537

Original GitHub repository: https://github.com/Pathairush/rdbms_to_hdfs_data_pipeline



Languages

Python 70.1%, Shell 29.9%