SUBHASH-S-M / Project_1-_Loading_Online_Event_Hits_using_Sqoop_to_Hive_via_Shell_Script

In this project I implemented a Hadoop pipeline that uses Sqoop for ingestion, Hive for summarisation and the warehouse logic, and MySQL as the database for storage and validation. The entire flow is automated with a shell script, and with the help of bash commands every step, successful or failed, is logged properly.


Data Warehousing Pipeline with SCD Type 01 Logic

Client Requirement

  • A CSV file containing customer purchase data arrives daily from the client

  • Each day's file contains new records and may also contain old records that need to be updated

  • The client requires SCD Type 01 logic in the warehouse, i.e. updated records overwrite the existing ones with no history retained

  • At the end of each day's processing, the data must be reconciled against the input

Data Ingestion

  • Data is first loaded into a MySQL database using the MySQL command-line client

  • After some pre-processing, the data is ingested into HDFS using Sqoop (a sketch of such an import is shown below)
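
A minimal sketch of what such a Sqoop import could look like; the database name, table name, and target directory here are assumptions for illustration and are not taken from the repository:

    # Hypothetical example: import the day's purchase data from MySQL into an HDFS staging directory
    sqoop import \
      --connect jdbc:mysql://localhost:3306/sales_db \
      --username "$DB_USER" --password "$DB_PASS" \
      --table customer_purchases \
      --target-dir /user/hive/staging/customer_purchases/$(date +%Y%m%d) \
      --fields-terminated-by ',' \
      -m 1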

Data Summarisation and Warehousing

  • Hive is used to manage the warehousing layer

  • Implemented SCD Type 01 logic, so updated records overwrite the existing ones (see the sketch after this list)

  • Implemented partitioning on year and month for fast retrieval
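
A minimal sketch of how SCD Type 01 could be expressed in Hive from the automation script, assuming a staging table loaded by the Sqoop step and a warehouse table partitioned by year and month; all table and column names (including customer_id as the business key) are illustrative assumptions:

    # Hypothetical sketch of an SCD Type 1 merge: rows from today's staging data replace
    # matching warehouse rows, and warehouse rows not re-sent today are carried forward.
    hive -e "
      SET hive.exec.dynamic.partition=true;
      SET hive.exec.dynamic.partition.mode=nonstrict;

      INSERT OVERWRITE TABLE purchases_warehouse PARTITION (yr, mth)
      SELECT customer_id, product_id, amount, purchase_date,
             year(purchase_date) AS yr, month(purchase_date) AS mth
      FROM (
          -- today's new and updated records take precedence
          SELECT customer_id, product_id, amount, purchase_date FROM purchases_staging
          UNION ALL
          -- keep existing warehouse rows that were not re-sent today
          SELECT w.customer_id, w.product_id, w.amount, w.purchase_date
          FROM purchases_warehouse w
          LEFT JOIN purchases_staging s ON w.customer_id = s.customer_id
          WHERE s.customer_id IS NULL
      ) merged;
    "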

Validation

  • Once the pipeline completes, the record counts in the warehouse are reconciled against the input records (see the sketch below)

  • After every successful operation or failure, a log entry is written, which can be used for reporting and analysis
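
A minimal sketch of the reconciliation and logging step, assuming the same illustrative table names as above and a hypothetical log location:

    # Hypothetical sketch: compare the MySQL source count with the Hive warehouse count
    # and record the outcome in the run log.
    LOG_FILE="/var/log/pipeline/run_$(date +%Y%m%d).log"

    src_count=$(mysql -N -u "$DB_USER" -p"$DB_PASS" sales_db \
                  -e "SELECT COUNT(*) FROM customer_purchases;")
    hive_count=$(hive -S -e "SELECT COUNT(*) FROM purchases_warehouse;")

    if [ "$src_count" -eq "$hive_count" ]; then
        echo "$(date '+%F %T') RECONCILIATION OK: $src_count rows" >> "$LOG_FILE"
    else
        echo "$(date '+%F %T') RECONCILIATION FAILED: source=$src_count hive=$hive_count" >> "$LOG_FILE"
        exit 1
    fi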

Script

  • The entire warehousing solution is automated using bash scripts

  • All credentials, output directories, and DBMS details are made dynamic via a parameter file and a credentials file (a sketch is shown below)
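
A minimal sketch of how the parameter and credential files could be wired into the script; file names, paths, and variable names are assumptions for illustration:

    # Hypothetical sketch: environment-specific values are sourced at the start of the run
    # so nothing is hard-coded in the pipeline script.
    source "$PROJECT_HOME/env/params.env"   # e.g. DB_HOST, DB_NAME, TARGET_DIR, LOG_DIR
    source "$PROJECT_HOME/env/creds.env"    # e.g. DB_USER, DB_PASS (kept with restricted permissions)

    sqoop import \
      --connect "jdbc:mysql://${DB_HOST}:3306/${DB_NAME}" \
      --username "$DB_USER" --password "$DB_PASS" \
      --table customer_purchases \
      --target-dir "${TARGET_DIR}/$(date +%Y%m%d)"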

Execution

Project Work Flow

[Project workflow diagram]

Datasets Folder here

  • This is where the daily data lands; it also contains a few sub-folders used for testing during development

Scripts here

  • The entire automation lives here; you can find the full logic as well as the intermediate files generated during a run

Env here

  • Support files required by the scripts are kept under this directory

Reference File here

  • Under this directory you can find the column descriptions at the schema level

