SUBHASH-S-M / Project_1-_Loading_Online_Event_Hits_using_Sqoop_to_Hive_via_Shell_Script

In this project I implemented a Hadoop pipeline that uses Sqoop for ingestion, Hive for summarisation and the warehouse logic, and MySQL as the database for storage and validation. The entire flow is automated with a shell script, and with the help of bash commands every step, successful or failed, is logged properly.


Data Warehousing Pipeline with SCD Type 01 Logic

Client Requirement

  • A CSV file containing customer purchase data arrives daily from the client

  • Each day's file contains new records and may also contain old records that need to be updated

  • The client requires SCD Type 01 logic in the warehouse, i.e. updated records overwrite the existing ones with no history retained

  • At the end of each day's processing, the data must be reconciled against the input

Data Ingestion

  • Data is first loaded into a MySQL database using the MySQL command-line client

  • After some pre-processing, the data is ingested into HDFS using Sqoop (a sketch of such an import is shown below)
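
A minimal sketch of what such a Sqoop import could look like; the database name, table name, and target directory here are assumptions for illustration and are not taken from the repository:

    # Hypothetical example: import the day's purchase data from MySQL into an HDFS staging directory
    sqoop import \
      --connect jdbc:mysql://localhost:3306/sales_db \
      --username "$DB_USER" --password "$DB_PASS" \
      --table customer_purchases \
      --target-dir /user/hive/staging/customer_purchases/$(date +%Y%m%d) \
      --fields-terminated-by ',' \
      -m 1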

Data Summarisation and Warehousing

  • Hive is used to manage the warehousing layer

  • Implemented SCD Type 01 logic, so updated records overwrite the existing ones (see the sketch after this list)

  • Implemented partitioning on year and month for fast retrieval
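
A minimal sketch of how SCD Type 01 could be expressed in Hive from the automation script, assuming a staging table loaded by the Sqoop step and a warehouse table partitioned by year and month; all table and column names (including customer_id as the business key) are illustrative assumptions:

    # Hypothetical sketch of an SCD Type 1 merge: rows from today's staging data replace
    # matching warehouse rows, and warehouse rows not re-sent today are carried forward.
    hive -e "
      SET hive.exec.dynamic.partition=true;
      SET hive.exec.dynamic.partition.mode=nonstrict;

      INSERT OVERWRITE TABLE purchases_warehouse PARTITION (yr, mth)
      SELECT customer_id, product_id, amount, purchase_date,
             year(purchase_date) AS yr, month(purchase_date) AS mth
      FROM (
          -- today's new and updated records take precedence
          SELECT customer_id, product_id, amount, purchase_date FROM purchases_staging
          UNION ALL
          -- keep existing warehouse rows that were not re-sent today
          SELECT w.customer_id, w.product_id, w.amount, w.purchase_date
          FROM purchases_warehouse w
          LEFT JOIN purchases_staging s ON w.customer_id = s.customer_id
          WHERE s.customer_id IS NULL
      ) merged;
    "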

Validation

  • Once the pipeline completes, the record counts in the warehouse are reconciled against the input records (see the sketch below)

  • After every successful operation or failure, a log entry is written, which can be used for reporting and analysis
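
A minimal sketch of the reconciliation and logging step, assuming the same illustrative table names as above and a hypothetical log location:

    # Hypothetical sketch: compare the MySQL source count with the Hive warehouse count
    # and record the outcome in the run log.
    LOG_FILE="/var/log/pipeline/run_$(date +%Y%m%d).log"

    src_count=$(mysql -N -u "$DB_USER" -p"$DB_PASS" sales_db \
                  -e "SELECT COUNT(*) FROM customer_purchases;")
    hive_count=$(hive -S -e "SELECT COUNT(*) FROM purchases_warehouse;")

    if [ "$src_count" -eq "$hive_count" ]; then
        echo "$(date '+%F %T') RECONCILIATION OK: $src_count rows" >> "$LOG_FILE"
    else
        echo "$(date '+%F %T') RECONCILIATION FAILED: source=$src_count hive=$hive_count" >> "$LOG_FILE"
        exit 1
    fi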

Script

  • The entire warehousing solution is automated using bash scripts

  • All credentials, output directories, and DBMS details are made dynamic via a parameter file and a credentials file (a sketch is shown below)
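
A minimal sketch of how the parameter and credential files could be wired into the script; file names, paths, and variable names are assumptions for illustration:

    # Hypothetical sketch: environment-specific values are sourced at the start of the run
    # so nothing is hard-coded in the pipeline script.
    source "$PROJECT_HOME/env/params.env"   # e.g. DB_HOST, DB_NAME, TARGET_DIR, LOG_DIR
    source "$PROJECT_HOME/env/creds.env"    # e.g. DB_USER, DB_PASS (kept with restricted permissions)

    sqoop import \
      --connect "jdbc:mysql://${DB_HOST}:3306/${DB_NAME}" \
      --username "$DB_USER" --password "$DB_PASS" \
      --table customer_purchases \
      --target-dir "${TARGET_DIR}/$(date +%Y%m%d)"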

Execution

Project Work Flow

[Project workflow diagram]

Datasets Folder here

  • This is where the daily data lands; it also contains a few sub-folders used for testing during development

Scripts here

  • The entire automation lives here; you can find the full logic as well as the intermediate files generated during a run

Env here

  • Support files required by the scripts are kept under this directory

Reference File here

  • Under this directory you can find the column descriptions at the schema level

