💳 Data Warehouse Credit Card Applicant 💳
using Pentaho Data Integration (PDI)/Kettle and Microsoft SQL Server 18
.: Dataset taken from Kaggle :.
Table of Contents:
- About Project
- Objectives
- Data Set Description
- Connection Configuration
- ETL Process
- Star Schema
- Before & After ETL Comparison
About Project
This repository contains the files used to build a data warehouse for credit card applicants:
- ETL files built with Pentaho Data Integration (PDI)
- SQL code to create the OLAP database
- SQL code to select data from the OLTP database
- SQL code to perform random testing
The dataset is provided by Seanny (rikdifos).
This project also creates the following tables using PDI and Microsoft SQL Server 18:
- 2 dimension tables (Applicant_Dimension and CreditRecord_Dimension),
- a time dimension (Time_Dimension), and
- 1 fact table (CreditCard_Fact).
Objectives
- Perform ETL using PDI for both datasets.
- Create time dimension using PDI.
- Create fact table using PDI.
🧾 Data Set Description
- The dataset description can be seen here.
Connection Configuration
username: sa
pass: qwer
OLTP Configuration
OLAP Configuration
ETL Process
👨‍💼 Application Record
▶ Table Input Configuration
- Import the application table from OLTP.
▶ Sort Rows Configuration
- Sort the data by applicant ID.
▶ Unique Rows Configuration
- Remove rows with duplicate applicant IDs.
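The sort-then-deduplicate pair above can be sketched in Python as follows. PDI's Unique Rows step only compares consecutive rows, which is why the Sort Rows step comes first. This is a minimal sketch, not the actual transformation: the column name `ID` and the sample rows are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_deduplicate(rows, key="ID"):
    """Sort rows by `key`, then keep the first row of each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(key))  # stable sort, like Sort Rows
    # groupby only merges adjacent equal keys, mirroring Unique Rows
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Hypothetical sample rows; "ID" stands in for the applicant ID column.
rows = [
    {"ID": 3, "gender": "F"},
    {"ID": 1, "gender": "M"},
    {"ID": 3, "gender": "M"},
]
deduplicated = sort_and_deduplicate(rows)
```

Because the sort is stable, the row kept for a duplicated ID is the one that appeared first in the input.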
▶ Replace in String Configuration
- Replace some values with more readable ones.
▶ Add Constants Configuration
- Add new columns holding a constant date (October 1, 2021).
▶ Calculator Configuration
- Calculate each applicant's date of birth (DOB) and employment start date relative to the current date (October 1, 2021).
- Calculate each applicant's age based on the current year (2021).
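As a rough illustration of the two calculations above, the sketch below assumes the source columns store day offsets counted backwards from the reference date (negative integers, where -1 means the day before). That storage format and the function names are assumptions; the README itself only states that dates are derived from October 1, 2021 and that age is based on the year 2021.

```python
from datetime import date, timedelta

REFERENCE_DATE = date(2021, 10, 1)  # the constant date added in the previous step


def offset_to_date(days_offset, reference=REFERENCE_DATE):
    """Turn a day offset (assumed negative for past dates) into a calendar date."""
    return reference + timedelta(days=days_offset)


def age_in_years(dob, reference_year=REFERENCE_DATE.year):
    """Age as a plain year difference, matching 'based on current year (2021)'."""
    return reference_year - dob.year


dob = offset_to_date(-365)   # one year before the reference date
age = age_in_years(date(1990, 12, 31))
```

A year difference ignores whether the birthday has already passed in 2021, so it can be one year higher than the exact age.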
▶ Filter Rows Configuration
- Filter out applicant rows that contain null values.
- Filter out applicants who are less than 21 years old.
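The two filter conditions can be expressed as a single predicate. This sketch assumes that rows failing either condition are dropped and that an applicant aged exactly 21 is kept; the `age` key is a hypothetical column name.

```python
def keep_applicant(row, min_age=21):
    """Return True when a row has no null values and the applicant is old enough."""
    if any(value is None for value in row.values()):
        return False  # at least one column is null
    return row["age"] >= min_age


# Hypothetical sample rows.
applicants = [
    {"ID": 1, "age": 35, "income": 50000},
    {"ID": 2, "age": 20, "income": 30000},   # too young
    {"ID": 3, "age": 40, "income": None},    # null value
    {"ID": 4, "age": 21, "income": 42000},
]
kept = [row for row in applicants if keep_applicant(row)]
```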
▶ Add Sequence Configuration
- Add Index Applicant (to replace ID as the primary key).
▶ Select Values Configuration
- Select the columns that will be loaded into OLAP.
▶ Table Output Configuration
- Export the application table to OLAP (Application Dimension).
Credit Record
▶ Table Input Configuration
- Import the credit record table from OLTP.
▶ Sort Rows Configuration
- Sort the data by applicant ID.
▶ Add Constants Configuration
- Add new columns holding a constant date (October 1, 2021).
▶ Calculator Configuration
- Calculate the month of each loan payment relative to the current date (October 1, 2021).
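One way to derive the payment month is sketched below. It assumes the credit record stores a month offset relative to the reference month (0 for October 2021, -1 for September 2021, and so on); that format and the function name are assumptions for illustration only.

```python
from datetime import date

REFERENCE_DATE = date(2021, 10, 1)  # the constant date added in the previous step


def payment_month(months_offset, reference=REFERENCE_DATE):
    """Return the first day of the month `months_offset` months from the reference."""
    # Count months since year 0 so that year boundaries are handled by divmod.
    total = reference.year * 12 + (reference.month - 1) + months_offset
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, 1)


previous_month = payment_month(-1)
```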
▶ Add Sequence Configuration
- Add CreditRecord_ID (to replace Applicant ID as the primary key).
▶ Select Values Configuration
- Select the columns that will be loaded into OLAP.
▶ Table Output Configuration
- Export the credit record table to OLAP (Credit Record Dimension).
Time Dimension
▶ Generate Rows Configuration
- Generate a column with a specific start date (January 1, 2016).
▶ Add Sequence Configuration
- Add a sequence column running from 1 to 99999.
▶ Calculator Configuration
- Add the sequence value to the start date to produce each following date (e.g. January 2, 2016; January 3, 2016).
- Create new columns (Day, Month, and Year).
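The row generation above can be approximated with a small generator. It assumes the sequence value is added to the start date as a number of days, so sequence 1 yields January 2, 2016, in line with the example dates given. The dictionary keys are illustrative names, not necessarily the ones used in the transformation.

```python
from datetime import date, timedelta

START_DATE = date(2016, 1, 1)  # the date produced by Generate Rows


def time_rows(count, start=START_DATE):
    """Yield one row per sequence value, with the date split into parts."""
    for seq in range(1, count + 1):
        current = start + timedelta(days=seq)
        yield {
            "Sequence": seq,
            "Date": current,
            "Day": current.day,
            "Month": current.month,
            "Year": current.year,
        }


first_rows = list(time_rows(3))
```

A generator keeps memory use flat even when `count` is as large as 99999.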
▶ Data Grid Configuration
- Create a grid of month numbers and month names.
▶ Stream Lookup Configuration
- Match 'Month' from the Calculator step against 'No_Month' from the Data Grid step to attach the month name.
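The lookup amounts to a dictionary keyed on the month number. In this sketch the grid contents mirror what a Data Grid with twelve rows would hold, and `Month_Name` is a hypothetical name for the column that receives the looked-up value.

```python
# (No_Month, month name) pairs, as they might be typed into the Data Grid.
MONTH_GRID = [
    (1, "January"), (2, "February"), (3, "March"), (4, "April"),
    (5, "May"), (6, "June"), (7, "July"), (8, "August"),
    (9, "September"), (10, "October"), (11, "November"), (12, "December"),
]


def add_month_name(rows, grid=MONTH_GRID):
    """Attach the month name whose No_Month equals each row's 'Month' value."""
    lookup = dict(grid)
    # Rows with no matching month number get None.
    return [{**row, "Month_Name": lookup.get(row["Month"])} for row in rows]


named = add_month_name([{"Month": 1}, {"Month": 12}, {"Month": 13}])
```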
▶ Modified JavaScript Value Configuration
- Create the time ID using JavaScript code.
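The README does not show the format of the time ID, so the sketch below uses a common convention for date keys, an integer of the form YYYYMMDD, purely as an assumption. It is written in Python for consistency with the other sketches; the actual step uses JavaScript inside PDI.

```python
from datetime import date


def time_id(day):
    """Build an integer key of the form YYYYMMDD (an assumed format)."""
    return day.year * 10000 + day.month * 100 + day.day


example_id = time_id(date(2016, 1, 2))
```

Keys in this form sort in the same order as the dates they represent.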
▶ Select Values Configuration
- Select the columns that will be loaded into OLAP.
▶ Table Output Configuration
- Export the time dimension to OLAP.
💳 Credit Card Fact
▶ Table Input (Credit Record) Configuration
- Import the Credit Record dimension from OLAP.
▶ Table Input (Application) Configuration
- Import the Application dimension from OLAP.
▶ Stream Lookup 1 Configuration
- Join both dimension tables on applicant ID.
▶ Filter Rows Configuration
- Filter out applicant IDs that do not exist in both tables.
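Taken together, the lookup and the filter behave like an inner join on applicant ID. The sketch below treats the credit record rows as the main stream and the applicant rows as the lookup stream; that choice, the key name `ID`, and the sample data are all assumptions.

```python
def join_on_applicant(credit_rows, applicant_rows, key="ID"):
    """Keep credit rows whose applicant ID also exists in the applicant rows."""
    applicants = {row[key]: row for row in applicant_rows}
    joined = []
    for credit in credit_rows:
        applicant = applicants.get(credit[key])
        if applicant is None:
            continue  # no matching applicant, so the row is filtered out
        joined.append({**applicant, **credit})
    return joined


# Hypothetical sample rows.
credit_rows = [{"ID": 1, "Status": "C"}, {"ID": 9, "Status": "0"}]
applicant_rows = [{"ID": 1, "age": 35}, {"ID": 2, "age": 28}]
fact_candidates = join_on_applicant(credit_rows, applicant_rows)
```

Applicants without any credit record never reach the output either, because only credit rows drive the loop.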
▶ Table Input (Time) Configuration
- Import the Time dimension from OLAP.
▶ Stream Lookup 2 Configuration
- Join the combined application and credit record data with the time dimension.
▶ Replace in String 1 Configuration
- Replace C, X, 0 with 'Good Debt' (C: loan for that month is already paid; X: no loan for that month; 0: loan is 1 to 29 days overdue).
- Replace 1, 2, 3, 4, 5 with 'Bad Debt' (1: loan is 30 to 59 days overdue; 2: loan is 60 to 89 days overdue; 3: loan is 90 to 119 days overdue; 4: loan is 120 to 149 days overdue; 5: loan is more than 150 days overdue).
▶ Calculator Configuration
- Create 2 copies of the 'Status' column ('Good_Debt' and 'Bad_Debt').
▶ Replace in String Configuration
- Good_Debt: 'Good Debt' is changed to 1, while 'Bad Debt' is changed to 0.
- Bad_Debt: 'Good Debt' is changed to 0, while 'Bad Debt' is changed to 1.
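The status mapping and the two flag columns can be condensed into one function. The status codes and the column names `Status`, `Good_Debt`, and `Bad_Debt` come from the steps above; raising an error for an unknown code is an assumption of this sketch, since the README does not say how unexpected values are handled.

```python
GOOD_CODES = {"C", "X", "0"}            # paid, no loan, 1 to 29 days overdue
BAD_CODES = {"1", "2", "3", "4", "5"}   # 30 or more days overdue


def debt_flags(status_code):
    """Map a raw status code to its label and the two 0/1 flag columns."""
    if status_code in GOOD_CODES:
        return {"Status": "Good Debt", "Good_Debt": 1, "Bad_Debt": 0}
    if status_code in BAD_CODES:
        return {"Status": "Bad Debt", "Good_Debt": 0, "Bad_Debt": 1}
    raise ValueError(f"unknown status code: {status_code!r}")


paid = debt_flags("C")
overdue = debt_flags("3")
```

With one flag per outcome, summing `Good_Debt` and `Bad_Debt` in the fact table gives counts of each without a conditional expression.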
▶ Get System Info Configuration
- Record the date and time at which the ETL was performed.
▶ Select Values Configuration
- Select the columns that will be loaded into OLAP.
▶ Table Output Configuration
- Export the fact table to OLAP.
⭐ Star Schema
Before & After ETL Comparison
- This section shows the data structure before and after ETL.
👨‍💼 Application Record
Credit Record
Time Dimension
💳 Credit Card Fact
Support me!
If you find this project useful, please ⭐ this repository!
More about myself: here