As a Data Scientist, I was hired by a company that services vehicles in many locations. It is a huge company with millions of customers, operating across many different countries, and this time it has produced a revenue forecast for the year 2016. I was given a large CSV dataset of roughly 60 MB, containing about 1.05 million rows of customer data from around the world. The original data looks like this:
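A first step would be loading the file with pandas and inspecting its shape and column types. The snippet below is a minimal sketch: the filename and column names are assumptions, and a small in-memory sample stands in for the real ~60 MB file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV; column names are assumed.
sample_csv = io.StringIO(
    "CustomerID,Country,2016e\n"
    "1001,Germany,1520.50\n"
    "1002,France,980.00\n"
    "1003,Japan,2210.75\n"
)

# With the real file this would be: df = pd.read_csv("customers.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # rows x columns; ~1.05M rows for the real dataset
print(df.dtypes)  # check that 2016e parsed as a numeric type
```

At 60 MB the file fits comfortably in memory, so a single `read_csv` call is enough; chunked reading would only matter for much larger files.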
The dataset contains these fields:
• 2016e: the projected revenue for 2016. These values were created by extrapolating a model based on the past two years of revenue records.
According to the person who supplied the data, the CustomerID field contains no duplicate records; they guarantee that the dataset is well structured, so we can treat this as intrinsic knowledge. My job as a Data Scientist is to find errors in this dataset. We know that the total projected revenue for 2016 equals $419,896,187.87, and the uploaded data must match this value.
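The two checks described above — verifying the CustomerID uniqueness guarantee and reconciling the 2016e column against the known total — can be sketched as below. The sample rows and expected total here are hypothetical placeholders; against the real file, the expected total would be 419,896,187.87.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real dataset; columns assumed.
sample_csv = io.StringIO(
    "CustomerID,2016e\n"
    "1001,100.00\n"
    "1002,250.50\n"
    "1003,149.50\n"
)
df = pd.read_csv(sample_csv)

# 1. Verify the uniqueness guarantee instead of trusting it blindly.
dup_count = df["CustomerID"].duplicated().sum()
print("duplicated CustomerIDs:", dup_count)

# 2. Reconcile the projected revenue against the known total.
#    EXPECTED_TOTAL is for this sample; the real figure is 419,896,187.87.
EXPECTED_TOTAL = 500.00
total = df["2016e"].sum()

# Compare with a small tolerance to avoid float round-off surprises.
matches = abs(total - EXPECTED_TOTAL) < 0.01
print(f"total = {total:.2f}, matches expected: {matches}")
```

If `dup_count` is nonzero or the totals disagree, the supplier's guarantee is broken and the mismatching rows are the place to start hunting for errors.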