As a Data Scientist, I was hired by a company that services vehicles in many locations. It is a huge company with millions of customers, operating across many different countries, and this time it has produced a revenue forecast for the year 2016. I was given a large CSV dataset of roughly 60 MB, containing about 1.05 million rows of customer data from around the world. The original data looks like this:
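A first step would be loading the file with pandas and inspecting its shape and column types. The snippet below is a minimal sketch: the filename and column names are assumptions, and a small in-memory sample stands in for the real ~60 MB file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV; column names are assumed.
sample_csv = io.StringIO(
    "CustomerID,Country,2016e\n"
    "1001,Germany,1520.50\n"
    "1002,France,980.00\n"
    "1003,Japan,2210.75\n"
)

# With the real file this would be: df = pd.read_csv("customers.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # rows x columns; ~1.05M rows for the real dataset
print(df.dtypes)  # check that 2016e parsed as a numeric type
```

At 60 MB the file fits comfortably in memory, so a single `read_csv` call is enough; chunked reading would only matter for much larger files.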
The dataset contains these fields:
• 2016e: the projected revenue for 2016. These values were created by extrapolating a model based on the past two years of revenue records.
According to the person who supplied the data, the CustomerID field contains no duplicate records; they guarantee that the dataset is well structured, so we can treat this as intrinsic knowledge. My job as a Data Scientist is to find errors in this dataset. We know that the total projected revenue for 2016 equals $419,896,187.87, and the uploaded data must match this value.
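The two checks described above — verifying the CustomerID uniqueness guarantee and reconciling the 2016e column against the known total — can be sketched as below. The sample rows and expected total here are hypothetical placeholders; against the real file, the expected total would be 419,896,187.87.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real dataset; columns assumed.
sample_csv = io.StringIO(
    "CustomerID,2016e\n"
    "1001,100.00\n"
    "1002,250.50\n"
    "1003,149.50\n"
)
df = pd.read_csv(sample_csv)

# 1. Verify the uniqueness guarantee instead of trusting it blindly.
dup_count = df["CustomerID"].duplicated().sum()
print("duplicated CustomerIDs:", dup_count)

# 2. Reconcile the projected revenue against the known total.
#    EXPECTED_TOTAL is for this sample; the real figure is 419,896,187.87.
EXPECTED_TOTAL = 500.00
total = df["2016e"].sum()

# Compare with a small tolerance to avoid float round-off surprises.
matches = abs(total - EXPECTED_TOTAL) < 0.01
print(f"total = {total:.2f}, matches expected: {matches}")
```

If `dup_count` is nonzero or the totals disagree, the supplier's guarantee is broken and the mismatching rows are the place to start hunting for errors.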