In this project I implemented and unit-tested a series of Python functions (Q1-Q13) that are typically required during the data wrangling phase of an end-to-end data science pipeline.
Data Wrangling consists of the following main steps:
Data Acquisition, Data Cleansing, Data Understanding: Basics, Data Manipulation
- Data Acquisition Objectives
Question 1: How to import multiple files for storage and access? (store filenames in an array)
Question 2: How to import data in different formats? (read_excel, read_csv)
Question 3: How does pandas represent the imported data? (DataFrame)
Question 4: How to have a peek at the data after import? (head/tail)
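
The acquisition steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `load_file` and `load_files`, the file names, and the sample rows are all hypothetical, and the demo writes its own CSV files to a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_file(path):
    """Read one file into a DataFrame, choosing the reader by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xls", ".xlsx"):
        # read_excel needs an Excel engine (e.g. openpyxl) to be installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")


def load_files(paths):
    """Return a dict mapping each file name to its DataFrame."""
    return {Path(p).name: load_file(p) for p in paths}


# Hypothetical demo data, written to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "students.csv").write_text("id,name,gpa\n1,Ann,3.5\n2,Bob,2.9\n3,Cy,3.8\n")
(tmp_dir / "courses.csv").write_text("id,course\n1,Math\n2,Art\n")

# Q1: keep the file names in a list; Q2/Q3: read each into a DataFrame.
filenames = [tmp_dir / "students.csv", tmp_dir / "courses.csv"]
frames = load_files(filenames)

# Q4: peek at the first rows of every imported file.
for name, df in frames.items():
    print(name, type(df).__name__, df.shape)
    print(df.head(2))
```

Keying the result by file name is one possible design choice; it keeps each DataFrame traceable to the file it came from.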
- Data Cleansing Objectives
Question 5: Check attributes of each file
Question 5: Identify data types
Question 5: Apply coercion if applicable
Question 5: Check for NA/missing data
Question 6: Remove/replace corrupt data
Question 6: Identify duplicate data
Question 6: Check for corrupt/incorrect data
Check for data consistency (e.g. GPA cannot be less than 0)
Identifying and removing outliers
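
A rough sketch of how these cleansing checks might look in pandas is given below. The sample data, the column names, and the helper `iqr_outlier_mask` are hypothetical, and the 1.5 × IQR rule is only one common outlier convention that is assumed here, not necessarily the one used in the project.

```python
import pandas as pd

# Hypothetical raw data: gpa arrives as text, with a duplicate row,
# a missing marker ("n/a") and an impossible negative value.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "gpa": ["3.5", "2.9", "2.9", "n/a", "-1.0", "3.8"],
})

print(raw.dtypes)  # inspect the data types of each attribute

clean = raw.copy()
# Coerce to numeric; values that cannot be parsed become NaN.
clean["gpa"] = pd.to_numeric(clean["gpa"], errors="coerce")

missing_per_column = clean.isna().sum()        # NA count per column
n_duplicates = int(clean.duplicated().sum())   # fully duplicated rows

clean = clean.drop_duplicates()
clean = clean.dropna(subset=["gpa"])
clean = clean[clean["gpa"] >= 0]  # consistency: GPA cannot be less than 0


def iqr_outlier_mask(series, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 95])
mask = iqr_outlier_mask(values)
without_outliers = values[~mask]
```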
- Data Understanding Objectives
Question 7: Basic Summary Statistics
Question 9: Dimensionality
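
For illustration, summary statistics and dimensionality can be obtained with standard pandas calls as sketched below. The DataFrame contents are hypothetical sample values.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "gpa": [3.5, 2.9, 3.8, 3.2],
    "credits": [12, 15, 9, 12],
})

# Basic summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# Dimensionality: number of rows and columns, and number of axes.
n_rows, n_cols = df.shape
print(n_rows, n_cols, df.ndim)
```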
- Data Manipulation Objectives
Question 11: Merge/Concatenate DataFrame
Question 11: Mapping to create a new attribute
Question 11: Incorporate the use of multiple functions
Question 12: Filter to subset the data
Question 13: Discretize data
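
The manipulation objectives above could be sketched along the following lines. The tables, column names, code-to-name mapping, and bin edges are all hypothetical choices made for this example.

```python
import pandas as pd

# Hypothetical input tables.
students = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "gpa": [3.5, 2.9, 3.8],
})
majors = pd.DataFrame({
    "id": [1, 2, 3],
    "major_code": ["CS", "MA", "CS"],
})

# Merge the two tables on their shared key.
merged = students.merge(majors, on="id", how="left")

# Concatenate extra rows that share the same columns.
extra = pd.DataFrame({"id": [4], "name": ["Di"], "gpa": [3.1], "major_code": ["MA"]})
merged = pd.concat([merged, extra], ignore_index=True)

# Map codes to names to create a new attribute.
code_to_name = {"CS": "Computer Science", "MA": "Mathematics"}
merged["major"] = merged["major_code"].map(code_to_name)

# Discretize GPA into labeled bins (each bin includes its right edge).
merged["gpa_band"] = pd.cut(
    merged["gpa"], bins=[0, 3.0, 3.5, 4.0], labels=["low", "mid", "high"]
)

# Filter to subset the data with a boolean mask.
cs_students = merged[merged["major_code"] == "CS"]
print(cs_students)
```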
Regular Expressions: Regular expressions are used in conjunction with other preprocessing steps for matching/parsing patterns.
Questions 2/5/6: Use regular expressions to find/match specific content
Question 6: String manipulation via substring and replace methods
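
As a small illustration of the regular-expression and string-manipulation items, the sketch below uses Python's `re` module and the pandas `.str` accessor. The file names, column names, and patterns are hypothetical examples.

```python
import re

import pandas as pd

# Match file names by extension with a regular expression.
filenames = ["grades_2019.csv", "grades_2020.xlsx", "notes.txt"]
pattern = re.compile(r"\.(csv|xlsx?)$", re.IGNORECASE)
data_files = [f for f in filenames if pattern.search(f)]

# Hypothetical messy text columns.
df = pd.DataFrame({
    "student": [" ann SMITH", "Bob  Jones ", "cy lee"],
    "phone": ["555-1234", "5551234", "n/a"],
})

# Flag values that match the expected pattern (3 digits, optional dash, 4 digits).
df["valid_phone"] = df["phone"].str.match(r"\d{3}-?\d{4}$")

# Literal replace: drop the dash.
df["phone_digits"] = df["phone"].str.replace("-", "", regex=False)

# Regex replace: trim, collapse repeated whitespace, normalize case.
df["student"] = (
    df["student"].str.strip().str.replace(r"\s+", " ", regex=True).str.title()
)

# Substring: take the first character of the cleaned name.
df["initial"] = df["student"].str[:1]
print(df)
```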