ramyananth / Data-Wrangling-using-Pandas-and-Regex-

In this project I implemented and unit tested a series of Python functions (Q1-Q13) that are typically required during the data wrangling phase of an end-to-end data science pipeline.

Data Wrangling consists of the following main steps:

Data Acquisition, Data Cleansing, Data Understanding: Basics, Data Manipulation

  1. Data Acquisition Objectives

Question 1: How to import multiple files for storage and access? (store filenames in an array)

Question 2: How to import data in different formats? (read_excel, read_csv)

Question 3: How are they read in by pandas? (DataFrame)

Question 4: How to have a peek at the data after import? (head/tail)
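
The acquisition steps above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the file names `a.csv` and `b.csv`, the `id` and `gpa` columns, and the temporary directory are all made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small CSV files to a temporary directory so the sketch is
# self-contained (hypothetical file names and columns).
tmp_dir = tempfile.mkdtemp()
samples = [("a.csv", "id,gpa\n1,3.5\n2,3.9\n"), ("b.csv", "id,gpa\n3,2.8\n")]
for name, rows in samples:
    with open(os.path.join(tmp_dir, name), "w") as fh:
        fh.write(rows)

# Q1: store the file names in a list (sorted for a stable order).
filenames = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Q2/Q3: read each file into a DataFrame, keyed by its base name.
# For Excel files, pd.read_excel would take the place of pd.read_csv.
frames = {os.path.basename(f): pd.read_csv(f) for f in filenames}

# Q4: peek at the first and last rows after import.
first_rows = frames["a.csv"].head(1)
last_rows = frames["a.csv"].tail(1)
print(first_rows)
print(last_rows)
```

Keeping the DataFrames in a dictionary keyed by file name is one possible design choice; a plain list works as well when the names are not needed later.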

  2. Data Cleansing Objectives

Question 5: Check attributes of each file

Question 5: Identify data types

Question 5: Apply coercion if applicable

Question 5: Check for NA/missing data

Question 6: Remove/replace corrupt data

Question 6: Identify duplicate data

Question 6: Check for corrupt/incorrect data

Check for data consistency (e.g. GPA cannot be less than 0)

Identifying and removing outliers
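
A minimal sketch of the cleansing steps above, assuming a made-up `student`/`gpa` table. The interquartile-range rule used for outliers is an assumption chosen for illustration; the project may use a different criterion.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: gpa arrives as strings, with one duplicate row,
# one impossible value and one unparseable entry.
df = pd.DataFrame({
    "student": ["ann", "bob", "bob", "cy", "dee"],
    "gpa": ["3.5", "3.1", "3.1", "-1.0", "n/a"],
})

# Q5: check attributes and data types.
print(df.dtypes)

# Q5: coerce gpa to numeric; entries that cannot be parsed become NaN.
df["gpa"] = pd.to_numeric(df["gpa"], errors="coerce")

# Q5: count missing values per column.
missing = df.isna().sum()

# Q6: identify and drop duplicate rows.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates()

# Consistency check: GPA cannot be less than 0, so treat such values as missing.
df.loc[df["gpa"] < 0, "gpa"] = np.nan

# Q6: remove rows whose gpa is missing or was flagged as corrupt.
clean = df.dropna(subset=["gpa"]).reset_index(drop=True)


def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


# Outlier removal on a separate hypothetical series.
scores = pd.Series([1, 2, 3, 4, 100])
scores_no_outliers = scores[iqr_mask(scores)]
```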

  3. Data Understanding Objectives

Question 7: Basic Summary Statistics

Question 9: Dimensionality
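
As a rough illustration of these two objectives, `describe` and `shape` cover the basics. The DataFrame below is invented for the example.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"gpa": [2.0, 3.0, 4.0], "credits": [10, 20, 30]})

# Q7: basic summary statistics (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)

# Q9: dimensionality as (rows, columns), plus the number of axes.
n_rows, n_cols = df.shape
n_axes = df.ndim
print(n_rows, n_cols, n_axes)
```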

  4. Data Manipulation Objectives

Question 11: Merge/Concatenate DataFrame

Question 11: Mapping to create a new attribute

Question 11: Incorporate the use of multiple functions

Question 12: Filter to subset the data

Question 13: Discretize data
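
The manipulation objectives above can be sketched as follows. The tables, column names, major codes and GPA bin edges are all assumptions made for the example, not values taken from the project.

```python
import pandas as pd

# Hypothetical input tables sharing an "id" key.
students = pd.DataFrame({"id": [1, 2, 3], "gpa": [3.8, 2.4, 3.1]})
majors = pd.DataFrame({"id": [1, 2, 3], "major_code": ["CS", "EE", "CS"]})

# Q11: merge on the shared key; concatenate stacks rows instead.
merged = students.merge(majors, on="id", how="inner")
stacked = pd.concat([students, students], ignore_index=True)

# Q11: map codes to labels to create a new attribute.
code_to_name = {"CS": "Computer Science", "EE": "Electrical Engineering"}
merged["major"] = merged["major_code"].map(code_to_name)

# Q12: filter to subset the data with a boolean mask.
cs_only = merged[merged["major_code"] == "CS"]

# Q13: discretize gpa into labeled bins (right-inclusive intervals by default).
merged["gpa_band"] = pd.cut(
    merged["gpa"], bins=[0, 2.5, 3.5, 4.0], labels=["low", "mid", "high"]
)
print(merged)
```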

Regular Expressions: Regular expressions are used in conjunction with other preprocessing steps for matching/parsing patterns.

Questions 2/5/6: Use regular expressions to find/match specific content

Question 6: String manipulation via substring and replace methods
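
A small sketch of how regular expressions and string methods might be combined with pandas for these tasks. The `course` column and its values are hypothetical.

```python
import re

import pandas as pd

# Hypothetical course codes with inconsistent case and separators.
df = pd.DataFrame({"course": ["CS 101", "cs-202", "EE 150"]})

# Find/match specific content: rows whose code starts with "cs" in any case.
is_cs = df["course"].str.contains(r"^cs", flags=re.IGNORECASE, regex=True)
cs_courses = df[is_cs]

# Parse a pattern: pull out the first run of digits as the course number.
df["number"] = df["course"].str.extract(r"(\d+)", expand=False).astype(int)

# Substring: the first two characters, upper-cased, as the department.
df["dept"] = df["course"].str[:2].str.upper()

# Replace: normalize case and collapse spaces or hyphens into one underscore.
df["course_clean"] = (
    df["course"].str.upper().str.replace(r"[\s-]+", "_", regex=True)
)
print(df)
```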

Languages

Language:Jupyter Notebook 100.0%