Edator

Edator is a Python package that performs exploratory data analysis (EDA) for users. It takes a CSV file and generates three outputs: a text report containing a descriptive summary, a series of plots, and a cleaned CSV file.

Set up

Dependencies

Python 3.8.x
matplotlib==3.1.2
numpy==1.18.1
pandas==1.0.0
PySimpleGUI==4.19.0
scikit-learn==0.22.1
scipy==1.4.1
seaborn==0.10.0
statsmodels==0.11.1
more-itertools==8.3.0

How to set up? (Important!)

Clone or download this package.
In a terminal, move to the package directory.
- Example for macOS users:
```
$ cd Downloads/Edator
```
Install the required packages using:
```
pip install -r requirements.txt
```
After that, change directory into the Script folder using:
```
$ cd Script
```
Now run the main.py file:
```
$ python main.py
```
A window should open where you choose the CSV file and the paths to export the plots, the report and the cleaned CSV file to.
Done!

The concept behind Edator

Dealing with NaN values and zeros

I handle NaN values column by column. When the percentage of NaN within a column is less than 5%, I remove the affected rows; this applies to both numerical and categorical columns. When it is above 5%, I replace the NaN values with the median for numerical columns and with the mode for categorical columns.
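The rule above can be sketched with pandas as follows. This is an illustration only, not the actual Edator code: the function name `handle_missing` and the `threshold` parameter are hypothetical, and treating a column at exactly 5% as "impute" is an assumption, since the text does not say which way that case goes.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Drop rows for sparsely missing columns, impute the remaining columns."""
    df = df.copy()
    # Fraction of NaN per column, computed once on the original data.
    nan_fraction = df.isna().mean()

    # Columns with some NaN, but less than `threshold`: drop the affected rows.
    drop_cols = [c for c in df.columns if 0 < nan_fraction[c] < threshold]
    if drop_cols:
        df = df.dropna(subset=drop_cols)

    # Remaining columns with NaN: median for numerical, mode for categorical.
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            # Assumes the column has at least one non-missing value.
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df.reset_index(drop=True)
```

Calling `handle_missing(df)` returns a new DataFrame and leaves the input untouched, which makes it easy to compare the data before and after cleaning.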

Dealing with zeros is much harder as it is challenging to differentiate between a zero that is meaningful (has a purpose and should not be removed) and a zero that serves no purpose and can potentially add more noise to the dataset. Hence, I decided to inform the user about the percentage of zeros in the dataset.
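Reporting the share of zeros could look like the sketch below. It assumes the percentage is computed per numerical column; the text only says "in the dataset", so the per-column breakdown and the name `zero_percentage` are illustrative choices.

```python
import pandas as pd


def zero_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of exact zeros in each numerical column."""
    numeric = df.select_dtypes(include="number")
    # The mean of a boolean mask is the fraction of True values.
    return (numeric == 0).mean() * 100
```

The result is a Series indexed by column name, so it can be written straight into a text report.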

Processing outliers

I use the Z-score to detect outliers. A Z-score of 0 indicates that the data point is identical to the mean. A Z-score of 1.0 indicates a value that is one standard deviation from the mean. Z-scores may be positive or negative: a positive value means the data point is above the mean and a negative value means it is below the mean.

In most cases, a threshold of 3 or -3 is used to filter out outliers: a data point is treated as an outlier when its Z-score is above 3 or below -3. I have used this approach for all of my analysis.
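A minimal sketch of Z-score filtering with pandas is shown below. The function name `remove_outliers` is hypothetical, and dropping a row when any of its numerical columns exceeds the threshold is an assumption about how the rule is applied across columns. `scipy.stats.zscore` computes the same quantity with its default settings.

```python
import pandas as pd


def remove_outliers(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows where any numerical column has |Z-score| above `threshold`."""
    numeric = df.select_dtypes(include="number")
    # Z-score: distance from the column mean in units of standard deviation
    # (population standard deviation, ddof=0).
    z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)
    # Taking the absolute value covers both the +3 and the -3 side.
    is_outlier = (z_scores.abs() > threshold).any(axis=1)
    return df[~is_outlier]
```

Note that with very few rows no Z-score can reach 3, so this filter only has an effect on reasonably sized datasets.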

Correlation

For correlation, I included:

Pearson and Spearman correlation for numerical-numerical variables
One-way ANOVA for numerical-categorical variables
Chi-square test for categorical-categorical variables

Using itertools.combinations, I identify every possible pair among numerical-numerical variables, numerical-categorical variables and categorical-categorical variables. I then apply the test that matches each pair type, as listed above.
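The pairing logic can be sketched with itertools and scipy.stats as below. This is not the actual Edator code: the function name `pairwise_tests` and the result layout are hypothetical, the column lists are passed in explicitly, and `itertools.product` is used for the mixed numerical-categorical pairs as a convenience. The sketch assumes NaN values have already been handled.

```python
from itertools import combinations, product

import pandas as pd
from scipy import stats


def pairwise_tests(df, numerical, categorical):
    """Run one test per variable pair and collect the results as rows."""
    rows = []

    # Numerical-numerical: Pearson (linear) and Spearman (rank) correlation.
    for a, b in combinations(numerical, 2):
        r, p = stats.pearsonr(df[a], df[b])
        rows.append({"pair": (a, b), "test": "pearson", "statistic": r, "p_value": p})
        rho, p = stats.spearmanr(df[a], df[b])
        rows.append({"pair": (a, b), "test": "spearman", "statistic": rho, "p_value": p})

    # Numerical-categorical: one-way ANOVA across the category groups.
    for num, cat in product(numerical, categorical):
        groups = [group[num].to_numpy() for _, group in df.groupby(cat)]
        f_stat, p = stats.f_oneway(*groups)
        rows.append({"pair": (num, cat), "test": "anova", "statistic": f_stat, "p_value": p})

    # Categorical-categorical: chi-square test on the contingency table.
    for a, b in combinations(categorical, 2):
        table = pd.crosstab(df[a], df[b])
        chi2, p, _dof, _expected = stats.chi2_contingency(table)
        rows.append({"pair": (a, b), "test": "chi2", "statistic": chi2, "p_value": p})

    return pd.DataFrame(rows)
```

Returning one row per test keeps the output easy to sort by p-value or to dump into a report.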

Plots

For plots, I created:

Scatterplot for numerical variables
Countplot for categorical variables
Boxplot for numerical-categorical variables

Similar to correlation, I used itertools.combinations to create every possible plot. I have also added the hue feature to each scatterplot, but only when the categorical variable has fewer than 5 unique values. For example, if hue = "fruits", there should be at most 4 types of fruit.
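The hue rule can be sketched as follows. The names `hue_candidates` and `save_scatterplots`, the `max_unique` parameter and the output file naming are all hypothetical; only the "fewer than 5 unique values" rule and the use of itertools.combinations come from the description above.

```python
from itertools import combinations
from pathlib import Path

import pandas as pd


def hue_candidates(df: pd.DataFrame, categorical, max_unique: int = 5):
    """Categorical columns with fewer than `max_unique` distinct values."""
    return [col for col in categorical if df[col].nunique() < max_unique]


def save_scatterplots(df: pd.DataFrame, numerical, categorical, out_dir):
    """Save one scatterplot per numerical pair and eligible hue column."""
    # Plotting libraries are imported lazily so that hue_candidates can be
    # used without them installed.
    import matplotlib.pyplot as plt
    import seaborn as sns

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    hues = hue_candidates(df, categorical)
    for x, y in combinations(numerical, 2):
        for hue in hues:
            fig, ax = plt.subplots()
            sns.scatterplot(data=df, x=x, y=y, hue=hue, ax=ax)
            fig.savefig(out_dir / f"scatter_{x}_{y}_by_{hue}.png")
            plt.close(fig)  # free memory when many plots are generated
```

Keeping the hue check in its own function makes the 5-value cut-off easy to adjust without touching the plotting loop.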

Upcoming changes

Once I have received sufficient feedback on this script, I will register this package on PyPI to streamline installation.
Instead of generating txt reports, I will use HTML and Bootstrap to generate a much more appealing look.
