Exploration and Visualization of Data with Python and libraries like matplotlib and seaborn

Short Link to repo - bit.ly/pydelhi_eda

Hands-on-Session presented at PyDelhi Meetup, September 2018 and PyData Meetup December 2018

This Jupyter notebook introduces some basic principles of data exploration and visualization using Python with libraries such as Matplotlib and Seaborn. You will learn different methods for exploring data through visualization. We will use several Python packages, namely Matplotlib, Pandas plotting, and Seaborn, to create the visualizations.

About this Jupyter Notebook

To run this notebook you need to install the packages listed below. If any of them is missing from your Python environment, install it first by typing the following commands at a command prompt on your computer. If no error occurs, the packages are installed.

pip install seaborn
pip install pandas
pip install matplotlib
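As a quick check after installing, the sketch below tries to import each package and prints its version. The helper name `installed_versions` is made up for this illustration and is not part of the notebook.

```python
# Try to import each package the notebook relies on and report its version.
import importlib

REQUIRED = ["pandas", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Return {package name: version string, or None if the import fails}."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, rerun the matching `pip install` command.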

How to get started?

Fork the repository to run the Jupyter notebook on your own computer.

Pre-requisite

A bit of experience with Python, Pandas and Jupyter Notebook is sufficient. If you are a beginner then you can follow along with:

Python-QuickNotes

About the dataset

The dataset used for exploration is the Pokemon dataset.
It contains information on all 802 Pokemon from all seven generations of Pokemon. The information in this dataset includes:

Features	Description
name	The English name of the Pokemon
japanese_name	The Original Japanese name of the Pokemon
pokedex_number	The entry number of the Pokemon in the National Pokedex
percentage_male	The percentage of the species that are male. Blank if the Pokemon is genderless
type1	The Primary Type of the Pokemon
type2	The Secondary Type of the Pokemon
classification	The Classification of the Pokemon as described by the Sun and Moon Pokedex
height_m	Height of the Pokemon in metres
weight_kg	The Weight of the Pokemon in kilograms
capture_rate	Capture Rate of the Pokemon
base_egg_steps	The number of steps required to hatch an egg of the Pokemon
abilities	A stringified list of abilities that the Pokemon is capable of having
experience_growth	The Experience Growth of the Pokemon
base_happiness	Base Happiness of the Pokemon
against_?	Eighteen features that denote the amount of damage taken against an attack of a particular type
hp	The Base HP of the Pokemon
attack	The Base Attack of the Pokemon
defense	The Base Defense of the Pokemon
sp_attack	The Base Special Attack of the Pokemon
sp_defense	The Base Special Defense of the Pokemon
speed	The Base Speed of the Pokemon
generation	The numbered generation in which the Pokemon was first introduced
is_legendary	Denotes if the Pokemon is legendary

You can download the dataset from Kaggle
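Once the file is downloaded, a first look at it might go like the sketch below. The file name `pokemon.csv` and the helper `summarize` are assumptions made for this illustration; adjust the path to wherever you saved the Kaggle file.

```python
from pathlib import Path

import pandas as pd

# Assumption: the Kaggle CSV was saved next to the notebook as "pokemon.csv".
CSV_PATH = Path("pokemon.csv")


def summarize(df):
    """Return one row per column: its dtype and its number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


if CSV_PATH.exists():
    pokemon = pd.read_csv(CSV_PATH)
    print(pokemon.shape)       # (rows, columns)
    print(pokemon.head())      # first five rows
    print(summarize(pokemon))  # dtype and missing-value count per column
else:
    print(f"{CSV_PATH} not found; download the dataset from Kaggle first.")
```

Columns such as type2 and percentage_male are described above as optional or blank for some Pokemon, so the missing-value counts are worth checking before plotting.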

Why visualization?

“Visualization gives you answers to questions you didn’t know you had.” – Ben Shneiderman

Visualization is an essential method in any data scientist's toolbox and a key first step in the exploration of most datasets. This process of exploring data visually and with simple summary statistics is known as Exploratory Data Analysis (EDA). As a general rule, you should never start creating models until you understand the relationships in your data. Visualization is also a powerful tool for presenting results and for determining the sources of problems with analytics.

The concept of exploring a dataset visually was pioneered by John Tukey in the 1960s and 1970s.

The key concept of exploratory data analysis (EDA), or visual exploration of data, is to understand the relationships in the dataset. Specifically, when you approach a new dataset, you can use visualization to:

Explore complex datasets, using visualization to develop understanding of the inherent relationships.
Use different chart types to create multiple views of data to highlight different aspects of the inherent relationships.
Use plot aesthetics to project multiple dimensions.
Apply conditioning methods to project multiple dimensions.
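The last two points can be sketched with Seaborn as follows. The data here is synthetic stand-in data, not real Pokemon stats; only the column names follow the dataset description above. Color (the `hue` aesthetic) carries a third variable, and conditioning on generation with `col` draws one panel per generation.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (NOT real Pokemon stats); only the column names
# follow the dataset description.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "attack": rng.integers(20, 150, size=n),
    "defense": rng.integers(20, 150, size=n),
    "generation": rng.integers(1, 4, size=n),    # values 1, 2 or 3
    "is_legendary": rng.integers(0, 2, size=n),  # values 0 or 1
})

# Aesthetics: hue= maps is_legendary to color, adding a third dimension
# to a two-axis scatter plot.
# Conditioning: col= splits the data by generation and draws one panel
# per value, adding a fourth dimension.
grid = sns.relplot(
    data=demo, x="attack", y="defense",
    hue="is_legendary", col="generation",
    kind="scatter", height=3,
)
grid.set_axis_labels("attack", "defense")
```

With the real DataFrame loaded, replacing `demo` with it should give the same layout, with one panel for each of the seven generations.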

In these exercises, you will use Pandas plotting, Matplotlib and Seaborn. We assume you have at least a bit of experience using Pandas and Jupyter notebooks.

Basic chart types

There are innumerable chart types used for data exploration. Some of them are explained below:

Scatter plot : Scatter plots show the relationship between two variables in the form of dots on the plot. In simple terms, the values of one variable along the horizontal axis are plotted against the values of another along the vertical axis.
Line plot : Line plots are similar to point plots. In line plots the discrete points are connected by lines.
Bar plot : Bar plots are used to display the counts of unique values of a categorical variable. The height of the bar represents the count for each unique category of the variable.
Histogram : Histograms are related to bar plots, but are used for numeric variables. Whereas a bar plot shows the counts of unique categories, a histogram shows the number of data values that fall within each bin. The bins divide the range of the variable into equal segments. The vertical axis of the histogram shows the count of data values within each bin.
Box plot : Box plots, also known as box-and-whisker plots, were introduced by John Tukey in 1970. Box plots are another way to visualize the distribution of data values. In this respect, box plots are comparable to histograms, but are quite different in presentation. On a box plot the median value is shown with a dark bar. The inner two quartiles of data values are contained within the 'box'. The 'whiskers' enclose the majority of the data (by the usual convention, up to 1.5 * interquartile range beyond the edges of the box). Outliers are shown by symbols beyond the whiskers. Several box plots can be placed side by side along an axis for comparison. The data are divided using a 'group by' operation, and the box plots for each group are drawn next to each other. In this way, the box plot allows you to display two dimensions of your dataset.
Kernel Density Estimation (KDE) plot : Kernel density plots are similar in concept to histograms. A kernel density plot displays a smoothed density curve of the data values. In other words, the kernel density plot is a smoothed version of a histogram.
Violin plot : A violin plot combines attributes of boxplots and a kernel density estimation plot. Like a box plot, the violin plots can be stacked, with a 'group by' operation. Additionally, the violin plot provides a kernel density estimate for each group. As with the box plot, violin plots allow you to display two dimensions of your dataset.

About the speaker

These lessons were prepared by Praneet Nigam. He is currently working as a Machine Learning Facilitator for the Google Machine Learning Crash Course. To get in touch with the speaker, use the social media links listed below.

Some of the past projects of Praneet Nigam

Support

You can buy me a cup of coffee. Even a small contribution goes a long way. Please Donate Here

Resources

In this tutorial we will work with powerful Python packages like Pandas, Matplotlib and Seaborn. These packages have extensive online documentation. There is an extensive tutorial on Visualization with Pandas. The Seaborn tutorial contains many examples of data visualization. The Matplotlib website has additional resources for learning plotting with Python tools.

prmohanty / DataVisualizationPyDelhi