datademofun / matplotlibsampler

Datasets and Matplotlib/pandas examples

Matplotlib and Pandas Sampler

Lessons

These short guides are meant to show you some practical examples of matplotlib and pandas, not serve as comprehensive walkthroughs.

Further reading

Matplotlib

(Note: While the Matplotlib homepage is a place you eventually want to go to, some of the documentation may be more complicated for you than necessary...)

Pandas

Datasets

The data folder contains several datasets, extracted and somewhat normalized for your convenience:

Climate data

Ad-hoc examples (to get their own notebook)

Typecasting dates during the pandas import:

from os.path import join
import matplotlib.pyplot as plt 
import pandas as pd
fname = join('data', 'stocks', 'YHOO.csv')
# must specify that the 'Date' column is actually a date
# and pandas will try its best to convert it
df = pd.read_csv(fname, parse_dates=['Date'])
fig, ax = plt.subplots()
ax.plot(df['Date'], df['Adj Close'])
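To see what parse_dates changes without needing the data folder, here is a minimal sketch that reads the same kind of CSV from an in-memory string. The two rows are made-up illustration values, not real YHOO prices, and csv_text is a hypothetical stand-in for the file.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/stocks/YHOO.csv; the values are made up
csv_text = "Date,Adj Close\n2015-01-02,50.17\n2015-01-05,49.13\n"

# Without parse_dates, the Date column is read in as plain strings
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to a datetime dtype
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(raw['Date'].dtype)
print(parsed['Date'].dtype)
```

A datetime column is what lets matplotlib space the x-axis by actual time rather than treating each date as an unrelated label.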

Without pandas, here's what that typecasting would look like:

from os.path import join
from datetime import datetime
import csv
fname = join('data', 'stocks', 'YHOO.csv')
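# newline='' is what the csv module docs recommend when opening files
with open(fname, 'r', newline='') as rf:
    # pass the open file object (not the filename) to DictReader
    data = list(csv.DictReader(rf))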
    for d in data:
        d['Date'] = datetime.strptime(d['Date'], '%Y-%m-%d')
        d['Adj Close'] = float(d['Adj Close'])
# then the visualization code...
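The visualization step needs separate sequences for the x and y values, so the list of dicts has to be split into two lists first. Here is a minimal sketch of that step, assuming the same column names as above; it reads from an in-memory string (csv_text, with made-up values) so it runs without the data folder.

```python
import csv
import io
from datetime import datetime

# Hypothetical stand-in for data/stocks/YHOO.csv; the values are made up
csv_text = "Date,Adj Close\n2015-01-02,50.17\n2015-01-05,49.13\n"

with io.StringIO(csv_text) as rf:
    data = list(csv.DictReader(rf))

# csv gives us strings for every field, so convert each one by hand
for d in data:
    d['Date'] = datetime.strptime(d['Date'], '%Y-%m-%d')
    d['Adj Close'] = float(d['Adj Close'])

# Pull each column out into its own list for plotting
dates = [d['Date'] for d in data]
prices = [d['Adj Close'] for d in data]
```

The two lists can then be handed to ax.plot(dates, prices), the same call used in the pandas version.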

Coercing numeric values with pandas

The 2014 SAT score data is an example of annoyingly difficult dirty data. The columns contain a mix of numbers and things like asterisks, which need to be cleared out if pandas is to typecast a column as all numbers/floats/etc.

The coercion can be done when read_csv() is called; check out the documentation for all of its arguments.

One argument is na_values, which lets us specify string values that should be treated as missing ("not-a-number") values, such as 'NA' or '*'.

Here's the import without na_values (adf) and with it (bdf):

from os.path import join
import pandas as pd
fname = join('data', 'schools', 'sat-2014.csv')
adf = pd.read_csv(fname)
bdf = pd.read_csv(fname, na_values=['*'])

Compare the dtypes attributes of adf and bdf: many more columns of the bdf dataframe are typecast as numbers.
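The comparison can be tried on a tiny in-memory table. This is only a sketch: the rows are made up, and the school and avg_score columns are hypothetical names chosen for illustration (only number_of_test_takers comes from the SAT file).

```python
import io
import pandas as pd

# Made-up rows that mimic the problem: '*' mixed in with numbers
csv_text = (
    "school,number_of_test_takers,avg_score\n"
    "A,25,510\n"
    "B,*,*\n"
    "C,12,480\n"
)

# Without na_values, one '*' forces the whole column to be read as text
adf = pd.read_csv(io.StringIO(csv_text))

# With na_values, '*' becomes NaN and the column can be numeric
bdf = pd.read_csv(io.StringIO(csv_text), na_values=['*'])

print(adf.dtypes)
print(bdf.dtypes)
```

Because NaN is a floating-point value, a column that contains missing entries ends up as floats even when every real value is a whole number.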

Now it's easy to filter the SAT results by schools that have a minimum number of test takers:

cdf = bdf[bdf['number_of_test_takers'] >= 20]
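One detail worth knowing about this filter: a comparison against NaN evaluates to False, so rows whose count was '*' drop out along with the schools that fall under the minimum. A minimal sketch on made-up rows (the school column is a hypothetical name for illustration):

```python
import io
import pandas as pd

# Made-up rows; '*' marks a suppressed count
csv_text = (
    "school,number_of_test_takers\n"
    "A,25\n"
    "B,*\n"
    "C,12\n"
)

bdf = pd.read_csv(io.StringIO(csv_text), na_values=['*'])

# NaN >= 20 is False, so school B is excluded along with school C
cdf = bdf[bdf['number_of_test_takers'] >= 20]

print(cdf)
```

If the rows with missing counts need to be kept for a separate look, bdf[bdf['number_of_test_takers'].isna()] selects just those.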
