Matplotlib and Pandas Sampler

Lessons

These short guides are meant to show you some practical examples of matplotlib and pandas, not serve as comprehensive walkthroughs.

Datasets

The data folder contains several datasets, extracted and somewhat normalized for your convenience:

Climate data

data/climate
- Sources:
  - NASA-aggregated data on global temperature and greenhouse gases
data/schools
- Sources:
  - 2014 SAT scores for California schools
  - 2014 Free-and-reduced lunch (poverty) data for schools
data/stocks
- Source:
  - Daily closing prices for top tech stocks, via Yahoo Finance.
data/congress
- Sources:
  - Legislator spreadsheet from Sunlight Foundation.
  - Twitter API and t-tool

Ad-hoc examples (to get their own notebook)

Typecasting dates during the pandas import:

from os.path import join
import matplotlib.pyplot as plt 
import pandas as pd
fname = join('data', 'stocks', 'YHOO.csv')
# must specify that the 'Date' column is actually a date
# and pandas will try its best to convert it
df = pd.read_csv(fname, parse_dates=['Date'])
fig, ax = plt.subplots()
ax.plot(df['Date'], df['Adj Close'])
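
To see what parse_dates actually changes, here is a minimal sketch that reads a tiny in-memory CSV with and without it. The two rows are made up for illustration and are not taken from YHOO.csv; only the column names match the example above.

```python
from io import StringIO
import pandas as pd

# Made-up sample rows standing in for data/stocks/YHOO.csv (an assumption)
csv_text = "Date,Adj Close\n2015-01-02,50.17\n2015-01-05,49.13\n"

raw = pd.read_csv(StringIO(csv_text))
parsed = pd.read_csv(StringIO(csv_text), parse_dates=['Date'])

# Without parse_dates the column holds plain strings;
# with it, pandas stores real timestamps
print(raw['Date'].dtype)
print(parsed['Date'].dtype)

# Timestamp columns expose date parts through the .dt accessor
print(parsed['Date'].dt.year.tolist())
```

Once the column holds real timestamps, matplotlib can space and label the x-axis by date rather than treating each value as an arbitrary string.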

Without pandas, here's what that typecasting would look like:

from os.path import join
from datetime import datetime
import csv
fname = join('data', 'stocks', 'YHOO.csv')
with open(fname, 'r', newline='') as rf:
    # DictReader needs the open file object, not the filename string
    data = list(csv.DictReader(rf))
    for d in data:
        d['Date'] = datetime.strptime(d['Date'], '%Y-%m-%d')
        d['Adj Close'] = float(d['Adj Close'])
# then the visualization code...

Coercing numeric values with pandas

The 2014 SAT score data is an example of annoyingly difficult dirty data. The columns contain a mix of numbers and things like asterisks, which need to be cleared out if pandas is to typecast a column as all numbers/floats/etc.

The coercion can be done when read_csv() is called; check out the documentation for all of its arguments.

One argument is na_values, which lets us specify string values that should be treated as missing ("not-a-number") values, such as 'NA' or '*'.

Here's the import both without and with na_values:

from os.path import join
import pandas as pd
fname = join('data', 'schools', 'sat-2014.csv')
adf = pd.read_csv(fname)
bdf = pd.read_csv(fname, na_values=['*'])

Compare the dtypes attributes of adf and bdf -- many more columns of the bdf dataframe are typecast as numbers.
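
One way to make that comparison is to put both dtypes listings side by side. The sketch below uses a small in-memory CSV rather than the real file; the rows are made up, and the avg_math column name is hypothetical.

```python
from io import StringIO
import pandas as pd

# Made-up rows; 'avg_math' is a hypothetical column name,
# not necessarily a header in sat-2014.csv
csv_text = (
    "school,number_of_test_takers,avg_math\n"
    "A,120,510\n"
    "B,*,*\n"
    "C,35,470\n"
)

adf = pd.read_csv(StringIO(csv_text))
bdf = pd.read_csv(StringIO(csv_text), na_values=['*'])

# One row per column, showing its dtype in each dataframe
comparison = pd.DataFrame({'adf': adf.dtypes, 'bdf': bdf.dtypes})
print(comparison)
```

In adf, a single asterisk is enough to keep a whole column as text; in bdf, the asterisks become NaN and the column is read as numeric.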

Now it's easy to filter the SAT results down to schools with a minimum number of test takers (here, at least 20):

cdf = bdf[bdf['number_of_test_takers'] >= 20]
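
Note that rows whose count was an asterisk are now NaN, and a comparison against NaN is False, so those schools drop out of the filtered result as well. A minimal sketch with a made-up dataframe standing in for bdf:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for bdf; school B's count is missing (was '*')
bdf = pd.DataFrame({
    'school': ['A', 'B', 'C'],
    'number_of_test_takers': [120, np.nan, 12],
})

# NaN >= 20 evaluates to False, so school B is excluded along with C
mask = bdf['number_of_test_takers'] >= 20
cdf = bdf[mask]
print(cdf)
```

If the schools with missing counts matter for your analysis, inspect them separately with isna() before filtering.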
