In this lab, you'll get hands-on practice cleaning data with pandas.
You will be able to:
- Use the .map() and .apply() methods to apply a function to a pandas Series or DataFrame
- Perform operations to change the structure of pandas DataFrames
- Change the index of a pandas DataFrame
- Change data types of columns in pandas DataFrames
Import the file 'turnstile_180901.txt'.
# Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Import the file 'turnstile_180901.txt'
df = pd.read_csv('turnstile_180901.txt')
# Print the number of rows and columns in df
print(df.shape)
# Print the first five rows of df
df.head()
Rename all the columns to lower case:
# Rename all the columns to lower case
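One possible approach, sketched on a small made-up DataFrame rather than the turnstile file. The name toy and its values are illustrative assumptions, and using the .str accessor on the column Index is only one of several ways to do this.

```python
import pandas as pd

# Toy stand-in for the turnstile data; the values are made up for illustration
toy = pd.DataFrame({
    'C/A': ['A002', 'A002'],
    'LINENAME': ['NQR456W', 'NQR456W'],
    'ENTRIES': [100, 120],
})

# df.columns is an Index of strings, so the vectorized .str accessor applies
toy.columns = toy.columns.str.lower()

print(toy.columns.tolist())
```

The same idea should carry over to df, since its column names are also plain strings.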
Change the index to 'linename':
# Change the index to 'linename'
Reset the index:
# Reset the index
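Setting and resetting an index can be sketched as follows on a made-up DataFrame. The names toy, indexed, and restored are hypothetical, and this is an illustration of set_index and reset_index rather than the lab's expected solution.

```python
import pandas as pd

# Toy data; the line names and counts are made up for illustration
toy = pd.DataFrame({
    'linename': ['NQR456W', 'ACE', '1'],
    'entries': [100, 200, 300],
})

# set_index returns a new DataFrame with 'linename' moved into the index
indexed = toy.set_index('linename')

# reset_index moves the index back into a regular column
# and restores the default 0..n-1 integer index
restored = indexed.reset_index()

print(indexed.head())
print(restored.head())
```

Both methods return a new DataFrame by default, so the result has to be assigned back to a name to keep it.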
Create another column 'Num_Lines' that is a count of how many lines pass through a station. Then sort your DataFrame by this column in descending order.
Hint: According to the data dictionary, LINENAME represents all train lines that can be boarded at a given station. Normally lines are represented by one character. For example, LINENAME 456NQR represents trains 4, 5, 6, N, Q, and R.
# Add a new 'num_lines' column
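Following the hint, if every character in the line name stands for one line, the count is simply the length of the string. The sketch below assumes that one-character-per-line rule holds and uses a made-up DataFrame; the station names and the name toy are illustrative.

```python
import pandas as pd

# Toy data; the station and line names are made up for illustration
toy = pd.DataFrame({
    'station': ['59 ST', 'CANAL ST', '1 AV'],
    'linename': ['NQR456W', 'ACE', 'L'],
})

# Assumption: one character per line, so len() of the string is the line count
toy['num_lines'] = toy['linename'].map(len)

# Sort so the stations served by the most lines come first
toy = toy.sort_values(by='num_lines', ascending=False)

print(toy)
```

Series.map accepts any callable, so the built-in len works here without a lambda.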
Write a function to clean column names:
def clean(col_name):
    # Clean the column name in any way you want to. Hint: think back to str methods
    cleaned = None
    return cleaned
# Use the above function to clean the column names
# Check to ensure the column names were cleaned
df.columns
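As one illustration of what such a cleaning function might do, the sketch below strips surrounding whitespace, lower-cases the name, and replaces inner spaces with underscores. The function name clean_sketch and the sample column names are hypothetical; which cleanup steps are appropriate depends on what the real column names look like.

```python
import pandas as pd

def clean_sketch(col_name):
    # Strip surrounding whitespace, lower-case, and replace inner spaces with underscores
    return col_name.strip().lower().replace(' ', '_')

# Empty toy DataFrame whose column names are made up for illustration
toy = pd.DataFrame(columns=['C/A', 'Day Of Week', 'EXITS   '])

# Apply the function to every column name
toy.columns = [clean_sketch(col) for col in toy.columns]

print(toy.columns.tolist())
```

Passing the function to toy.rename(columns=clean_sketch) would be an equivalent alternative to the list comprehension.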
- Change the data type of the 'date' column to a date
- Add a new column 'day_of_week' that represents the day of the week
# Convert the data type of the 'date' column to a date
# Add a new column 'day_of_week' that represents the day of the week
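A minimal sketch of both steps on made-up data. It assumes the dates are strings in MM/DD/YYYY form; check the actual file before relying on that format string. The name toy is illustrative.

```python
import pandas as pd

# Toy dates, assumed to be MM/DD/YYYY strings
toy = pd.DataFrame({'date': ['08/25/2018', '08/26/2018', '08/27/2018']})

# An explicit format avoids ambiguous day/month parsing
toy['date'] = pd.to_datetime(toy['date'], format='%m/%d/%Y')

# .dt.dayofweek numbers the days from Monday=0 to Sunday=6
toy['day_of_week'] = toy['date'].dt.dayofweek

print(toy)
```

Because .dt.dayofweek runs from Monday=0 to Sunday=6, the values 5 and 6 correspond to Saturday and Sunday.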
# Group the data by day of week and plot the sum of the numeric columns
grouped = df.groupby('day_of_week').sum(numeric_only=True)
grouped.plot(kind='barh')
plt.show()
- Reset the index of grouped
- Print the first five rows of grouped
# Reset the index of grouped
grouped = None
# Print the first five rows of grouped
Add a new column 'is_weekend' that maps the 'day_of_week' column using the dictionary weekend_map:
# Use this dictionary to create a new column
weekend_map = {0:False, 1:False, 2:False, 3:False, 4:False, 5:True, 6:True}
# Add a new column 'is_weekend' that maps the 'day_of_week' column using weekend_map
grouped['is_weekend'] = grouped['day_of_week'].map(weekend_map)
# Group the data by weekend/weekday and plot the sum of the numeric columns
wkend = grouped.groupby('is_weekend').sum()
wkend[['entries', 'exits']].plot(kind='barh')
plt.show()
Remove the 'c/a' and 'scp' columns.
# Remove the 'c/a' and 'scp' columns
df = None
df.head(2)
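Dropping columns can be sketched like this on a made-up DataFrame. The names toy and trimmed and the values are illustrative, and DataFrame.drop is one of several ways to remove columns.

```python
import pandas as pd

# Toy row; the values are made up for illustration
toy = pd.DataFrame({
    'c/a': ['A002'],
    'scp': ['02-00-00'],
    'entries': [100],
    'exits': [50],
})

# drop returns a new DataFrame; columns= names the labels to remove
trimmed = toy.drop(columns=['c/a', 'scp'])

print(trimmed.head(2))
```

The original toy is left unchanged, which is why the result is assigned to a new name.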
What is misleading about the day of week and weekend/weekday charts you just plotted?
# Your answer here
Great! You practiced your data cleanup skills using Pandas.