In this section you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics), and visualizing data. In this more free-form project you'll get a chance to practice all of these skills with the Boston Housing data set, which contains housing values in suburbs of Boston. The Boston Housing data set is commonly used by aspiring data scientists.
You will be able to:
- Show mastery of the content covered in this section
Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At minimum, this should include:
- Loading the data (which is stored in the file train.csv)
- Using built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
- Creating meaningful subsets of the data with selection operations such as .loc, .iloc, or related operations. Explain why you chose each subset, and do this for 3 possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ between the two subsets of each split. Examples of potential splits:
  - Create 2 new dataframes based on your existing data, where one contains all the properties next to the Charles River and the other contains the properties that are not.
  - Create 2 new dataframes based on a certain split for crime rate.
- Using histograms and scatterplots to see whether you observe differences between the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
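The loading, summarizing, and splitting steps in the list above could be sketched with pandas roughly as follows. This is a minimal illustration under stated assumptions, not a model solution: it assumes train.csv sits in the working directory and that its columns use the conventional lowercase names for this data set (for example chas, crim, rm, and medv). The helpers summarize, split_on, and explore are hypothetical names introduced only for this example, and the median crime rate is just one arbitrary choice of threshold.

```python
import pandas as pd


def summarize(df, columns):
    """Return centrality and dispersion measures for the given columns."""
    stats = df[columns].agg(["mean", "median", "std", "min", "max"])
    # Interquartile range: a dispersion measure that is less sensitive to outliers
    stats.loc["iqr"] = df[columns].quantile(0.75) - df[columns].quantile(0.25)
    return stats


def split_on(df, mask):
    """Split df into (rows where mask is True, rows where mask is False)."""
    return df.loc[mask], df.loc[~mask]


def explore(path="train.csv"):
    """Load the data, print overall statistics, then compare two 2-way splits."""
    df = pd.read_csv(path)

    # Column names below are assumed, check df.columns on your copy of the file
    print(summarize(df, ["crim", "rm", "medv"]))

    # Split 1: tracts that bound the Charles River vs. those that do not
    river, no_river = split_on(df, df["chas"] == 1)

    # Split 2: crime rate above vs. at or below the median (an arbitrary cut-off)
    high_crime, low_crime = split_on(df, df["crim"] > df["crim"].median())

    subsets = {
        "river": river,
        "no river": no_river,
        "high crime": high_crime,
        "low crime": low_crime,
    }
    for name, subset in subsets.items():
        print(f"\n{name} ({len(subset)} rows)")
        print(summarize(subset, ["medv"]))


# explore()  # uncomment once train.csv is available
```

Keeping the split in a small helper makes it easy to try a third split (for instance on the number of rooms) by passing a different boolean mask.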
This data frame contains the following columns:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
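With the column meanings above in hand, the side-by-side comparison of subsets asked for in the task list could be drawn along these lines. This is only one possible layout, sketched under assumptions: matplotlib and pandas are installed, the frame uses the lowercase column names chas, rm, and medv, and compare_subsets is a hypothetical helper written for this example rather than a function from any library.

```python
import matplotlib.pyplot as plt


def compare_subsets(left, right, labels, x="rm", y="medv"):
    """Draw a histogram of y and a scatterplot of y against x for two subsets."""
    # Sharing axes within each row keeps the two subsets on the same scale
    fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex="row", sharey="row")
    for col, (subset, label) in enumerate(zip((left, right), labels)):
        # density=True so subsets of different sizes remain comparable
        axes[0, col].hist(subset[y], bins=20, density=True)
        axes[0, col].set_title(f"{label}: distribution of {y}")
        axes[0, col].set_xlabel(y)

        axes[1, col].scatter(subset[x], subset[y], alpha=0.6)
        axes[1, col].set_title(f"{label}: {y} vs. {x}")
        axes[1, col].set_xlabel(x)
        axes[1, col].set_ylabel(y)
    fig.tight_layout()
    return fig


# Example use (assumes train.csv is in the working directory):
# import pandas as pd
# df = pd.read_csv("train.csv")
# fig = compare_subsets(df.loc[df["chas"] == 1], df.loc[df["chas"] == 0],
#                       labels=("Bounds river", "Away from river"))
# plt.show()
```

Plotting densities rather than raw counts is a deliberate choice here, since a split such as the Charles River one tends to produce subsets of very different sizes.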
Source:
- Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
- Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Congratulations, you've completed your first "freeform" exploratory data analysis of a popular data set!