In this lab, you'll apply regression analysis using simple matrix manipulations to fit a model to given data, and then predict new values for previously unseen data. You'll follow the approach highlighted in the previous lesson where you used NumPy to build the appropriate matrices and vectors and solve for the parameter values of the regression.
In this lab you will:
- Use matrix algebra to calculate the parameter values of a linear regression
First, let's import necessary libraries:
import csv # for reading csv file
import numpy as np
The dataset you'll use for this experiment is "Sales Prices in the City of Windsor, Canada", which is very similar to the Boston Housing dataset. This dataset contains a number of input (independent) variables, including area, number of bedrooms/bathrooms, facilities (AC/garage), etc. and an output (dependent) variable, price. You'll formulate a linear algebra problem to find linear mappings from input features using the equation provided in the previous lesson.
This will allow you to find a relationship between house features and house price for the given data, allowing you to find unknown prices for houses, given the input features.
A description of the dataset and included features is available here.
In your repository, the dataset is available as windsor_housing.csv. There are 11 input features (first 11 columns):
lotsize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea
and 1 output feature i.e. price (12th column).
The focus of this lab is not really answering a preset analytical question, but learning how to perform a regression experiment using mathematical manipulations, similar to the one you performed using statsmodels. So you won't be using any pandas or statsmodels goodness here. The key objectives here are to:
- Understand regression with matrix algebra
- Gain mastery of NumPy scientific computation
Let's give you a head start by importing the dataset. You'll perform the following steps to get the data ready for analysis:
- Initialize an empty list, data, for loading data
- Read the csv file containing the complete (raw) windsor_housing.csv. Use csv.reader() for loading data. Store this in data one row at a time
- Drop the first row of the csv file as it contains the names of variables (header) which won't be used during analysis (keeping this will cause errors as it contains text values)
- Append a column of all 1s to the data (bias) as the first column
- Convert data to a NumPy array and inspect the first few rows
NOTE: csv.reader() reads the csv as a text file, so you should convert the contents to float.
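The loading steps above can be sketched roughly as follows. This is a minimal sketch, not the lab's reference solution: it reads from a small in-memory stand-in (the hypothetical sample_csv) so that it runs without the real file, and it assumes windsor_housing.csv has the same layout, a header row followed by purely numeric rows.

```python
import csv
import io

import numpy as np

# Hypothetical in-memory stand-in for windsor_housing.csv (assumption: the
# real file has a header row followed by numeric rows, target in last column).
sample_csv = io.StringIO(
    "lotsize,bedrooms,price\n"
    "5850,3,42000\n"
    "4000,2,38500\n"
    "3060,3,49500\n"
)

data = []
reader = csv.reader(sample_csv)
next(reader)  # drop the header row of variable names
for row in reader:
    # csv.reader yields strings, so convert each value to float
    # and prepend the bias term 1.0 as the first column
    data.append([1.0] + [float(value) for value in row])

data = np.array(data)
print(data[:5])  # inspect the first few rows
```

With the real dataset, the StringIO object would be replaced by a file handle opened on windsor_housing.csv.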
# Your code here
# First 5 rows of raw data
# array([[1.00e+00, 5.85e+03, 3.00e+00, 1.00e+00, 2.00e+00, 1.00e+00,
# 0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00,
# 4.20e+04],
# [1.00e+00, 4.00e+03, 2.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
# 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
# 3.85e+04],
# [1.00e+00, 3.06e+03, 3.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
# 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
# 4.95e+04],
# [1.00e+00, 6.65e+03, 3.00e+00, 1.00e+00, 2.00e+00, 1.00e+00,
# 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
# 6.05e+04],
# [1.00e+00, 6.36e+03, 2.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
# 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
# 6.10e+04]])
Explore NumPy's official documentation to manually split a dataset using a random sampling method of your choice. Some useful methods are located in the numpy.random library.
- Perform a random 80/20 split on data using a method of your choice in NumPy
- Split the data to create x_train, y_train, x_test, and y_test arrays
- Inspect the contents to see if the split performed as expected
Note: When randomly splitting data, it's always recommended to set a seed in order to ensure reproducibility
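One possible way to do the split is to shuffle the row indices with a seeded generator and cut at the 80% mark. The sketch below is only one option among several in numpy.random; it uses a small synthetic array in place of the real data, and the seed value 42 is arbitrary.

```python
import numpy as np

# Seeded generator for reproducibility (the seed value is an arbitrary choice)
rng = np.random.default_rng(42)

# Synthetic stand-in with the lab's layout: bias column, features, then target
data = np.column_stack([np.ones(10), rng.random((10, 3)), rng.random(10)])

# Shuffle the row indices, then cut at the 80% mark
indices = rng.permutation(len(data))
cut = round(0.8 * len(data))
train, test = data[indices[:cut]], data[indices[cut:]]

# The last column is the target; everything before it is the input
x_train, y_train = train[:, :-1], train[:, -1]
x_test, y_test = test[:, :-1], test[:, -1]

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```

Rounding 0.8 times the row count gives 437 training rows for a 546-row dataset, which matches the shapes listed in the expected output.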
# Your code here
# Split results
# Raw data Shape: (546, 13)
# Train/Test Split: (437, 13) (109, 13)
# x_train, y_train, x_test, y_test: (437, 12) (437,) (109, 12) (109,)
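With x_train and y_train in place, you can now compute your beta values using the normal equation from the previous lesson:

$$\beta = (X^T X)^{-1} X^T y$$

where $X$ is x_train and $y$ is y_train.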
- Using NumPy operations (transpose, inverse) that we saw earlier, compute the above equation in steps
- Print your beta values
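As an illustration of computing the least-squares solution in steps with transpose and inverse, here is a minimal sketch on synthetic data. The names true_beta, xt, xtx, and xtx_inv are illustrative only, and the data is noise-free so the recovered coefficients can be checked against the known ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: bias column plus two features (illustration only)
x_train = np.column_stack([np.ones(50), rng.random((50, 2))])
true_beta = np.array([2.0, -1.0, 3.0])  # hypothetical coefficients
y_train = x_train @ true_beta           # noise-free target

# Step 1: transpose of X
xt = x_train.T
# Step 2: X^T X
xtx = xt @ x_train
# Step 3: inverse of X^T X
xtx_inv = np.linalg.inv(xtx)
# Step 4: (X^T X)^{-1} X^T y
beta = xtx_inv @ xt @ y_train

print(beta)
```

Explicit inversion keeps the steps visible, which suits this lab. For ill-conditioned data, np.linalg.solve or np.linalg.lstsq is usually the more numerically stable choice.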
# Your code here
# Beta values
# Due to random split, your answers may vary
# [-5.46637290e+03 3.62457767e+00 2.75100964e+03 1.47223649e+04
# 5.97774591e+03 5.71916945e+03 5.73109882e+03 3.83586258e+03
# 8.12674607e+03 1.33296437e+04 3.74995169e+03 1.01514699e+04]
Great, you now have a set of coefficients that describe the linear mappings between the input features and the house price. To use them on the test data:
- Create a new empty list (y_pred) for saving predictions
- For each row of x_test, take the dot product of the row with beta to calculate the prediction for that row
- Append the predictions to y_pred
- Print the new set of predictions
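The prediction loop described above might look like the following sketch. The beta and x_test values here are small made-up numbers chosen so the arithmetic is easy to follow; they are not results from the Windsor data.

```python
import numpy as np

# Made-up coefficients and test rows (bias column first), for illustration
beta = np.array([1.0, 2.0, 3.0])
x_test = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 2.0, 0.0],
])

y_pred = []
for row in x_test:
    # Dot product of one row of features with the coefficient vector
    y_pred.append(row.dot(beta))

print(y_pred)
```

The same predictions can be obtained in a single step with x_test @ beta; the loop simply makes each row's calculation explicit.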
# Your code here
This is exciting: your model can now use the beta values to predict the price of houses given the input features. Let's plot these predictions against the actual values in y_test to see how much our model deviates.
# Plot predicted and actual values as line plots
This doesn't look so bad, does it? Your model, although not perfect at this stage, is making a good attempt to predict house prices, even if a few predictions seem a bit off. There could be a number of reasons for this. Let's dig a bit deeper into the model's predictive abilities by comparing these predictions with the actual values of y_test individually. That will help you calculate the RMSE (root mean squared error) for your model.
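Here is the formula for RMSE:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ is the predicted value and $y_i$ the actual value for row $i$, and $N$ is the number of rows in the test set.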
- Initialize an empty array err
- For each row in y_test and y_pred, take the squared difference and append the error for each row to the err array
- Calculate $RMSE$ from err using the formula shown above
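A minimal sketch of these steps, using the standard definition of RMSE (the square root of the mean of the squared differences between predicted and actual values). The y_test and y_pred values are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_test = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

err = []
for actual, predicted in zip(y_test, y_pred):
    # Squared difference for one row
    err.append((predicted - actual) ** 2)

# Mean of the squared errors, then the square root
rmse = np.sqrt(np.sum(err) / len(err))
print(rmse)
```

Here the squared errors are 1, 0, and 4, so the result is the square root of 5/3.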
# Calculate RMSE
# Due to random split, your answers may vary
# RMSE = 14868.172645765708
The above error is clearly in terms of the dependent variable, i.e., the final house price. You can also use a normalized root mean squared error (NRMSE) in case of multiple regression, which can be calculated from RMSE using the following formula:

$$NRMSE = \frac{RMSE}{\max_i y_i - \min_i y_i}$$
- Calculate normalized RMSE
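One common way to normalize RMSE is to divide it by the range of the observed target values. The sketch below assumes that definition and takes the range from y_train, which is itself an assumption; the numbers are made up for illustration.

```python
import numpy as np

# Made-up values for illustration
rmse = 2.0
y_train = np.array([10.0, 20.0, 30.0])

# Assumption: normalize by the range (max - min) of the observed target
nrmse = rmse / (y_train.max() - y_train.min())
print(nrmse)
```

Because the result is unit-free, it is easier to compare across targets measured on different scales than the raw RMSE.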
# Calculate NRMSE
# Due to random split, your answers may vary
# 0.09011013724706489
There it is. A complete multiple regression analysis using nothing but NumPy. Having good programming skills in NumPy allows you to dig deeper into analytical algorithms in machine learning and deep learning. Using matrix multiplication techniques you saw here, you can easily build a whole neural network from scratch.
- Calculate the R-squared and adjusted R-squared for the above model
- Plot the residuals (similar to statsmodels) and comment on the variance and heteroscedasticity
- Run the experiment in statsmodels and compare the performance of both approaches in terms of computational cost
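For the first item, a possible starting point is sketched below using the standard definitions: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and adjusted R-squared corrects it for the number of predictors. The values and the name n_predictors are illustrative; for the lab's model, the number of predictors would be the 11 input features (the bias column is not counted).

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_test = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
n_predictors = 1  # hypothetical: number of input features, excluding bias

# Residual and total sums of squares
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot

# Adjusted R-squared penalizes additional predictors
n = len(y_test)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)

print(r_squared, adj_r_squared)
```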
In this lab, you built a model for predicting house prices. Remember this is a very naive implementation of regression modeling. The purpose here was to get an introduction to the applications of linear algebra in machine learning and predictive analysis. There are a number of shortcomings in this modeling approach, and you can apply further data modeling techniques to improve this model.