Multiple Linear Regression in Statsmodels - Lab

Introduction

In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

Objectives

You will be able to:

Determine if it is necessary to perform normalization/standardization for a specific model or set of data
Use standardization/normalization on features of a dataset
Identify if it is necessary to perform log transformations on a set of features
Perform log transformations on different features of a dataset
Use statsmodels to fit a multiple linear regression model
Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters

The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:

Split off and one hot encode the categorical features of interest
Log and scale the selected continuous features

import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

Continuous Features

# Log transform and normalize
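As a hint, here is a minimal sketch of the log-then-standardize pattern on a small made-up frame. The values in `toy` and the `_log` suffix are assumptions for illustration only, not rows or names taken from `ames.csv`; in the lab you would apply the same idea to `ames[continuous]`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in values, NOT real rows from ames.csv
toy = pd.DataFrame({
    'LotArea': [8000, 9500, 11000, 12500, 15000],
    'GrLivArea': [1200, 1500, 1700, 1900, 2200],
})

# Log transform every column and mark the new columns with a suffix
toy_log = np.log(toy)
toy_log.columns = [f'{col}_log' for col in toy.columns]

# Standardize: subtract the column mean, divide by the column standard deviation
# (pandas .std() uses the sample standard deviation, ddof=1)
toy_log_norm = (toy_log - toy_log.mean()) / toy_log.std()

print(toy_log_norm.round(3))
```

After this step each column has mean 0 and standard deviation 1, which puts the coefficients of the later model on a comparable scale.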

Categorical Features

# One hot encode categoricals
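One possible approach uses `pd.get_dummies`. The sketch below runs on a tiny made-up frame (`toy_cat` is hypothetical); in the lab the input would be `ames[categoricals]`. Dropping the first level of each feature is a choice made here to avoid perfectly collinear dummy columns once an intercept is added.

```python
import pandas as pd

# Hypothetical stand-in for two of the categorical columns
toy_cat = pd.DataFrame({
    'BldgType': ['1Fam', 'TwnhsE', '1Fam', 'Duplex'],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# One dummy column per level, minus the first (alphabetical) level of each feature
toy_ohe = pd.get_dummies(toy_cat, drop_first=True, dtype=int)

print(toy_ohe)
```

The dropped levels (`1Fam` and `Grvl` in this toy frame) become the reference categories that the remaining coefficients are measured against.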

Combine Categorical and Continuous Features

# combine features into a single dataframe called preprocessed
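A minimal sketch of the combining step, assuming the two pieces share the same row index. `cont_part` and `cat_part` are hypothetical placeholders for your transformed continuous features and your dummy columns.

```python
import pandas as pd

# Hypothetical placeholders for the two preprocessed pieces
cont_part = pd.DataFrame({'LotArea_log': [0.1, -0.3, 0.2]})
cat_part = pd.DataFrame({'Street_Pave': [1, 0, 1]})

# axis=1 places the frames side by side, aligning rows on the index
preprocessed_toy = pd.concat([cont_part, cat_part], axis=1)

print(preprocessed_toy)
```

If the indexes of the two frames differ, `pd.concat` aligns on the index and fills the gaps with `NaN`, so it is worth checking the shape of the result.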

Run a linear model with SalePrice as the target variable in statsmodels

# Your code here

Run the same model in scikit-learn

# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels

Predict the sale price of a house with the following raw (untransformed) characteristics

Make sure to transform your variables as needed!

LotArea: 14977
1stFlrSF: 1976
GrLivArea: 1976
BldgType: 1Fam
KitchenQual: Gd
SaleType: New
MSZoning: RL
Street: Pave
Neighborhood: NridgHt
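A sketch of the transformation round trip, assuming the features and the target were log transformed and then standardized with the training mean and sample standard deviation. The training arrays and both helper functions are hypothetical; in the lab, use the statistics from your own logged columns.

```python
import numpy as np

# Hypothetical training values, used only to obtain a mean and a standard deviation
train_lot_area = np.array([8000, 9500, 11000, 12500, 15000], dtype=float)
train_price = np.array([150000, 180000, 200000, 230000, 260000], dtype=float)

def log_standardize(value, train_values):
    """Apply the same log + standardize step that was applied to the training column."""
    logged = np.log(train_values)
    return (np.log(value) - logged.mean()) / logged.std(ddof=1)

def undo_log_standardize(z, train_values):
    """Map a standardized log prediction back to the original units."""
    logged = np.log(train_values)
    return np.exp(z * logged.std(ddof=1) + logged.mean())

# A raw feature value must be transformed before it is passed to the model
lot_area_z = log_standardize(14977, train_lot_area)

# A model prediction is on the standardized log scale and must be mapped back
price_back = undo_log_standardize(log_standardize(200000, train_price), train_price)
print(lot_area_z, price_back)
```

The categorical values are handled the same way as in training: set the dummy column for each listed level to 1 (unless that level was the dropped reference category) and every other dummy column to 0.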

Summary

Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fit your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!
