In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!
You will be able to:
- Determine if it is necessary to perform normalization/standardization for a specific model or set of data
- Use standardization/normalization on features of a dataset
- Identify if it is necessary to perform log transformations on a set of features
- Perform log transformations on different features of a dataset
- Use statsmodels to fit a multiple linear regression model
- Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters
Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
- Split off and one-hot encode the categorical features of interest
- Log-transform and scale the selected continuous features
import pandas as pd
import numpy as np
ames = pd.read_csv('ames.csv')
continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']
# Log-transform and standardize the continuous features
# One-hot encode the categorical features
# Combine all features into a single DataFrame called preprocessed
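The three preprocessing steps above could be sketched as follows. This is a minimal illustration, not the lab's reference solution: the `toy` DataFrame is made-up data standing in for `ames.csv`, it uses only two of the categorical columns, and the helper `log_and_standardize` is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for ames.csv so the sketch runs on its own (assumption)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
    '1stFlrSF': [856, 1262, 920, 961, 1145, 796],
    'GrLivArea': [1710, 1262, 1786, 1717, 2198, 1362],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000],
    'BldgType': ['1Fam', '1Fam', 'Duplex', '1Fam', 'TwnhsE', '1Fam'],
    'Street': ['Pave', 'Pave', 'Pave', 'Grvl', 'Pave', 'Pave'],
})
continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'Street']  # subset of the lab's list, for brevity


def log_and_standardize(df, cols):
    """Take the natural log of each column, then rescale to mean 0 and std 1."""
    logged = np.log(df[cols])
    logged.columns = [f'{col}_log' for col in cols]
    return (logged - logged.mean()) / logged.std()


# Log-transform and standardize the continuous features
cont_scaled = log_and_standardize(toy, continuous)

# One-hot encode the categoricals, dropping one level per feature
dummies = pd.get_dummies(toy[categoricals], drop_first=True, dtype=float)

# Combine everything into a single DataFrame
preprocessed = pd.concat([cont_scaled, dummies], axis=1)
print(preprocessed.head())
```

Dropping the first level of each categorical (`drop_first=True`) avoids perfect collinearity between the dummy columns and the model intercept. On the real data, `toy` would be replaced by `ames` and `categoricals` by the full list defined earlier.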
# Your code here - Fit a multiple linear regression model with statsmodels, using SalePrice as the target
# Your code here - Fit the same model with scikit-learn and check that the coefficients and intercept match those from statsmodels
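Predict the sale price of a house with the following characteristics. The values are given on their original scale, so make sure to transform your variables as needed!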
- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
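Because the model is trained on log-transformed, standardized values, a new observation has to go through the same transformation, and the prediction has to be transformed back to dollars. A minimal sketch of that round trip, assuming made-up training data and only two predictors (`GrLivArea` and `Street`), might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data standing in for the Ames columns (assumption)
train = pd.DataFrame({
    'GrLivArea': [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave', 'Pave', 'Grvl', 'Pave', 'Pave'],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

# Keep the training means and stds of the logged columns for later reuse
logged = np.log(train[['GrLivArea', 'SalePrice']])
means, stds = logged.mean(), logged.std()
scaled = (logged - means) / stds

X = pd.DataFrame({
    'GrLivArea_log': scaled['GrLivArea'],
    'Street_Pave': (train['Street'] == 'Pave').astype(float),
})
y = scaled['SalePrice']
linreg = LinearRegression().fit(X, y)

# New house on the original scale
new_house = {'GrLivArea': 1976, 'Street': 'Pave'}

# Apply the training transformation to the new values
new_row = pd.DataFrame({
    'GrLivArea_log': [(np.log(new_house['GrLivArea']) - means['GrLivArea']) / stds['GrLivArea']],
    'Street_Pave': [1.0 if new_house['Street'] == 'Pave' else 0.0],
})

# Predict on the transformed scale, then undo the scaling and the log
pred_scaled = linreg.predict(new_row)[0]
predicted_price = np.exp(pred_scaled * stds['SalePrice'] + means['SalePrice'])
print(f'Predicted price: {predicted_price:,.0f}')
```

The key design point is that the new observation is scaled with the means and standard deviations computed on the training data, not with statistics computed from the new observation itself.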
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fit your first multiple linear regression model on the Ames Housing data using both statsmodels and scikit-learn!