cldugan / dsc-1-final-project-online-ds-sp-000


Final Project Submission


My Approach

My goal was to model the King County housing dataset with a multivariate linear regression in order to predict house sale prices as accurately as possible. First I loaded the data into a pandas DataFrame and looked over the summary. A few columns that were obviously not useful for the model were dropped. I then cleaned the data and removed independent variables that had too much missing data. I looked at correlations to check for feature collinearity, and at histograms and scatterplots to get a better understanding of the data. I set aside the categorical features to look at later while I explored the continuous independent variables. Next I scaled and, where necessary, log-transformed the data. I ran an OLS model in Statsmodels, looked at its performance and quality, and tried a few iterations to see if I could improve it. Then I explored the categorical variables, one-hot encoded the one I thought was appropriate, and added it to the model. Finally I ran a final OLS model and checked its performance with k-folds cross-validation and a train-test split.

Import Libraries

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE
from sklearn import preprocessing
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import scipy.stats as stats
plt.style.use('bmh')
%matplotlib inline

Obtain Data

kc_df = pd.read_csv("kc_house_data.csv")
kc_df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 ... 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 ... 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0.0 0.0 ... 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 ... 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 ... 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

Scrub / Explore Data

I have identified columns to delete before further exploration of the data:
id - will have no bearing on house value
date - not useful for linear regression

kc_df = kc_df.drop(['id', 'date'], axis=1)
kc_df.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 3 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503
kc_df.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.159700e+04 21597.000000 21597.000000 21597.000000 2.159700e+04 21597.000000 19221.000000 21534.000000 21597.000000 21597.000000 21597.000000 21597.000000 17755.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000
mean 5.402966e+05 3.373200 2.115826 2080.321850 1.509941e+04 1.494096 0.007596 0.233863 3.409825 7.657915 1788.596842 1970.999676 83.636778 98077.951845 47.560093 -122.213982 1986.620318 12758.283512
std 3.673681e+05 0.926299 0.768984 918.106125 4.141264e+04 0.539683 0.086825 0.765686 0.650546 1.173200 827.759761 29.375234 399.946414 53.513072 0.138552 0.140724 685.230472 27274.441950
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 370.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 1951.000000 0.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 1975.000000 0.000000 98065.000000 47.571800 -122.231000 1840.000000 7620.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068500e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

Observations:

price - this is the dependent variable. House prices run from 78,000 to 7,700,000. The high max value and a mean larger than the median suggest the data is probably skewed.
bedrooms - max value of 33 is a possible outlier; a closer look is needed.
sqft_living - possible skew or outliers on the high end.
waterfront - categorical.
view - not sure this is useful.
yr_built and yr_renovated - these are years; they could be converted to ages (a quick sketch follows below), or min-max scaling may be enough to deal with the issue.
zipcode - will need to be one-hot encoded in order to use.
lat and long - interesting, but are they useful in a linear regression?
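
As a quick sketch of the age idea (illustrative only; in the end I keep the years and min-max scale them), the conversion could look something like this, taking 2015 as the reference year since that is the latest sale year in the data:

# Sketch: converting the year columns to ages (not used in the final model).
# 2015 is assumed as the reference year (latest sale year in the dataset).
house_age = 2015 - kc_df['yr_built']
# yr_renovated uses 0 for "never renovated", so only convert the nonzero entries.
yrs_since_reno = np.where(kc_df['yr_renovated'] > 0, 2015 - kc_df['yr_renovated'], house_age)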

kc_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 19 columns):
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
lat              21597 non-null float64
long             21597 non-null float64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(8), int64(10), object(1)
memory usage: 3.1+ MB

Observations -

waterfront, yr_renovated, and view have some missing data.
sqft_basement has an object (string) dtype, which is odd and needs further examination.

kc_df.view.unique()
array([ 0., nan,  3.,  4.,  2.,  1.])
# deleting view because I don't think it has much influence on price, and it is mostly 0 or nan.
kc_df = kc_df.drop(['view'], axis=1)
# changing nan to 0 in waterfront because I think it is reasonable 
# to assume that if the house has water views it would be noted.

kc_df.waterfront.fillna(0, inplace=True)
kc_df.waterfront.unique()
array([0., 1.])
# examining sqft_basement because data type seems fishy. 
kc_df['sqft_basement'].unique()
array(['0.0', '400.0', '910.0', '1530.0', '?', '730.0', '1700.0', '300.0',
       '970.0', '760.0', '720.0', '700.0', '820.0', '780.0', '790.0',
       '330.0', '1620.0', '360.0', '588.0', '1510.0', '410.0', '990.0',
       '600.0', '560.0', '550.0', '1000.0', '1600.0', '500.0', '1040.0',
       '880.0', '1010.0', '240.0', '265.0', '290.0', '800.0', '540.0',
       '710.0', '840.0', '380.0', '770.0', '480.0', '570.0', '1490.0',
       '620.0', '1250.0', '1270.0', '120.0', '650.0', '180.0', '1130.0',
       '450.0', '1640.0', '1460.0', '1020.0', '1030.0', '750.0', '640.0',
       '1070.0', '490.0', '1310.0', '630.0', '2000.0', '390.0', '430.0',
       '850.0', '210.0', '1430.0', '1950.0', '440.0', '220.0', '1160.0',
       '860.0', '580.0', '2060.0', '1820.0', '1180.0', '200.0', '1150.0',
       '1200.0', '680.0', '530.0', '1450.0', '1170.0', '1080.0', '960.0',
       '280.0', '870.0', '1100.0', '460.0', '1400.0', '660.0', '1220.0',
       '900.0', '420.0', '1580.0', '1380.0', '475.0', '690.0', '270.0',
       '350.0', '935.0', '1370.0', '980.0', '1470.0', '160.0', '950.0',
       '50.0', '740.0', '1780.0', '1900.0', '340.0', '470.0', '370.0',
       '140.0', '1760.0', '130.0', '520.0', '890.0', '1110.0', '150.0',
       '1720.0', '810.0', '190.0', '1290.0', '670.0', '1800.0', '1120.0',
       '1810.0', '60.0', '1050.0', '940.0', '310.0', '930.0', '1390.0',
       '610.0', '1830.0', '1300.0', '510.0', '1330.0', '1590.0', '920.0',
       '1320.0', '1420.0', '1240.0', '1960.0', '1560.0', '2020.0',
       '1190.0', '2110.0', '1280.0', '250.0', '2390.0', '1230.0', '170.0',
       '830.0', '1260.0', '1410.0', '1340.0', '590.0', '1500.0', '1140.0',
       '260.0', '100.0', '320.0', '1480.0', '1060.0', '1284.0', '1670.0',
       '1350.0', '2570.0', '1090.0', '110.0', '2500.0', '90.0', '1940.0',
       '1550.0', '2350.0', '2490.0', '1481.0', '1360.0', '1135.0',
       '1520.0', '1850.0', '1660.0', '2130.0', '2600.0', '1690.0',
       '243.0', '1210.0', '1024.0', '1798.0', '1610.0', '1440.0',
       '1570.0', '1650.0', '704.0', '1910.0', '1630.0', '2360.0',
       '1852.0', '2090.0', '2400.0', '1790.0', '2150.0', '230.0', '70.0',
       '1680.0', '2100.0', '3000.0', '1870.0', '1710.0', '2030.0',
       '875.0', '1540.0', '2850.0', '2170.0', '506.0', '906.0', '145.0',
       '2040.0', '784.0', '1750.0', '374.0', '518.0', '2720.0', '2730.0',
       '1840.0', '3480.0', '2160.0', '1920.0', '2330.0', '1860.0',
       '2050.0', '4820.0', '1913.0', '80.0', '2010.0', '3260.0', '2200.0',
       '415.0', '1730.0', '652.0', '2196.0', '1930.0', '515.0', '40.0',
       '2080.0', '2580.0', '1548.0', '1740.0', '235.0', '861.0', '1890.0',
       '2220.0', '792.0', '2070.0', '4130.0', '2250.0', '2240.0',
       '1990.0', '768.0', '2550.0', '435.0', '1008.0', '2300.0', '2610.0',
       '666.0', '3500.0', '172.0', '1816.0', '2190.0', '1245.0', '1525.0',
       '1880.0', '862.0', '946.0', '1281.0', '414.0', '2180.0', '276.0',
       '1248.0', '602.0', '516.0', '176.0', '225.0', '1275.0', '266.0',
       '283.0', '65.0', '2310.0', '10.0', '1770.0', '2120.0', '295.0',
       '207.0', '915.0', '556.0', '417.0', '143.0', '508.0', '2810.0',
       '20.0', '274.0', '248.0'], dtype=object)
# Aha, there is "?" hiding in the data as placeholder.
(kc_df['sqft_basement']== '?').sum()
454

454 missing data points. I can either drop the sqft_basement column or drop the rows containing '?'. I am going with dropping the column; dropping the rows would lose ~2% of the data.
I also suspect it is collinear with the other size features.
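
For reference, the row-dropping alternative would look roughly like this (a sketch only; not used here):

# Sketch of the alternative: treat '?' as missing, drop those rows, and keep the column as floats.
alt_df = kc_df.copy()
alt_df['sqft_basement'] = pd.to_numeric(alt_df['sqft_basement'], errors='coerce')  # '?' becomes NaN
alt_df = alt_df.dropna(subset=['sqft_basement'])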

kc_df = kc_df.drop(['sqft_basement'], axis=1)
kc_df.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 0.0 3 7 1180 1955 0.0 98178 47.5112 -122.257 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 3 7 2170 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 3 6 770 1933 NaN 98028 47.7379 -122.233 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 5 7 1050 1965 0.0 98136 47.5208 -122.393 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 3 8 1680 1987 0.0 98074 47.6168 -122.045 1800 7503
kc_df.isna().sum()
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront          0
condition           0
grade               0
sqft_above          0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64
#Dropping yr_renovated, too much missing data.

kc_df = kc_df.drop('yr_renovated', axis = 1)
# Let's see what these features look like.
kc_df.hist(figsize = (12,12));

[figure: histograms of all features]

Observations - Some of the data looks skewed, confirming earlier observations. Some looks categorical; once the data is cleaned I will look at linear relationships to decide whether to treat those features as categorical or continuous.

# Dropping the row that has the 33-bedroom outlier

kc_df = kc_df[kc_df.bedrooms != 33]
kc_df.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade sqft_above yr_built zipcode lat long sqft_living15 sqft_lot15
count 2.159600e+04 21596.000000 21596.000000 21596.000000 2.159600e+04 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000
mean 5.402920e+05 3.371828 2.115843 2080.343165 1.509983e+04 1.494119 0.006761 3.409752 7.657946 1788.631506 1971.000787 98077.950685 47.560087 -122.213977 1986.650722 12758.656649
std 3.673760e+05 0.904114 0.768998 918.122038 4.141355e+04 0.539685 0.081946 0.650471 1.173218 827.763251 29.375460 53.514040 0.138552 0.140725 685.231768 27275.018316
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 1.000000 3.000000 370.000000 1900.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 3.000000 7.000000 1190.000000 1951.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.619000e+03 1.500000 0.000000 3.000000 7.000000 1560.000000 1975.000000 98065.000000 47.571800 -122.231000 1840.000000 7620.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068550e+04 2.000000 0.000000 4.000000 8.000000 2210.000000 1997.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 7.700000e+06 11.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 5.000000 13.000000 9410.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000
# Let's take a closer look at the continuous variables
column_list = ['price', 'bathrooms', 'bedrooms', 'condition', 'floors', 'grade', 'sqft_above', 'sqft_living', 'sqft_living15', 'sqft_lot', 'sqft_lot15','yr_built']
for col in column_list:
    sns.distplot(kc_df[col])
    plt.title(col)
    plt.show();

[figures: distribution plots for each of the 12 variables in column_list]

I think some of the variables are too skewed to use as-is. Some also have a lot of "peakedness". Some look categorical. I will try log-transformations and look at scatterplots to check for linear relationships with the target.

Some features are discrete rather than continuous variables. I am trying to decide whether to handle them as continuous or categorical variables.

column_list = ['bathrooms', 'bedrooms', 'condition', 'floors', 'grade']
for col in column_list:
    f, ax = plt.subplots(figsize=(12,6))
    sns.violinplot(x = kc_df[col], y = kc_df['price'])
    plt.title(col)
    plt.show();

[figures: violin plots of price vs. bathrooms, bedrooms, condition, floors, and grade]

'grade' and 'bathrooms' seem to show the strongest positive relationships with price. I am surprised the other features don't show a stronger relationship. But nothing is screaming categorical to me at this stage, so I will keep them in as continuous variables for now.

# Making a DF of cleaned features that I will normalize, scale, and then model.
# Dropping zipcode, lat, and long for now, these will have to be one-hot encoded if used.  
# Will explore later on and add back to model.

data_pred = kc_df.drop(['zipcode', 'lat', 'long'], axis=1)
data_pred.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 0.0 3 7 1180 1955 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 3 7 2170 1951 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 3 6 770 1933 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 5 7 1050 1965 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 3 8 1680 1987 1800 7503

Correlation and Collinearity

How well are the variables correlated with the target and is there any collinearity to be worried about?

# plotting heatmap of variables for correlation and collinearity
plt.figure(figsize=(12,10))
corr = abs(data_pred.corr())
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap');

[figure: Correlation Heatmap]

Target variable is 'price'. The features that are most highly correlated with 'price' are:
'sqft_living', 'grade', and 'sqft_above'. The least correlated: 'condition', 'yr_built', and 'sqft_lot15'.

Dropping 'sqft_above' because it is highly collinear with 'sqft_living' and would bias my model.
It also seems like the two are measuring almost the same thing.
I am also changing 'waterfront' to the category dtype.

data_pred = data_pred.drop(['sqft_above'], axis=1)
data_pred['waterfront'] = data_pred.waterfront.astype('category')
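
Another way to quantify collinearity beyond the heatmap is the variance inflation factor. This is an optional extra check (a sketch, not part of the original workflow), using statsmodels' variance_inflation_factor:

# Sketch: variance inflation factors as an additional collinearity check (optional).
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_X = sm.add_constant(data_pred.drop(['price', 'waterfront'], axis=1))
vif = pd.Series([variance_inflation_factor(vif_X.values, i) for i in range(vif_X.shape[1])],
                index=vif_X.columns).drop('const')
print(vif)  # values much above ~10 would point to problematic collinearity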

Scaling and Normalization

# Looking at how skewed the data is.
data_pred.skew()
price             4.023329
bedrooms          0.551382
bathrooms         0.519644
sqft_living       1.473143
sqft_lot         13.072315
floors            0.614427
waterfront       12.039300
condition         1.036107
grade             0.788166
yr_built         -0.469549
sqft_living15     1.106828
sqft_lot15        9.524159
dtype: float64

I am log-transforming the obviously skewed variables, i.e. those whose skew falls well outside the range -1 < skew < 1.
(Except 'waterfront', because it is now categorical.)

# updating my exploration df with transformed variables
data_pred["sqft_living"] = np.log(data_pred["sqft_living"])
data_pred["sqft_lot"] = np.log(data_pred["sqft_lot"])
data_pred["sqft_lot15"] = np.log(data_pred["sqft_lot15"])
data_pred["price"] = np.log(data_pred["price"])
data_pred.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade yr_built sqft_living15 sqft_lot15
0 12.309982 3 1.00 7.073270 8.639411 1.0 0.0 3 7 1955 1340 8.639411
1 13.195614 3 2.25 7.851661 8.887653 2.0 0.0 3 7 1951 1690 8.941022
2 12.100712 2 1.00 6.646391 9.210340 1.0 0.0 3 6 1933 2720 8.994917
3 13.311329 4 3.00 7.580700 8.517193 1.0 0.0 5 7 1965 1360 8.517193
4 13.142166 3 2.00 7.426549 8.997147 1.0 0.0 3 8 1987 1800 8.923058

Did the log-transformations improve the distributions?

# plots of transformed variables to see if improved by log-transformation.
column_list = ['price', 'sqft_living', 'sqft_living15', 'sqft_lot', 'sqft_lot15']
for col in column_list:
    sns.distplot(data_pred[col])
    plt.title(col)
    plt.show();

[figures: distribution plots of price, sqft_living, sqft_living15, sqft_lot, and sqft_lot15 after transformation]

These features now have a more normal distribution. A bit "peaky", but I think good enough to move on to checking visually for linear relationships with the target.
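
A quick numeric check of the same point (a sketch, simply recomputing the skew of the transformed columns):

# Sketch: recompute skew for the log-transformed columns to confirm the visual impression.
data_pred[['price', 'sqft_living', 'sqft_lot', 'sqft_lot15']].skew()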

Do the features have a linear relationship with the target?

# With jointplots we can look at the transformed histograms and KDE as well as linear relationships with target.
for column in data_pred.drop('price', axis=1):
    sns.jointplot(x=column, y='price',
                  data=data_pred, 
                  kind='reg',
                  space=0.0,
                  label=column,
                  joint_kws={'line_kws':{'color':'green'}})
    #plt.title("Price vs " + column)
    plt.legend()
    plt.show()

[figures: joint plots of each feature vs. price]

Looking at the histograms with density estimates, we can see that the log-transformed variables now have less skewed distributions. From the scatterplots with the best-fit line drawn, I am satisfied that the independent variables have a good enough linear relationship with the target to continue with modeling. The strongest linear relationships are with sqft_living and grade. I am going to treat the discrete variables as continuous (except for waterfront, which is categorical) since they appear to have a linear relationship with the target.

MinMax scaling the data

# Scaling the data. I am choosing to min-max scale so everything will be on the same 0-1 scale.


data_minMax = data_pred.drop(['waterfront', 'price'], axis=1)

for column in data_minMax:
    data_minMax[column] = (data_minMax[column]-min(data_minMax[column]))/(max(data_minMax[column])-min(data_minMax[column]))

data_df = pd.concat([data_minMax, data_pred[['price','waterfront']]], axis=1)    
data_df.head()
bedrooms bathrooms sqft_living sqft_lot floors condition grade yr_built sqft_living15 sqft_lot15 price waterfront
0 0.2 0.066667 0.322166 0.295858 0.0 0.5 0.4 0.478261 0.161934 0.300162 12.309982 0.0
1 0.2 0.233333 0.538392 0.326644 0.4 0.5 0.4 0.443478 0.222165 0.342058 13.195614 0.0
2 0.1 0.066667 0.203585 0.366664 0.0 0.5 0.3 0.286957 0.399415 0.349544 12.100712 0.0
3 0.3 0.333333 0.463123 0.280700 0.0 1.0 0.4 0.565217 0.165376 0.283185 13.311329 0.0
4 0.2 0.200000 0.420302 0.340224 0.0 0.5 0.5 0.756522 0.241094 0.339562 13.142166 0.0
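
The same scaling can also be done with sklearn (a sketch of the equivalent call, shown for reference only; it should give the same result as the manual formula above):

# Sketch: equivalent min-max scaling with sklearn's MinMaxScaler.
cols = data_pred.drop(['waterfront', 'price'], axis=1).columns
scaler = preprocessing.MinMaxScaler()
scaled_check = pd.DataFrame(scaler.fit_transform(data_pred[cols]), columns=cols, index=data_pred.index)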
# sanity check

data_df.describe()
bedrooms bathrooms sqft_living sqft_lot floors condition grade yr_built sqft_living15 sqft_lot15 price
count 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000
mean 0.237183 0.215446 0.454797 0.339315 0.197648 0.602438 0.465795 0.617398 0.273215 0.344802 13.048196
std 0.090411 0.102533 0.117836 0.111877 0.215874 0.162618 0.117322 0.255439 0.117920 0.112878 0.526562
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 11.264464
25% 0.200000 0.166667 0.375546 0.281688 0.000000 0.500000 0.400000 0.443478 0.187747 0.285936 12.682307
50% 0.200000 0.233333 0.455945 0.332938 0.200000 0.500000 0.400000 0.652174 0.247978 0.341712 13.017003
75% 0.300000 0.266667 0.536222 0.374886 0.400000 0.750000 0.500000 0.843478 0.337463 0.380616 13.377006
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 15.856731
# setting waterfront as type category.
data_df['waterfront'] = data_df.waterfront.astype('category')
#making sure everything looks right
data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21596 entries, 0 to 21596
Data columns (total 12 columns):
bedrooms         21596 non-null float64
bathrooms        21596 non-null float64
sqft_living      21596 non-null float64
sqft_lot         21596 non-null float64
floors           21596 non-null float64
condition        21596 non-null float64
grade            21596 non-null float64
yr_built         21596 non-null float64
sqft_living15    21596 non-null float64
sqft_lot15       21596 non-null float64
price            21596 non-null float64
waterfront       21596 non-null category
dtypes: category(1), float64(11)
memory usage: 2.6 MB

Modeling the Data

Running Ordinary Least Squares regression experiments in Statsmodels

I am using Statsmodels because the summary contains a lot of information and the layout is easy to read.

outcome = 'price'
predictors = data_df.drop(['price'], axis=1)
pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum
model = ols(formula= formula, data=data_df).fit()
model.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.658
Model: OLS Adj. R-squared: 0.657
Method: Least Squares F-statistic: 3767.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:34:33 Log-Likelihood: -5221.1
No. Observations: 21596 AIC: 1.047e+04
Df Residuals: 21584 BIC: 1.056e+04
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7254 0.015 766.698 0.000 11.695 11.755
waterfront[T.1.0] 0.5532 0.026 21.375 0.000 0.503 0.604
bedrooms -0.4225 0.031 -13.482 0.000 -0.484 -0.361
bathrooms 0.6303 0.036 17.288 0.000 0.559 0.702
sqft_living 1.2824 0.040 31.807 0.000 1.203 1.361
sqft_lot -0.1437 0.049 -2.956 0.003 -0.239 -0.048
floors 0.1158 0.013 8.900 0.000 0.090 0.141
condition 0.1716 0.014 12.188 0.000 0.144 0.199
grade 2.0690 0.031 65.887 0.000 2.007 2.131
yr_built -0.6877 0.011 -63.881 0.000 -0.709 -0.667
sqft_living15 0.7683 0.030 25.985 0.000 0.710 0.826
sqft_lot15 -0.3664 0.048 -7.646 0.000 -0.460 -0.272
Omnibus: 48.465 Durbin-Watson: 1.965
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.757
Skew: -0.056 Prob(JB): 4.74e-13
Kurtosis: 3.225 Cond. No. 51.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

An adjusted R-squared of 0.657 is pretty good, considering I still have data to add to the model.
All p-values are below 0.05. The skew is close to zero, and a kurtosis of 3.225 is close to the value of 3 expected for a normal distribution.
However, the high Jarque-Bera score tells me that the residuals may not be normally distributed. 'bedrooms', 'sqft_lot', 'yr_built', and 'sqft_lot15' have negative coefficients, which is something to keep an eye on.

Are the residuals consistent with a normal distribution?

# q-q plot to visualize the residuals

residuals = model.resid
fig = sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True)
plt.title("Q-Q plot of Residuals");

[figure: Q-Q plot of Residuals]

The Q-Q plot shows some peakedness, and perhaps some tailing, but for the most part the residuals seem close to a normal distribution. Real-life data won't be perfect!

# Checking model performance with cross-validaton
linreg = LinearRegression()

X = data_df.drop(['price'], axis=1)
y = data_df.price

cv_results = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
cv_results
array([-0.09485161, -0.09953494, -0.09634111, -0.09786468, -0.09079897,
       -0.09538301, -0.09304397, -0.10168181, -0.09753282, -0.09143151])
np.mean(cv_results)
-0.09584644256982355

Results look pretty consistent. A good sign.

# Running recursive feature elimination to look for candidate variables to drop to see if I can improve model.

predictors = data_df.drop(['price', 'waterfront'], axis=1)

linreg = LinearRegression()
selector = RFE(linreg, n_features_to_select = 1)
selector = selector.fit(predictors, data_df["price"])
list(zip(predictors.columns,selector.ranking_))
[('bedrooms', 7),
 ('bathrooms', 3),
 ('sqft_living', 2),
 ('sqft_lot', 8),
 ('floors', 10),
 ('condition', 9),
 ('grade', 1),
 ('yr_built', 4),
 ('sqft_living15', 5),
 ('sqft_lot15', 6)]
# dropping floors (the worst ranked) to see if it improves model. 

data_dr = data_df.drop(['floors'], axis=1)
# no floors
outcome = 'price'
predictors1 = data_dr.drop(['price'], axis=1)
pred_sum1 = "+".join(predictors1.columns)
formula1 = outcome + "~" + pred_sum1
model1 = ols(formula= formula1, data=data_dr).fit()
model1.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.656
Model: OLS Adj. R-squared: 0.656
Method: Least Squares F-statistic: 4121.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:36:48 Log-Likelihood: -5260.7
No. Observations: 21596 AIC: 1.054e+04
Df Residuals: 21585 BIC: 1.063e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7295 0.015 765.928 0.000 11.700 11.760
waterfront[T.1.0] 0.5575 0.026 21.506 0.000 0.507 0.608
bedrooms -0.4345 0.031 -13.853 0.000 -0.496 -0.373
bathrooms 0.6855 0.036 19.048 0.000 0.615 0.756
sqft_living 1.3036 0.040 32.330 0.000 1.225 1.383
sqft_lot -0.1902 0.048 -3.928 0.000 -0.285 -0.095
condition 0.1570 0.014 11.206 0.000 0.130 0.184
grade 2.1125 0.031 67.984 0.000 2.052 2.173
yr_built -0.6655 0.010 -63.435 0.000 -0.686 -0.645
sqft_living15 0.7643 0.030 25.805 0.000 0.706 0.822
sqft_lot15 -0.3904 0.048 -8.145 0.000 -0.484 -0.296
Omnibus: 49.190 Durbin-Watson: 1.962
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.684
Skew: -0.063 Prob(JB): 4.91e-13
Kurtosis: 3.217 Cond. No. 51.3


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
linreg = LinearRegression()

X = data_dr.drop(['price'], axis=1)
y = data_dr.price

cv_results1 = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
np.mean(cv_results1)
-0.0962495949425783

Dropping that variable didn't improve my model (actually slightly hurt it).

# seeing if model improves when the categorical variable 'waterfront' is removed.  It is mostly 0's.
data_dr_water = data_df.drop('waterfront', axis=1)

outcome = 'price'
predictors2 = data_dr_water.drop(['price'], axis=1)
pred_sum2 = "+".join(predictors2.columns)
formula2 = outcome + "~" + pred_sum2
model2 = ols(formula= formula2, data=data_dr_water).fit()
model2.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.650
Model: OLS Adj. R-squared: 0.650
Method: Least Squares F-statistic: 4013.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:42:54 Log-Likelihood: -5447.3
No. Observations: 21596 AIC: 1.092e+04
Df Residuals: 21585 BIC: 1.100e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7157 0.015 758.434 0.000 11.685 11.746
bedrooms -0.4667 0.032 -14.769 0.000 -0.529 -0.405
bathrooms 0.6622 0.037 17.991 0.000 0.590 0.734
sqft_living 1.2975 0.041 31.853 0.000 1.218 1.377
sqft_lot -0.1571 0.049 -3.197 0.001 -0.253 -0.061
floors 0.1209 0.013 9.204 0.000 0.095 0.147
condition 0.1724 0.014 12.118 0.000 0.145 0.200
grade 2.0878 0.032 65.822 0.000 2.026 2.150
yr_built -0.7056 0.011 -65.053 0.000 -0.727 -0.684
sqft_living15 0.7740 0.030 25.906 0.000 0.715 0.833
sqft_lot15 -0.3259 0.048 -6.737 0.000 -0.421 -0.231
Omnibus: 50.905 Durbin-Watson: 1.967
Prob(Omnibus): 0.000 Jarque-Bera (JB): 64.257
Skew: -0.015 Prob(JB): 1.11e-14
Kurtosis: 3.266 Cond. No. 51.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
linreg = LinearRegression()

X = data_dr.drop(['price','waterfront'], axis=1)
y = data_dr.price

cv_results2 = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
np.mean(cv_results2)
-0.09832069899904562

Removing waterfront hurts my model. I will keep the original model and move on to examining lat, long, and zipcode.

Categorical Variables

Location, location, location

I have three independent variables left to examine: lat, long, and zipcode. These are all geographical data indicating where a house is located. I will look at each one to see how best to add it to my model.

# How many unique values?
kc_df.zipcode.nunique()
70
kc_df.lat.nunique()
5033
kc_df.long.nunique()
751

Does location affect price?

# Scatter plot of long and lat color mapped to log-transformed price data.

kc_df.plot(kind="scatter", x="long", y="lat", alpha=0.4, figsize=(16,10),
    c=data_pred["price"], cmap="rainbow", colorbar=True,
    sharex=False)
plt.title("Location and Log-Transformed Price")
plt.show()

[figure: Location and Log-Transformed Price]

Obviously location affects price! You can see how the house prices are higher in certain areas.

# Looking at a hexbin plot for density of locations (just curious)
kc_df.plot.hexbin(x='long', y='lat', figsize=(12,8))
plt.title("Housing Density");

[figure: Housing Density]

Moving on to look at zipcode

kc_df['zipcode'] = kc_df.zipcode.astype('int')
kc_df.plot(kind="scatter", x="long", y="lat", alpha=0.6, figsize=(12,8),
    c='zipcode', cmap="rainbow", colorbar=True,
    sharex=False)
plt.title("Zipcode Locations");

[figure: Zipcode Locations]

I don't see any way to easily group zipcodes. I will do some further exploration, but will probably end up one-hot encoding the feature into categories.

Lat and long gave me a nice graph that showed the importance of location on price, but I don't see an easy way to use them in my linear regression model. Using lat and long as well as zipcode would be redundant anyway, so I will use zipcode in my model.

# looking at Zipcode
kc_df.zipcode.hist(bins = 70, figsize=(10,10), label='zipcode')
plt.legend()
plt.title('Zipcode');

[figure: Zipcode histogram]

Not normally distributed at all. Definitely categorical.

kc_df['zipcode'].value_counts()
98103    601
98038    589
98115    583
98052    574
98117    553
        ... 
98102    104
98010    100
98024     80
98148     57
98039     50
Name: zipcode, Length: 70, dtype: int64
# One-hot encoding 'zipcode' variable and adding to my model.

kc_df['zipcode'] = kc_df.zipcode.astype('str')
zip_dummy = pd.get_dummies(kc_df.zipcode, prefix = 'ZC')

final_df = pd.concat([data_df, zip_dummy], axis=1)
# Checking data to see if everything is how I want it.
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21596 entries, 0 to 21596
Data columns (total 82 columns):
bedrooms         21596 non-null float64
bathrooms        21596 non-null float64
sqft_living      21596 non-null float64
sqft_lot         21596 non-null float64
floors           21596 non-null float64
condition        21596 non-null float64
grade            21596 non-null float64
yr_built         21596 non-null float64
sqft_living15    21596 non-null float64
sqft_lot15       21596 non-null float64
price            21596 non-null float64
waterfront       21596 non-null category
ZC_98001         21596 non-null uint8
ZC_98002         21596 non-null uint8
ZC_98003         21596 non-null uint8
ZC_98004         21596 non-null uint8
ZC_98005         21596 non-null uint8
ZC_98006         21596 non-null uint8
ZC_98007         21596 non-null uint8
ZC_98008         21596 non-null uint8
ZC_98010         21596 non-null uint8
ZC_98011         21596 non-null uint8
ZC_98014         21596 non-null uint8
ZC_98019         21596 non-null uint8
ZC_98022         21596 non-null uint8
ZC_98023         21596 non-null uint8
ZC_98024         21596 non-null uint8
ZC_98027         21596 non-null uint8
ZC_98028         21596 non-null uint8
ZC_98029         21596 non-null uint8
ZC_98030         21596 non-null uint8
ZC_98031         21596 non-null uint8
ZC_98032         21596 non-null uint8
ZC_98033         21596 non-null uint8
ZC_98034         21596 non-null uint8
ZC_98038         21596 non-null uint8
ZC_98039         21596 non-null uint8
ZC_98040         21596 non-null uint8
ZC_98042         21596 non-null uint8
ZC_98045         21596 non-null uint8
ZC_98052         21596 non-null uint8
ZC_98053         21596 non-null uint8
ZC_98055         21596 non-null uint8
ZC_98056         21596 non-null uint8
ZC_98058         21596 non-null uint8
ZC_98059         21596 non-null uint8
ZC_98065         21596 non-null uint8
ZC_98070         21596 non-null uint8
ZC_98072         21596 non-null uint8
ZC_98074         21596 non-null uint8
ZC_98075         21596 non-null uint8
ZC_98077         21596 non-null uint8
ZC_98092         21596 non-null uint8
ZC_98102         21596 non-null uint8
ZC_98103         21596 non-null uint8
ZC_98105         21596 non-null uint8
ZC_98106         21596 non-null uint8
ZC_98107         21596 non-null uint8
ZC_98108         21596 non-null uint8
ZC_98109         21596 non-null uint8
ZC_98112         21596 non-null uint8
ZC_98115         21596 non-null uint8
ZC_98116         21596 non-null uint8
ZC_98117         21596 non-null uint8
ZC_98118         21596 non-null uint8
ZC_98119         21596 non-null uint8
ZC_98122         21596 non-null uint8
ZC_98125         21596 non-null uint8
ZC_98126         21596 non-null uint8
ZC_98133         21596 non-null uint8
ZC_98136         21596 non-null uint8
ZC_98144         21596 non-null uint8
ZC_98146         21596 non-null uint8
ZC_98148         21596 non-null uint8
ZC_98155         21596 non-null uint8
ZC_98166         21596 non-null uint8
ZC_98168         21596 non-null uint8
ZC_98177         21596 non-null uint8
ZC_98178         21596 non-null uint8
ZC_98188         21596 non-null uint8
ZC_98198         21596 non-null uint8
ZC_98199         21596 non-null uint8
dtypes: category(1), float64(11), uint8(70)
memory usage: 4.1 MB
# Dropping one of the zipcode dummy columns to avoid the dummy variable trap; ZC_98004 becomes the baseline category.

final_df = final_df.drop('ZC_98004', axis =1)
# Modeling dataset with multivariate linear regression.
outcome = 'price'
predictors_fin = final_df.drop(['price'], axis=1)
pred_sum_fin = "+".join(predictors_fin.columns)
formula_fin = outcome + "~" + pred_sum_fin
model_fin = ols(formula= formula_fin, data=final_df).fit()
model_fin.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.876
Model: OLS Adj. R-squared: 0.876
Method: Least Squares F-statistic: 1900.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:45:18 Log-Likelihood: 5749.7
No. Observations: 21596 AIC: -1.134e+04
Df Residuals: 21515 BIC: -1.069e+04
Df Model: 80
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 12.1556 0.016 775.067 0.000 12.125 12.186
waterfront[T.1.0] 0.6583 0.016 41.414 0.000 0.627 0.689
bedrooms -0.1885 0.019 -9.805 0.000 -0.226 -0.151
bathrooms 0.3335 0.022 15.025 0.000 0.290 0.377
sqft_living 1.3788 0.024 56.282 0.000 1.331 1.427
sqft_lot 0.6374 0.030 21.233 0.000 0.579 0.696
floors 0.0325 0.008 3.955 0.000 0.016 0.049
condition 0.1875 0.009 21.423 0.000 0.170 0.205
grade 1.0339 0.020 50.846 0.000 0.994 1.074
yr_built -0.0830 0.008 -9.992 0.000 -0.099 -0.067
sqft_living15 0.5665 0.019 29.827 0.000 0.529 0.604
sqft_lot15 -0.1606 0.030 -5.417 0.000 -0.219 -0.102
ZC_98001 -1.1096 0.015 -76.305 0.000 -1.138 -1.081
ZC_98002 -1.1035 0.017 -64.573 0.000 -1.137 -1.070
ZC_98003 -1.0927 0.015 -71.045 0.000 -1.123 -1.063
ZC_98005 -0.4173 0.018 -23.519 0.000 -0.452 -0.383
ZC_98006 -0.4820 0.013 -36.050 0.000 -0.508 -0.456
ZC_98007 -0.4771 0.019 -25.294 0.000 -0.514 -0.440
ZC_98008 -0.4497 0.015 -29.402 0.000 -0.480 -0.420
ZC_98010 -0.8701 0.022 -40.265 0.000 -0.912 -0.828
ZC_98011 -0.6794 0.017 -39.961 0.000 -0.713 -0.646
ZC_98014 -0.8082 0.020 -40.051 0.000 -0.848 -0.769
ZC_98019 -0.7985 0.017 -46.039 0.000 -0.832 -0.764
ZC_98022 -1.0278 0.016 -62.777 0.000 -1.060 -0.996
ZC_98023 -1.1444 0.013 -84.799 0.000 -1.171 -1.118
ZC_98024 -0.6904 0.024 -29.229 0.000 -0.737 -0.644
ZC_98027 -0.6202 0.014 -44.335 0.000 -0.648 -0.593
ZC_98028 -0.7035 0.015 -45.945 0.000 -0.733 -0.673
ZC_98029 -0.5153 0.015 -34.582 0.000 -0.545 -0.486
ZC_98030 -1.0611 0.016 -67.117 0.000 -1.092 -1.030
ZC_98031 -1.0467 0.016 -67.396 0.000 -1.077 -1.016
ZC_98032 -1.1327 0.020 -57.270 0.000 -1.171 -1.094
ZC_98033 -0.3207 0.014 -23.236 0.000 -0.348 -0.294
ZC_98034 -0.5660 0.013 -42.614 0.000 -0.592 -0.540
ZC_98038 -0.9435 0.013 -71.255 0.000 -0.969 -0.917
ZC_98039 0.1698 0.028 5.996 0.000 0.114 0.225
ZC_98040 -0.2411 0.015 -15.818 0.000 -0.271 -0.211
ZC_98042 -1.0480 0.013 -78.388 0.000 -1.074 -1.022
ZC_98045 -0.7847 0.017 -47.180 0.000 -0.817 -0.752
ZC_98052 -0.4951 0.013 -37.861 0.000 -0.521 -0.469
ZC_98053 -0.5381 0.014 -37.952 0.000 -0.566 -0.510
ZC_98055 -0.9589 0.016 -61.434 0.000 -0.990 -0.928
ZC_98056 -0.7780 0.014 -55.091 0.000 -0.806 -0.750
ZC_98058 -0.9558 0.014 -69.500 0.000 -0.983 -0.929
ZC_98059 -0.7780 0.014 -56.945 0.000 -0.805 -0.751
ZC_98065 -0.6962 0.015 -46.067 0.000 -0.726 -0.667
ZC_98070 -0.7864 0.021 -37.660 0.000 -0.827 -0.745
ZC_98072 -0.6629 0.015 -42.806 0.000 -0.693 -0.633
ZC_98074 -0.5790 0.014 -42.117 0.000 -0.606 -0.552
ZC_98075 -0.5751 0.014 -39.906 0.000 -0.603 -0.547
ZC_98077 -0.7214 0.017 -42.178 0.000 -0.755 -0.688
ZC_98092 -1.0924 0.015 -74.832 0.000 -1.121 -1.064
ZC_98102 -0.1388 0.021 -6.482 0.000 -0.181 -0.097
ZC_98103 -0.2641 0.014 -19.545 0.000 -0.291 -0.238
ZC_98105 -0.1567 0.016 -9.527 0.000 -0.189 -0.124
ZC_98106 -0.7337 0.015 -49.121 0.000 -0.763 -0.704
ZC_98107 -0.2353 0.016 -14.759 0.000 -0.266 -0.204
ZC_98108 -0.7362 0.017 -42.266 0.000 -0.770 -0.702
ZC_98109 -0.0950 0.021 -4.523 0.000 -0.136 -0.054
ZC_98112 -0.0725 0.016 -4.592 0.000 -0.103 -0.042
ZC_98115 -0.2788 0.013 -20.967 0.000 -0.305 -0.253
ZC_98116 -0.3103 0.015 -20.784 0.000 -0.340 -0.281
ZC_98117 -0.2762 0.014 -20.441 0.000 -0.303 -0.250
ZC_98118 -0.6203 0.014 -45.474 0.000 -0.647 -0.594
ZC_98119 -0.1012 0.018 -5.730 0.000 -0.136 -0.067
ZC_98122 -0.2789 0.016 -17.879 0.000 -0.310 -0.248
ZC_98125 -0.5284 0.014 -37.362 0.000 -0.556 -0.501
ZC_98126 -0.5174 0.015 -35.092 0.000 -0.546 -0.488
ZC_98133 -0.6402 0.014 -46.879 0.000 -0.667 -0.613
ZC_98136 -0.3872 0.016 -24.552 0.000 -0.418 -0.356
ZC_98144 -0.4040 0.015 -27.193 0.000 -0.433 -0.375
ZC_98146 -0.7953 0.015 -51.629 0.000 -0.825 -0.765
ZC_98148 -0.9436 0.027 -35.110 0.000 -0.996 -0.891
ZC_98155 -0.6741 0.014 -48.689 0.000 -0.701 -0.647
ZC_98166 -0.7819 0.016 -49.471 0.000 -0.813 -0.751
ZC_98168 -1.0264 0.016 -65.208 0.000 -1.057 -0.996
ZC_98177 -0.4912 0.016 -31.329 0.000 -0.522 -0.460
ZC_98178 -0.9335 0.016 -59.218 0.000 -0.964 -0.903
ZC_98188 -1.0007 0.019 -52.055 0.000 -1.038 -0.963
ZC_98198 -1.0156 0.015 -65.759 0.000 -1.046 -0.985
ZC_98199 -0.2298 0.015 -15.348 0.000 -0.259 -0.200
Omnibus: 1370.410 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6174.051
Skew: -0.107 Prob(JB): 0.00
Kurtosis: 5.611 Cond. No. 116.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Adding zipcode to my model increased the adjusted R-squared from 0.657 to 0.876, so zipcode accounts for roughly an additional 21.9% of the variance in price. That's a lot!

The adjusted R-squared of 0.876 in my final model is pretty good. All the p-values are less than 0.05. The biggest issue I see is the negative coefficients for 'bedrooms' and 'yr_built'. I think that points to interactions between features that may not be obvious. I am keeping them in, using the p-values as justification.
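
One way to probe that suspicion (a sketch only, not something I ran for this project) would be to add an interaction term to the formula and see whether the 'bedrooms' coefficient changes:

# Sketch: adding a hypothetical interaction term between bedrooms and sqft_living.
interaction_formula = formula_fin + "+bedrooms:sqft_living"
interaction_model = ols(formula=interaction_formula, data=final_df).fit()
print(interaction_model.params[['bedrooms', 'bedrooms:sqft_living']])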

Model Validation

# K-folds cross validation of my final model using negative mean squared error
linreg = LinearRegression()

X = final_df.drop(['price'], axis=1)
y = final_df.price

cv_results3 = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
cv_results3
array([-0.03428675, -0.0370451 , -0.03566962, -0.0357842 , -0.03316145,
       -0.0357037 , -0.03419218, -0.03669121, -0.03534284, -0.0309256 ])
np.mean(cv_results3)
-0.03488026597319434
# Coefficient of variation of cross validation results
abs(np.std(cv_results3)/np.mean(cv_results3))*100
4.942280141700788
# Using R-squared (the default scorer for regression)
linreg = LinearRegression()

X = final_df.drop(['price'], axis=1)
y = final_df.price

cv_results4 = cross_val_score(linreg, X, y, cv=10)
cv_results4
array([0.8759028 , 0.87644342, 0.86667069, 0.8741245 , 0.8675731 ,
       0.87361936, 0.87786505, 0.87663803, 0.87589277, 0.86835936])

Adding zipcode also improved my cross validation score. And the results showed little variation.

# Train-test-split as another check on my model's performance.
y = final_df[["price"]]
X = final_df.drop(["price"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
print('Train Mean Squared Error:', train_mse)
print('Test Mean Squared Error:', test_mse)
Train Mean Squared Error: 0.03466335236355768
Test Mean Squared Error: 0.03337063914965873

The test MSE and train MSE are very similar, giving me confidence that the model isn't overfit.

Interpret Results

Final Model Summary

For the final model I used 12 independent variables (80 if you count the 69 dummy variables used for zipcode separately) and 21596 data points. The OLS regression model has an adjusted R-squared of 0.876, which gives me a fairly high level of confidence in the model's ability to predict housing prices. The remaining 0.124 could reflect factors such as features that weren't included in the dataset, sampling error, seasonal market fluctuations, or less tangible factors like a seller's skills or bidding wars that bump up the price. The features with the largest effect on sale price appear to be waterfront, zipcode, grade, sqft_living, and bathrooms.

The p-values of all my independent variables are less than 0.05, so each coefficient is statistically significant at the 95% confidence level. Another factor that gives me confidence in the model's performance is the k-folds cross-validation: there was little variation across the resulting negative mean squared errors (average of -0.035 and a CV of 4.9%). I also performed a train-test split validation, which produced similar results (train MSE 0.035, test MSE 0.033), showing I haven't overfit the model.

Interpreting Coefficients

Here is a closer look at some of the coefficients (the dependent variable, 'price', is log-transformed):

The independent variable 'sqft_living' was log-transformed and then min-max scaled, so its coefficient of 1.3788 applies to the 0-1 scaled variable. Dividing by the range of log(sqft_living) (about 3.6) gives an elasticity of roughly 0.38: a 1% increase in living area corresponds to roughly a 0.4% increase in sale price, with everything else remaining unchanged.

The categorical feature 'waterfront' has a coefficient of 0.6583. So if the property is on the waterfront, we expect the price to be about 93% higher (exp(0.6583) ≈ 1.93), with everything else remaining unchanged.

The min-max scaled feature 'grade' has a coefficient of 1.0339. A grade of 4 becomes 0.1 after min-max scaling and a grade of 8 becomes 0.5, so going from a grade of 4 to a grade of 8 corresponds to an expected increase in sale price of about 51% (exp(1.0339 × 0.4) ≈ 1.51), with everything else remaining unchanged.
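
Because the target is log(price), the arithmetic behind these percentages is just an exp() back-transformation. A small sketch of the calculations above:

# Sketch: turning log-price coefficients into approximate % effects on price.
waterfront_pct = (np.exp(0.6583) - 1) * 100                  # ~93% premium for waterfront
grade_4_to_8_pct = (np.exp(1.0339 * (0.5 - 0.1)) - 1) * 100  # ~51% going from grade 4 to grade 8
sqft_elasticity = 1.3788 / (np.log(13540) - np.log(370))     # ~0.38% price change per 1% more living area
print(waterfront_pct, grade_4_to_8_pct, sqft_elasticity)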
