cldugan / dsc-1-final-project-online-ds-sp-000


Final Project Submission


My Approach

My goal was to model the King County housing dataset with a multivariate linear regression in order to predict house sale prices as accurately as possible. First I loaded the data into a pandas DataFrame and looked over the summary. A few columns that were obviously not useful for the model were dropped. I then cleaned the data and removed independent variables that had too much missing data. I looked at correlations to check for feature collinearity, and at histograms and scatterplots to get a better understanding of the data. I set aside the categorical features to look at later while I explored the continuous independent variables. Next I scaled and, where necessary, log-transformed the data. I ran an OLS model in Statsmodels, looked at its performance and quality, and tried a few iterations to see if I could improve it. Then I explored the categorical variables, one-hot encoded the one I thought was appropriate, and added it to the model. Finally I ran a final OLS model and checked its performance with k-folds cross-validation and a train-test split.

Import Libraries

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE
from sklearn import preprocessing
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import scipy.stats as stats
plt.style.use('bmh')
%matplotlib inline

Obtain Data

kc_df = pd.read_csv("kc_house_data.csv")
kc_df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 ... 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 ... 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0.0 0.0 ... 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 ... 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 ... 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

Scrub / Explore Data

I have identified columns to delete before further exploration of the data:
id - will have no bearing on house value
date - not useful for linear regression

kc_df = kc_df.drop(['id', 'date'], axis=1)
kc_df.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 3 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503
kc_df.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.159700e+04 21597.000000 21597.000000 21597.000000 2.159700e+04 21597.000000 19221.000000 21534.000000 21597.000000 21597.000000 21597.000000 21597.000000 17755.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000
mean 5.402966e+05 3.373200 2.115826 2080.321850 1.509941e+04 1.494096 0.007596 0.233863 3.409825 7.657915 1788.596842 1970.999676 83.636778 98077.951845 47.560093 -122.213982 1986.620318 12758.283512
std 3.673681e+05 0.926299 0.768984 918.106125 4.141264e+04 0.539683 0.086825 0.765686 0.650546 1.173200 827.759761 29.375234 399.946414 53.513072 0.138552 0.140724 685.230472 27274.441950
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 370.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 1951.000000 0.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 1975.000000 0.000000 98065.000000 47.571800 -122.231000 1840.000000 7620.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068500e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

Observations:

price - this is the dependent variable. House prices run from 78,000 to 7,700,000. The high max value and a mean larger than the median suggest the data is probably skewed.
bedrooms - max value of 33 is a possible outlier; a closer look is needed.
sqft_living - possible skew or outliers on the high end.
waterfront - categorical.
view - not sure this is useful.
yr_built and yr_renovated - these are years; they could be converted to ages (a quick sketch follows below), or min-max scaling may be enough to deal with the issue.
zipcode - will need to be one-hot encoded in order to use.
lat and long - interesting, but are they useful in a linear regression?
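
As a quick sketch of the age idea (illustrative only; in the end I keep the years and min-max scale them), the conversion could look something like this, taking 2015 as the reference year since that is the latest sale year in the data:

# Sketch: converting the year columns to ages (not used in the final model).
# 2015 is assumed as the reference year (latest sale year in the dataset).
house_age = 2015 - kc_df['yr_built']
# yr_renovated uses 0 for "never renovated", so only convert the nonzero entries.
yrs_since_reno = np.where(kc_df['yr_renovated'] > 0, 2015 - kc_df['yr_renovated'], house_age)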

kc_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 19 columns):
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
lat              21597 non-null float64
long             21597 non-null float64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(8), int64(10), object(1)
memory usage: 3.1+ MB

Observations -

waterfront, yr_renovated, and view have some missing data.
sqft_basement has an object (string) dtype, which is odd and needs further examination.

kc_df.view.unique()
array([ 0., nan,  3.,  4.,  2.,  1.])
# deleting view because I don't think it has much influence on price, and it is mostly 0 or nan.
kc_df = kc_df.drop(['view'], axis=1)
# changing nan to 0 in waterfront because I think it is reasonable 
# to assume that if the house has water views it would be noted.

kc_df.waterfront.fillna(0, inplace=True)
kc_df.waterfront.unique()
array([0., 1.])
# examining sqft_basement because data type seems fishy. 
kc_df['sqft_basement'].unique()
array(['0.0', '400.0', '910.0', '1530.0', '?', '730.0', '1700.0', '300.0',
       '970.0', '760.0', '720.0', '700.0', '820.0', '780.0', '790.0',
       '330.0', '1620.0', '360.0', '588.0', '1510.0', '410.0', '990.0',
       '600.0', '560.0', '550.0', '1000.0', '1600.0', '500.0', '1040.0',
       '880.0', '1010.0', '240.0', '265.0', '290.0', '800.0', '540.0',
       '710.0', '840.0', '380.0', '770.0', '480.0', '570.0', '1490.0',
       '620.0', '1250.0', '1270.0', '120.0', '650.0', '180.0', '1130.0',
       '450.0', '1640.0', '1460.0', '1020.0', '1030.0', '750.0', '640.0',
       '1070.0', '490.0', '1310.0', '630.0', '2000.0', '390.0', '430.0',
       '850.0', '210.0', '1430.0', '1950.0', '440.0', '220.0', '1160.0',
       '860.0', '580.0', '2060.0', '1820.0', '1180.0', '200.0', '1150.0',
       '1200.0', '680.0', '530.0', '1450.0', '1170.0', '1080.0', '960.0',
       '280.0', '870.0', '1100.0', '460.0', '1400.0', '660.0', '1220.0',
       '900.0', '420.0', '1580.0', '1380.0', '475.0', '690.0', '270.0',
       '350.0', '935.0', '1370.0', '980.0', '1470.0', '160.0', '950.0',
       '50.0', '740.0', '1780.0', '1900.0', '340.0', '470.0', '370.0',
       '140.0', '1760.0', '130.0', '520.0', '890.0', '1110.0', '150.0',
       '1720.0', '810.0', '190.0', '1290.0', '670.0', '1800.0', '1120.0',
       '1810.0', '60.0', '1050.0', '940.0', '310.0', '930.0', '1390.0',
       '610.0', '1830.0', '1300.0', '510.0', '1330.0', '1590.0', '920.0',
       '1320.0', '1420.0', '1240.0', '1960.0', '1560.0', '2020.0',
       '1190.0', '2110.0', '1280.0', '250.0', '2390.0', '1230.0', '170.0',
       '830.0', '1260.0', '1410.0', '1340.0', '590.0', '1500.0', '1140.0',
       '260.0', '100.0', '320.0', '1480.0', '1060.0', '1284.0', '1670.0',
       '1350.0', '2570.0', '1090.0', '110.0', '2500.0', '90.0', '1940.0',
       '1550.0', '2350.0', '2490.0', '1481.0', '1360.0', '1135.0',
       '1520.0', '1850.0', '1660.0', '2130.0', '2600.0', '1690.0',
       '243.0', '1210.0', '1024.0', '1798.0', '1610.0', '1440.0',
       '1570.0', '1650.0', '704.0', '1910.0', '1630.0', '2360.0',
       '1852.0', '2090.0', '2400.0', '1790.0', '2150.0', '230.0', '70.0',
       '1680.0', '2100.0', '3000.0', '1870.0', '1710.0', '2030.0',
       '875.0', '1540.0', '2850.0', '2170.0', '506.0', '906.0', '145.0',
       '2040.0', '784.0', '1750.0', '374.0', '518.0', '2720.0', '2730.0',
       '1840.0', '3480.0', '2160.0', '1920.0', '2330.0', '1860.0',
       '2050.0', '4820.0', '1913.0', '80.0', '2010.0', '3260.0', '2200.0',
       '415.0', '1730.0', '652.0', '2196.0', '1930.0', '515.0', '40.0',
       '2080.0', '2580.0', '1548.0', '1740.0', '235.0', '861.0', '1890.0',
       '2220.0', '792.0', '2070.0', '4130.0', '2250.0', '2240.0',
       '1990.0', '768.0', '2550.0', '435.0', '1008.0', '2300.0', '2610.0',
       '666.0', '3500.0', '172.0', '1816.0', '2190.0', '1245.0', '1525.0',
       '1880.0', '862.0', '946.0', '1281.0', '414.0', '2180.0', '276.0',
       '1248.0', '602.0', '516.0', '176.0', '225.0', '1275.0', '266.0',
       '283.0', '65.0', '2310.0', '10.0', '1770.0', '2120.0', '295.0',
       '207.0', '915.0', '556.0', '417.0', '143.0', '508.0', '2810.0',
       '20.0', '274.0', '248.0'], dtype=object)
# Aha, there is "?" hiding in the data as placeholder.
(kc_df['sqft_basement']== '?').sum()
454

454 missing data points. I can either drop the sqft_basement column or drop the rows containing '?'. I am going with dropping the column; dropping the rows would lose ~2% of the data.
I also suspect it is collinear with the other size features.
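
For reference, the row-dropping alternative would look roughly like this (a sketch only; not used here):

# Sketch of the alternative: treat '?' as missing, drop those rows, and keep the column as floats.
alt_df = kc_df.copy()
alt_df['sqft_basement'] = pd.to_numeric(alt_df['sqft_basement'], errors='coerce')  # '?' becomes NaN
alt_df = alt_df.dropna(subset=['sqft_basement'])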

kc_df = kc_df.drop(['sqft_basement'], axis=1)
kc_df.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 0.0 3 7 1180 1955 0.0 98178 47.5112 -122.257 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 3 7 2170 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 3 6 770 1933 NaN 98028 47.7379 -122.233 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 5 7 1050 1965 0.0 98136 47.5208 -122.393 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 3 8 1680 1987 0.0 98074 47.6168 -122.045 1800 7503
kc_df.isna().sum()
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront          0
condition           0
grade               0
sqft_above          0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64
#Dropping yr_renovated, too much missing data.

kc_df = kc_df.drop('yr_renovated', axis = 1)
# Let's see what these features look like.
kc_df.hist(figsize = (12,12));

[figure: histograms of all features]

Observations - Some of the data looks skewed, confirming earlier observations. Some looks categorical; once the data is cleaned I will look at linear relationships to decide whether to treat those features as categorical or continuous.

# Dropping the row that has the 33-bedroom outlier

kc_df = kc_df[kc_df.bedrooms != 33]
kc_df.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade sqft_above yr_built zipcode lat long sqft_living15 sqft_lot15
count 2.159600e+04 21596.000000 21596.000000 21596.000000 2.159600e+04 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000
mean 5.402920e+05 3.371828 2.115843 2080.343165 1.509983e+04 1.494119 0.006761 3.409752 7.657946 1788.631506 1971.000787 98077.950685 47.560087 -122.213977 1986.650722 12758.656649
std 3.673760e+05 0.904114 0.768998 918.122038 4.141355e+04 0.539685 0.081946 0.650471 1.173218 827.763251 29.375460 53.514040 0.138552 0.140725 685.231768 27275.018316
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 1.000000 3.000000 370.000000 1900.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 3.000000 7.000000 1190.000000 1951.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.619000e+03 1.500000 0.000000 3.000000 7.000000 1560.000000 1975.000000 98065.000000 47.571800 -122.231000 1840.000000 7620.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068550e+04 2.000000 0.000000 4.000000 8.000000 2210.000000 1997.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 7.700000e+06 11.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 5.000000 13.000000 9410.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000
# Let's take a closer look at the continuous variables
column_list = ['price', 'bathrooms', 'bedrooms', 'condition', 'floors', 'grade', 'sqft_above', 'sqft_living', 'sqft_living15', 'sqft_lot', 'sqft_lot15','yr_built']
for col in column_list:
    sns.distplot(kc_df[col])
    plt.title(col)
    plt.show();

[figures: distribution plots for each of the 12 variables in column_list]

I think some of the variables are too skewed to use as-is. Some also have a lot of "peakedness". Some look categorical. I will try log-transformations and look at scatterplots to check for linear relationships with the target.

Some features are discrete rather than continuous variables. I am trying to decide whether to handle them as continuous or categorical variables.

column_list = ['bathrooms', 'bedrooms', 'condition', 'floors', 'grade']
for col in column_list:
    f, ax = plt.subplots(figsize=(12,6))
    sns.violinplot(x = kc_df[col], y = kc_df['price'])
    plt.title(col)
    plt.show();

[figures: violin plots of price vs. bathrooms, bedrooms, condition, floors, and grade]

'grade' and 'bathrooms' seem to show the strongest positive relationships with price. I am surprised the other features don't show a stronger relationship. But nothing is screaming categorical to me at this stage, so I will keep them in as continuous variables for now.

# Making a DF of cleaned features that I will normalize, scale, and then model.
# Dropping zipcode, lat, and long for now, these will have to be one-hot encoded if used.  
# Will explore later on and add back to model.

data_pred = kc_df.drop(['zipcode', 'lat', 'long'], axis=1)
data_pred.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 0.0 3 7 1180 1955 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 3 7 2170 1951 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 3 6 770 1933 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 5 7 1050 1965 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 3 8 1680 1987 1800 7503

Correlation and Collinearity

How well are the variables correlated with the target and is there any collinearity to be worried about?

# plotting heatmap of variables for correlation and collinearity
plt.figure(figsize=(12,10))
corr = abs(data_pred.corr())
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap');

[figure: Correlation Heatmap]

Target variable is 'price'. The features that are most highly correlated with 'price' are:
'sqft_living', 'grade', and 'sqft_above'. The least correlated: 'condition', 'yr_built', and 'sqft_lot15'.

Dropping 'sqft_above' because it is highly collinear with 'sqft_living' and would bias my model.
It also seems like the two are measuring almost the same thing.
I am also changing 'waterfront' to the category dtype.

data_pred = data_pred.drop(['sqft_above'], axis=1)
data_pred['waterfront'] = data_pred.waterfront.astype('category')
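
Another way to quantify collinearity beyond the heatmap is the variance inflation factor. This is an optional extra check (a sketch, not part of the original workflow), using statsmodels' variance_inflation_factor:

# Sketch: variance inflation factors as an additional collinearity check (optional).
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_X = sm.add_constant(data_pred.drop(['price', 'waterfront'], axis=1))
vif = pd.Series([variance_inflation_factor(vif_X.values, i) for i in range(vif_X.shape[1])],
                index=vif_X.columns).drop('const')
print(vif)  # values much above ~10 would point to problematic collinearity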

Scaling and Normalization

# Looking at how skewed the data is.
data_pred.skew()
price             4.023329
bedrooms          0.551382
bathrooms         0.519644
sqft_living       1.473143
sqft_lot         13.072315
floors            0.614427
waterfront       12.039300
condition         1.036107
grade             0.788166
yr_built         -0.469549
sqft_living15     1.106828
sqft_lot15        9.524159
dtype: float64

I am log-transforming the obviously skewed variables, i.e. those whose skew falls well outside the range -1 < skew < 1.
(Except 'waterfront', because it is now categorical.)

# updating my exploration df with transformed variables
data_pred["sqft_living"] = np.log(data_pred["sqft_living"])
data_pred["sqft_lot"] = np.log(data_pred["sqft_lot"])
data_pred["sqft_lot15"] = np.log(data_pred["sqft_lot15"])
data_pred["price"] = np.log(data_pred["price"])
data_pred.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront condition grade yr_built sqft_living15 sqft_lot15
0 12.309982 3 1.00 7.073270 8.639411 1.0 0.0 3 7 1955 1340 8.639411
1 13.195614 3 2.25 7.851661 8.887653 2.0 0.0 3 7 1951 1690 8.941022
2 12.100712 2 1.00 6.646391 9.210340 1.0 0.0 3 6 1933 2720 8.994917
3 13.311329 4 3.00 7.580700 8.517193 1.0 0.0 5 7 1965 1360 8.517193
4 13.142166 3 2.00 7.426549 8.997147 1.0 0.0 3 8 1987 1800 8.923058

Did the log-transformations improve the distributions?

# plots of transformed variables to see if improved by log-transformation.
column_list = ['price', 'sqft_living', 'sqft_living15', 'sqft_lot', 'sqft_lot15']
for col in column_list:
    sns.distplot(data_pred[col])
    plt.title(col)
    plt.show();

[figures: distribution plots of price, sqft_living, sqft_living15, sqft_lot, and sqft_lot15 after transformation]

These features now have a more normal distribution. A bit "peaky", but I think good enough to move on to checking visually for linear relationships with the target.
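
A quick numeric check of the same point (a sketch, simply recomputing the skew of the transformed columns):

# Sketch: recompute skew for the log-transformed columns to confirm the visual impression.
data_pred[['price', 'sqft_living', 'sqft_lot', 'sqft_lot15']].skew()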

Do the features have a linear relationship with the target?

# With jointplots we can look at the transformed histograms and KDE as well as linear relationships with target.
for column in data_pred.drop('price', axis=1):
    sns.jointplot(x=column, y='price',
                  data=data_pred, 
                  kind='reg',
                  space=0.0,
                  label=column,
                  joint_kws={'line_kws':{'color':'green'}})
    #plt.title("Price vs " + column)
    plt.legend()
    plt.show()

[figures: joint plots of each feature vs. price]

Looking at the histograms with density estimates, we can see that the log-transformed variables now have less skewed distributions. From the scatterplots with the best-fit line drawn, I am satisfied that the independent variables have a good enough linear relationship with the target to continue with modeling. The strongest linear relationships are with sqft_living and grade. I am going to treat the discrete variables as continuous (except for waterfront, which is categorical) since they appear to have a linear relationship with the target.

MinMax scaling the data

# Scaling the data. I am choosing to min-max scale so everything will be on the same 0-1 scale.


data_minMax = data_pred.drop(['waterfront', 'price'], axis=1)

for column in data_minMax:
    data_minMax[column] = (data_minMax[column]-min(data_minMax[column]))/(max(data_minMax[column])-min(data_minMax[column]))

data_df = pd.concat([data_minMax, data_pred[['price','waterfront']]], axis=1)    
data_df.head()
bedrooms bathrooms sqft_living sqft_lot floors condition grade yr_built sqft_living15 sqft_lot15 price waterfront
0 0.2 0.066667 0.322166 0.295858 0.0 0.5 0.4 0.478261 0.161934 0.300162 12.309982 0.0
1 0.2 0.233333 0.538392 0.326644 0.4 0.5 0.4 0.443478 0.222165 0.342058 13.195614 0.0
2 0.1 0.066667 0.203585 0.366664 0.0 0.5 0.3 0.286957 0.399415 0.349544 12.100712 0.0
3 0.3 0.333333 0.463123 0.280700 0.0 1.0 0.4 0.565217 0.165376 0.283185 13.311329 0.0
4 0.2 0.200000 0.420302 0.340224 0.0 0.5 0.5 0.756522 0.241094 0.339562 13.142166 0.0
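
The same scaling can also be done with sklearn (a sketch of the equivalent call, shown for reference only; it should give the same result as the manual formula above):

# Sketch: equivalent min-max scaling with sklearn's MinMaxScaler.
cols = data_pred.drop(['waterfront', 'price'], axis=1).columns
scaler = preprocessing.MinMaxScaler()
scaled_check = pd.DataFrame(scaler.fit_transform(data_pred[cols]), columns=cols, index=data_pred.index)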
# sanity check

data_df.describe()
bedrooms bathrooms sqft_living sqft_lot floors condition grade yr_built sqft_living15 sqft_lot15 price
count 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000 21596.000000
mean 0.237183 0.215446 0.454797 0.339315 0.197648 0.602438 0.465795 0.617398 0.273215 0.344802 13.048196
std 0.090411 0.102533 0.117836 0.111877 0.215874 0.162618 0.117322 0.255439 0.117920 0.112878 0.526562
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 11.264464
25% 0.200000 0.166667 0.375546 0.281688 0.000000 0.500000 0.400000 0.443478 0.187747 0.285936 12.682307
50% 0.200000 0.233333 0.455945 0.332938 0.200000 0.500000 0.400000 0.652174 0.247978 0.341712 13.017003
75% 0.300000 0.266667 0.536222 0.374886 0.400000 0.750000 0.500000 0.843478 0.337463 0.380616 13.377006
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 15.856731
# setting waterfront as type category.
data_df['waterfront'] = data_df.waterfront.astype('category')
#making sure everything looks right
data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21596 entries, 0 to 21596
Data columns (total 12 columns):
bedrooms         21596 non-null float64
bathrooms        21596 non-null float64
sqft_living      21596 non-null float64
sqft_lot         21596 non-null float64
floors           21596 non-null float64
condition        21596 non-null float64
grade            21596 non-null float64
yr_built         21596 non-null float64
sqft_living15    21596 non-null float64
sqft_lot15       21596 non-null float64
price            21596 non-null float64
waterfront       21596 non-null category
dtypes: category(1), float64(11)
memory usage: 2.6 MB

Modeling the Data

Running Ordinary Least Squares regression experiments in Statsmodels

I am using Statsmodels because the summary contains a lot of information and the layout is easy to read.

outcome = 'price'
predictors = data_df.drop(['price'], axis=1)
pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum
model = ols(formula= formula, data=data_df).fit()
model.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.658
Model: OLS Adj. R-squared: 0.657
Method: Least Squares F-statistic: 3767.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:34:33 Log-Likelihood: -5221.1
No. Observations: 21596 AIC: 1.047e+04
Df Residuals: 21584 BIC: 1.056e+04
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7254 0.015 766.698 0.000 11.695 11.755
waterfront[T.1.0] 0.5532 0.026 21.375 0.000 0.503 0.604
bedrooms -0.4225 0.031 -13.482 0.000 -0.484 -0.361
bathrooms 0.6303 0.036 17.288 0.000 0.559 0.702
sqft_living 1.2824 0.040 31.807 0.000 1.203 1.361
sqft_lot -0.1437 0.049 -2.956 0.003 -0.239 -0.048
floors 0.1158 0.013 8.900 0.000 0.090 0.141
condition 0.1716 0.014 12.188 0.000 0.144 0.199
grade 2.0690 0.031 65.887 0.000 2.007 2.131
yr_built -0.6877 0.011 -63.881 0.000 -0.709 -0.667
sqft_living15 0.7683 0.030 25.985 0.000 0.710 0.826
sqft_lot15 -0.3664 0.048 -7.646 0.000 -0.460 -0.272
Omnibus: 48.465 Durbin-Watson: 1.965
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.757
Skew: -0.056 Prob(JB): 4.74e-13
Kurtosis: 3.225 Cond. No. 51.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

An adjusted R-squared of 0.657 is pretty good, considering I still have data to add to the model.
All p-values are below 0.05. The skew is close to zero, and a kurtosis of 3.225 is close to the value of 3 expected for a normal distribution.
However, the high Jarque-Bera score tells me that the residuals may not be normally distributed. 'bedrooms', 'sqft_lot', 'yr_built', and 'sqft_lot15' have negative coefficients, which is something to keep an eye on.

Are the residuals consistent with a normal distribution?

# q-q plot to visualize the residuals

residuals = model.resid
fig = sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True)
plt.title("Q-Q plot of Residuals");

[figure: Q-Q plot of Residuals]

The Q-Q plot shows some peakedness, and perhaps some tailing, but for the most part the residuals seem close to a normal distribution. Real-life data won't be perfect!

# Checking model performance with cross-validaton
linreg = LinearRegression()

X = data_df.drop(['price'], axis=1)
y = data_df.price

cv_results = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
cv_results
array([-0.09485161, -0.09953494, -0.09634111, -0.09786468, -0.09079897,
       -0.09538301, -0.09304397, -0.10168181, -0.09753282, -0.09143151])
np.mean(cv_results)
-0.09584644256982355

Results look pretty consistent. A good sign.

# Running recursive feature elimination to look for candidate variables to drop to see if I can improve model.

predictors = data_df.drop(['price', 'waterfront'], axis=1)

linreg = LinearRegression()
selector = RFE(linreg, n_features_to_select = 1)
selector = selector.fit(predictors, data_df["price"])
list(zip(predictors.columns,selector.ranking_))
[('bedrooms', 7),
 ('bathrooms', 3),
 ('sqft_living', 2),
 ('sqft_lot', 8),
 ('floors', 10),
 ('condition', 9),
 ('grade', 1),
 ('yr_built', 4),
 ('sqft_living15', 5),
 ('sqft_lot15', 6)]
# dropping floors (the worst ranked) to see if it improves model. 

data_dr = data_df.drop(['floors'], axis=1)
# no floors
outcome = 'price'
predictors1 = data_dr.drop(['price'], axis=1)
pred_sum1 = "+".join(predictors1.columns)
formula1 = outcome + "~" + pred_sum1
model1 = ols(formula= formula1, data=data_dr).fit()
model1.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.656
Model: OLS Adj. R-squared: 0.656
Method: Least Squares F-statistic: 4121.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:36:48 Log-Likelihood: -5260.7
No. Observations: 21596 AIC: 1.054e+04
Df Residuals: 21585 BIC: 1.063e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7295 0.015 765.928 0.000 11.700 11.760
waterfront[T.1.0] 0.5575 0.026 21.506 0.000 0.507 0.608
bedrooms -0.4345 0.031 -13.853 0.000 -0.496 -0.373
bathrooms 0.6855 0.036 19.048 0.000 0.615 0.756
sqft_living 1.3036 0.040 32.330 0.000 1.225 1.383
sqft_lot -0.1902 0.048 -3.928 0.000 -0.285 -0.095
condition 0.1570 0.014 11.206 0.000 0.130 0.184
grade 2.1125 0.031 67.984 0.000 2.052 2.173
yr_built -0.6655 0.010 -63.435 0.000 -0.686 -0.645
sqft_living15 0.7643 0.030 25.805 0.000 0.706 0.822
sqft_lot15 -0.3904 0.048 -8.145 0.000 -0.484 -0.296
Omnibus: 49.190 Durbin-Watson: 1.962
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.684
Skew: -0.063 Prob(JB): 4.91e-13
Kurtosis: 3.217 Cond. No. 51.3


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
linreg = LinearRegression()

X = data_dr.drop(['price'], axis=1)
y = data_dr.price

cv_results1 = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
np.mean(cv_results1)
-0.0962495949425783

Dropping that variable didn't improve my model (actually slightly hurt it).

# seeing if model improves when the categorical variable 'waterfront' is removed.  It is mostly 0's.
data_dr_water = data_df.drop('waterfront', axis=1)

outcome = 'price'
predictors2 = data_dr_water.drop(['price'], axis=1)
pred_sum2 = "+".join(predictors2.columns)
formula2 = outcome + "~" + pred_sum2
model2 = ols(formula= formula2, data=data_dr_water).fit()
model2.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.650
Model: OLS Adj. R-squared: 0.650
Method: Least Squares F-statistic: 4013.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:42:54 Log-Likelihood: -5447.3
No. Observations: 21596 AIC: 1.092e+04
Df Residuals: 21585 BIC: 1.100e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7157 0.015 758.434 0.000 11.685 11.746
bedrooms -0.4667 0.032 -14.769 0.000 -0.529 -0.405
bathrooms 0.6622 0.037 17.991 0.000 0.590 0.734
sqft_living 1.2975 0.041 31.853 0.000 1.218 1.377
sqft_lot -0.1571 0.049 -3.197 0.001 -0.253 -0.061
floors 0.1209 0.013 9.204 0.000 0.095 0.147
condition 0.1724 0.014 12.118 0.000 0.145 0.200
grade 2.0878 0.032 65.822 0.000 2.026 2.150
yr_built -0.7056 0.011 -65.053 0.000 -0.727 -0.684
sqft_living15 0.7740 0.030 25.906 0.000 0.715 0.833
sqft_lot15 -0.3259 0.048 -6.737 0.000 -0.421 -0.231
Omnibus: 50.905 Durbin-Watson: 1.967
Prob(Omnibus): 0.000 Jarque-Bera (JB): 64.257
Skew: -0.015 Prob(JB): 1.11e-14
Kurtosis: 3.266 Cond. No. 51.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
linreg = LinearRegression()

X = data_dr.drop(['price','waterfront'], axis=1)
y = data_dr.price

cv_results2 = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
np.mean(cv_results2)
-0.09832069899904562

Removing waterfront hurts my model. I will keep the original model and move on to examining lat, long, and zipcode.

Categorical Variables

Location, location, location

I have three independent variables left to examine: lat, long, and zipcode. These are all geographical data indicating where a house is located. I will look at each one to see how best to add it to my model.

# How many unique values?
kc_df.zipcode.nunique()
70
kc_df.lat.nunique()
5033
kc_df.long.nunique()
751

Does location affect price?

# Scatter plot of long and lat color mapped to log-transformed price data.

kc_df.plot(kind="scatter", x="long", y="lat", alpha=0.4, figsize=(16,10),
    c=data_pred["price"], cmap="rainbow", colorbar=True,
    sharex=False)
plt.title("Location and Log-Transformed Price")
plt.show()

[figure: Location and Log-Transformed Price]

Obviously location affects price! You can see how the house prices are higher in certain areas.

# Looking at a hexbin plot for density of locations (just curious)
kc_df.plot.hexbin(x='long', y='lat', figsize=(12,8))
plt.title("Housing Density");

[figure: Housing Density]

Moving on to look at zipcode

kc_df['zipcode'] = kc_df.zipcode.astype('int')
kc_df.plot(kind="scatter", x="long", y="lat", alpha=0.6, figsize=(12,8),
    c='zipcode', cmap="rainbow", colorbar=True,
    sharex=False)
plt.title("Zipcode Locations");

[figure: Zipcode Locations]

I don't see any way to easily group zipcodes. I will do some further exploration, but will probably end up one-hot encoding the feature into categories.

Lat and long gave me a nice graph that showed the importance of location on price, but I don't see an easy way to use them in my linear regression model. Using lat and long as well as zipcode would be redundant anyway, so I will use zipcode in my model.

# looking at Zipcode
kc_df.zipcode.hist(bins = 70, figsize=(10,10), label='zipcode')
plt.legend()
plt.title('Zipcode');

[figure: Zipcode histogram]

Not normally distributed at all. Definitely categorical.

kc_df['zipcode'].value_counts()
98103    601
98038    589
98115    583
98052    574
98117    553
        ... 
98102    104
98010    100
98024     80
98148     57
98039     50
Name: zipcode, Length: 70, dtype: int64
# One-hot encoding 'zipcode' variable and adding to my model.

kc_df['zipcode'] = kc_df.zipcode.astype('str')
zip_dummy = pd.get_dummies(kc_df.zipcode, prefix = 'ZC')

final_df = pd.concat([data_df, zip_dummy], axis=1)
# Checking data to see if everything is how I want it.
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21596 entries, 0 to 21596
Data columns (total 82 columns):
bedrooms         21596 non-null float64
bathrooms        21596 non-null float64
sqft_living      21596 non-null float64
sqft_lot         21596 non-null float64
floors           21596 non-null float64
condition        21596 non-null float64
grade            21596 non-null float64
yr_built         21596 non-null float64
sqft_living15    21596 non-null float64
sqft_lot15       21596 non-null float64
price            21596 non-null float64
waterfront       21596 non-null category
ZC_98001         21596 non-null uint8
ZC_98002         21596 non-null uint8
ZC_98003         21596 non-null uint8
ZC_98004         21596 non-null uint8
ZC_98005         21596 non-null uint8
ZC_98006         21596 non-null uint8
ZC_98007         21596 non-null uint8
ZC_98008         21596 non-null uint8
ZC_98010         21596 non-null uint8
ZC_98011         21596 non-null uint8
ZC_98014         21596 non-null uint8
ZC_98019         21596 non-null uint8
ZC_98022         21596 non-null uint8
ZC_98023         21596 non-null uint8
ZC_98024         21596 non-null uint8
ZC_98027         21596 non-null uint8
ZC_98028         21596 non-null uint8
ZC_98029         21596 non-null uint8
ZC_98030         21596 non-null uint8
ZC_98031         21596 non-null uint8
ZC_98032         21596 non-null uint8
ZC_98033         21596 non-null uint8
ZC_98034         21596 non-null uint8
ZC_98038         21596 non-null uint8
ZC_98039         21596 non-null uint8
ZC_98040         21596 non-null uint8
ZC_98042         21596 non-null uint8
ZC_98045         21596 non-null uint8
ZC_98052         21596 non-null uint8
ZC_98053         21596 non-null uint8
ZC_98055         21596 non-null uint8
ZC_98056         21596 non-null uint8
ZC_98058         21596 non-null uint8
ZC_98059         21596 non-null uint8
ZC_98065         21596 non-null uint8
ZC_98070         21596 non-null uint8
ZC_98072         21596 non-null uint8
ZC_98074         21596 non-null uint8
ZC_98075         21596 non-null uint8
ZC_98077         21596 non-null uint8
ZC_98092         21596 non-null uint8
ZC_98102         21596 non-null uint8
ZC_98103         21596 non-null uint8
ZC_98105         21596 non-null uint8
ZC_98106         21596 non-null uint8
ZC_98107         21596 non-null uint8
ZC_98108         21596 non-null uint8
ZC_98109         21596 non-null uint8
ZC_98112         21596 non-null uint8
ZC_98115         21596 non-null uint8
ZC_98116         21596 non-null uint8
ZC_98117         21596 non-null uint8
ZC_98118         21596 non-null uint8
ZC_98119         21596 non-null uint8
ZC_98122         21596 non-null uint8
ZC_98125         21596 non-null uint8
ZC_98126         21596 non-null uint8
ZC_98133         21596 non-null uint8
ZC_98136         21596 non-null uint8
ZC_98144         21596 non-null uint8
ZC_98146         21596 non-null uint8
ZC_98148         21596 non-null uint8
ZC_98155         21596 non-null uint8
ZC_98166         21596 non-null uint8
ZC_98168         21596 non-null uint8
ZC_98177         21596 non-null uint8
ZC_98178         21596 non-null uint8
ZC_98188         21596 non-null uint8
ZC_98198         21596 non-null uint8
ZC_98199         21596 non-null uint8
dtypes: category(1), float64(11), uint8(70)
memory usage: 4.1 MB
# Dropping one of the zipcode dummy columns to avoid the dummy variable trap; ZC_98004 becomes the baseline category.

final_df = final_df.drop('ZC_98004', axis =1)
# Modeling dataset with multivariate linear regression.
outcome = 'price'
predictors_fin = final_df.drop(['price'], axis=1)
pred_sum_fin = "+".join(predictors_fin.columns)
formula_fin = outcome + "~" + pred_sum_fin
model_fin = ols(formula= formula_fin, data=final_df).fit()
model_fin.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.876
Model: OLS Adj. R-squared: 0.876
Method: Least Squares F-statistic: 1900.
Date: Sun, 01 Mar 2020 Prob (F-statistic): 0.00
Time: 16:45:18 Log-Likelihood: 5749.7
No. Observations: 21596 AIC: -1.134e+04
Df Residuals: 21515 BIC: -1.069e+04
Df Model: 80
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 12.1556 0.016 775.067 0.000 12.125 12.186
waterfront[T.1.0] 0.6583 0.016 41.414 0.000 0.627 0.689
bedrooms -0.1885 0.019 -9.805 0.000 -0.226 -0.151
bathrooms 0.3335 0.022 15.025 0.000 0.290 0.377
sqft_living 1.3788 0.024 56.282 0.000 1.331 1.427
sqft_lot 0.6374 0.030 21.233 0.000 0.579 0.696
floors 0.0325 0.008 3.955 0.000 0.016 0.049
condition 0.1875 0.009 21.423 0.000 0.170 0.205
grade 1.0339 0.020 50.846 0.000 0.994 1.074
yr_built -0.0830 0.008 -9.992 0.000 -0.099 -0.067
sqft_living15 0.5665 0.019 29.827 0.000 0.529 0.604
sqft_lot15 -0.1606 0.030 -5.417 0.000 -0.219 -0.102
ZC_98001 -1.1096 0.015 -76.305 0.000 -1.138 -1.081
ZC_98002 -1.1035 0.017 -64.573 0.000 -1.137 -1.070
ZC_98003 -1.0927 0.015 -71.045 0.000 -1.123 -1.063
ZC_98005 -0.4173 0.018 -23.519 0.000 -0.452 -0.383
ZC_98006 -0.4820 0.013 -36.050 0.000 -0.508 -0.456
ZC_98007 -0.4771 0.019 -25.294 0.000 -0.514 -0.440
ZC_98008 -0.4497 0.015 -29.402 0.000 -0.480 -0.420
ZC_98010 -0.8701 0.022 -40.265 0.000 -0.912 -0.828
ZC_98011 -0.6794 0.017 -39.961 0.000 -0.713 -0.646
ZC_98014 -0.8082 0.020 -40.051 0.000 -0.848 -0.769
ZC_98019 -0.7985 0.017 -46.039 0.000 -0.832 -0.764
ZC_98022 -1.0278 0.016 -62.777 0.000 -1.060 -0.996
ZC_98023 -1.1444 0.013 -84.799 0.000 -1.171 -1.118
ZC_98024 -0.6904 0.024 -29.229 0.000 -0.737 -0.644
ZC_98027 -0.6202 0.014 -44.335 0.000 -0.648 -0.593
ZC_98028 -0.7035 0.015 -45.945 0.000 -0.733 -0.673
ZC_98029 -0.5153 0.015 -34.582 0.000 -0.545 -0.486
ZC_98030 -1.0611 0.016 -67.117 0.000 -1.092 -1.030
ZC_98031 -1.0467 0.016 -67.396 0.000 -1.077 -1.016
ZC_98032 -1.1327 0.020 -57.270 0.000 -1.171 -1.094
ZC_98033 -0.3207 0.014 -23.236 0.000 -0.348 -0.294
ZC_98034 -0.5660 0.013 -42.614 0.000 -0.592 -0.540
ZC_98038 -0.9435 0.013 -71.255 0.000 -0.969 -0.917
ZC_98039 0.1698 0.028 5.996 0.000 0.114 0.225
ZC_98040 -0.2411 0.015 -15.818 0.000 -0.271 -0.211
ZC_98042 -1.0480 0.013 -78.388 0.000 -1.074 -1.022
ZC_98045 -0.7847 0.017 -47.180 0.000 -0.817 -0.752
ZC_98052 -0.4951 0.013 -37.861 0.000 -0.521 -0.469
ZC_98053 -0.5381 0.014 -37.952 0.000 -0.566 -0.510
ZC_98055 -0.9589 0.016 -61.434 0.000 -0.990 -0.928
ZC_98056 -0.7780 0.014 -55.091 0.000 -0.806 -0.750
ZC_98058 -0.9558 0.014 -69.500 0.000 -0.983 -0.929
ZC_98059 -0.7780 0.014 -56.945 0.000 -0.805 -0.751
ZC_98065 -0.6962 0.015 -46.067 0.000 -0.726 -0.667
ZC_98070 -0.7864 0.021 -37.660 0.000 -0.827 -0.745
ZC_98072 -0.6629 0.015 -42.806 0.000 -0.693 -0.633
ZC_98074 -0.5790 0.014 -42.117 0.000 -0.606 -0.552
ZC_98075 -0.5751 0.014 -39.906 0.000 -0.603 -0.547
ZC_98077 -0.7214 0.017 -42.178 0.000 -0.755 -0.688
ZC_98092 -1.0924 0.015 -74.832 0.000 -1.121 -1.064
ZC_98102 -0.1388 0.021 -6.482 0.000 -0.181 -0.097
ZC_98103 -0.2641 0.014 -19.545 0.000 -0.291 -0.238
ZC_98105 -0.1567 0.016 -9.527 0.000 -0.189 -0.124
ZC_98106 -0.7337 0.015 -49.121 0.000 -0.763 -0.704
ZC_98107 -0.2353 0.016 -14.759 0.000 -0.266 -0.204
ZC_98108 -0.7362 0.017 -42.266 0.000 -0.770 -0.702
ZC_98109 -0.0950 0.021 -4.523 0.000 -0.136 -0.054
ZC_98112 -0.0725 0.016 -4.592 0.000 -0.103 -0.042
ZC_98115 -0.2788 0.013 -20.967 0.000 -0.305 -0.253
ZC_98116 -0.3103 0.015 -20.784 0.000 -0.340 -0.281
ZC_98117 -0.2762 0.014 -20.441 0.000 -0.303 -0.250
ZC_98118 -0.6203 0.014 -45.474 0.000 -0.647 -0.594
ZC_98119 -0.1012 0.018 -5.730 0.000 -0.136 -0.067
ZC_98122 -0.2789 0.016 -17.879 0.000 -0.310 -0.248
ZC_98125 -0.5284 0.014 -37.362 0.000 -0.556 -0.501
ZC_98126 -0.5174 0.015 -35.092 0.000 -0.546 -0.488
ZC_98133 -0.6402 0.014 -46.879 0.000 -0.667 -0.613
ZC_98136 -0.3872 0.016 -24.552 0.000 -0.418 -0.356
ZC_98144 -0.4040 0.015 -27.193 0.000 -0.433 -0.375
ZC_98146 -0.7953 0.015 -51.629 0.000 -0.825 -0.765
ZC_98148 -0.9436 0.027 -35.110 0.000 -0.996 -0.891
ZC_98155 -0.6741 0.014 -48.689 0.000 -0.701 -0.647
ZC_98166 -0.7819 0.016 -49.471 0.000 -0.813 -0.751
ZC_98168 -1.0264 0.016 -65.208 0.000 -1.057 -0.996
ZC_98177 -0.4912 0.016 -31.329 0.000 -0.522 -0.460
ZC_98178 -0.9335 0.016 -59.218 0.000 -0.964 -0.903
ZC_98188 -1.0007 0.019 -52.055 0.000 -1.038 -0.963
ZC_98198 -1.0156 0.015 -65.759 0.000 -1.046 -0.985
ZC_98199 -0.2298 0.015 -15.348 0.000 -0.259 -0.200
Omnibus: 1370.410 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6174.051
Skew: -0.107 Prob(JB): 0.00
Kurtosis: 5.611 Cond. No. 116.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Adding zipcode to my model increased the adjusted R-squared from 0.657 to 0.876, so zipcode accounts for roughly an additional 21.9% of the variance in price. That's a lot!

The adjusted R-squared of 0.876 in my final model is pretty good. All the p-values are less than 0.05. The biggest issue I see is the negative coefficients for 'bedrooms' and 'yr_built'. I think that points to interactions between features that may not be obvious. I am keeping them in, using the p-values as justification.
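
One way to probe that suspicion (a sketch only, not something I ran for this project) would be to add an interaction term to the formula and see whether the 'bedrooms' coefficient changes:

# Sketch: adding a hypothetical interaction term between bedrooms and sqft_living.
interaction_formula = formula_fin + "+bedrooms:sqft_living"
interaction_model = ols(formula=interaction_formula, data=final_df).fit()
print(interaction_model.params[['bedrooms', 'bedrooms:sqft_living']])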

Model Validation

# K-folds cross validation of my final model using negative mean squared error
linreg = LinearRegression()

X = final_df.drop(['price'], axis=1)
y = final_df.price

cv_results3 = cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error")
cv_results3
array([-0.03428675, -0.0370451 , -0.03566962, -0.0357842 , -0.03316145,
       -0.0357037 , -0.03419218, -0.03669121, -0.03534284, -0.0309256 ])
np.mean(cv_results3)
-0.03488026597319434
# Coefficient of variation of cross validation results
abs(np.std(cv_results3)/np.mean(cv_results3))*100
4.942280141700788
# Using R-squared (the default scorer for regression)
linreg = LinearRegression()

X = final_df.drop(['price'], axis=1)
y = final_df.price

cv_results4 = cross_val_score(linreg, X, y, cv=10)
cv_results4
array([0.8759028 , 0.87644342, 0.86667069, 0.8741245 , 0.8675731 ,
       0.87361936, 0.87786505, 0.87663803, 0.87589277, 0.86835936])

Adding zipcode also improved my cross validation score. And the results showed little variation.

# Train-test-split as another check on my model's performance.
y = final_df[["price"]]
X = final_df.drop(["price"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
print('Train Mean Squared Error:', train_mse)
print('Test Mean Squared Error:', test_mse)
Train Mean Squared Error: 0.03466335236355768
Test Mean Squared Error: 0.03337063914965873

The test MSE and train MSE are very similar, giving me confidence that the model isn't overfit.

Interpret Results

Final Model Summary

For the final model I used 12 independent variables (80 if you count the 69 dummy variables used for zipcode separately) and 21596 data points. The OLS regression model has an adjusted R-squared of 0.876, which gives me a fairly high level of confidence in the model's ability to predict housing prices. The remaining 0.124 could reflect factors such as features that weren't included in the dataset, sampling error, seasonal market fluctuations, or less tangible factors like a seller's skills or bidding wars that bump up the price. The features with the largest effect on sale price appear to be waterfront, zipcode, grade, sqft_living, and bathrooms.

The p-values of all my independent variables are less than 0.05, so each coefficient is statistically significant at the 95% confidence level. Another factor that gives me confidence in the model's performance is the k-folds cross-validation: there was little variation across the resulting negative mean squared errors (average of -0.035 and a CV of 4.9%). I also performed a train-test split validation, which produced similar results (train MSE 0.035, test MSE 0.033), showing I haven't overfit the model.

Interpreting Coefficients

Here is a closer look at some of the coefficients (the dependent variable, 'price', is log-transformed):

The independent variable 'sqft_living' was log-transformed and then min-max scaled, so its coefficient of 1.3788 applies to the 0-1 scaled variable. Dividing by the range of log(sqft_living) (about 3.6) gives an elasticity of roughly 0.38: a 1% increase in living area corresponds to roughly a 0.4% increase in sale price, with everything else remaining unchanged.

The categorical feature 'waterfront' has a coefficient of 0.6583. So if the property is on the waterfront, we expect the price to be about 93% higher (exp(0.6583) ≈ 1.93), with everything else remaining unchanged.

The min-max scaled feature 'grade' has a coefficient of 1.0339. A grade of 4 becomes 0.1 after min-max scaling and a grade of 8 becomes 0.5, so going from a grade of 4 to a grade of 8 corresponds to an expected increase in sale price of about 51% (exp(1.0339 × 0.4) ≈ 1.51), with everything else remaining unchanged.
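
Because the target is log(price), the arithmetic behind these percentages is just an exp() back-transformation. A small sketch of the calculations above:

# Sketch: turning log-price coefficients into approximate % effects on price.
waterfront_pct = (np.exp(0.6583) - 1) * 100                  # ~93% premium for waterfront
grade_4_to_8_pct = (np.exp(1.0339 * (0.5 - 0.1)) - 1) * 100  # ~51% going from grade 4 to grade 8
sqft_elasticity = 1.3788 / (np.log(13540) - np.log(370))     # ~0.38% price change per 1% more living area
print(waterfront_pct, grade_4_to_8_pct, sqft_elasticity)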
