m3redithw / Home-Value-Prediction

Using a machine learning regression model to predict property tax assessed value


Zillow Home Value Prediction by Meredith Wang

Python Pandas NumPy Matplotlib seaborn plotly sklearn SciPy GeoPandas


Buying or selling property today is hectic and expensive. Efficiently predicting property prices for real estate customers, with respect to their budgets and priorities, is essential.

In this project, we use statistical analysis to identify the key drivers of property value for single family properties, develop an ML regression model to predict property tax assessed value, and provide recommendations for making better home-value predictions.

🏠 Project Goals

▪️ Find the key drivers of property value for single family properties in 2017.

▪️ Construct an ML Regression model that predicts property tax assessed values ('assessed_value') of Single Family Properties using attributes of the properties.

▪️ Deliver a report that the data science team can read through and replicate, understanding what steps were taken, why, and what the outcome was.

▪️ Make recommendations on what works or doesn't work in predicting these homes' values.

📝 Initial Questions

▪️ What is associated with assessed_value the most?

▪️ Is having a pool positively associated with assessed value?

▪️ Are the number of bedrooms and bathrooms associated with assessed value?

▪️ Is assessed_value significantly different across 3 different counties?

📂 Data Dictionary

| Variable | Value | Meaning |
| --- | --- | --- |
| Latitude | Float | Latitude of the middle of the parcel |
| Longitude | Float | Longitude of the middle of the parcel |
| Bedrooms | Integer ranging from 1-6 | Number of bedrooms in home |
| Bathrooms | Float ranging from 0.5-6.5 | Number of bathrooms in home, including fractional bathrooms |
| Square Feet | Float | Calculated total finished living area of the home |
| Lot Size | Float | Area of the lot in square feet |
| Age | Integer | Age of the property in 2017, calculated from the year the principal residence was built |
| Assessed Value | Float | The total tax assessed value of the parcel |
| Tax Amount | Float | The total property tax assessed for that assessment year |
| County | 1) Ventura 2) Los Angeles 3) Orange | County in which the property is located |

📊 Data Context

We have data on 47,128 single family properties, with 5 categorical variables and 5 numerical variables.

🪧 Project Plan / Process

1️⃣ Data Acquisition

Gather data from mySQL database
  • Create env.py file to establish connection to mySQL server

  • Use zillow database in the mySQL server

  • Read data dictionary and extract meaningful columns

  • Write query to join useful tables to gather all data about the houses in the region: properties_2017, predictions_2017, propertylandusetype

    SELECT 
        CONCAT(SUBSTRING(longitude, 1, 4), ',', SUBSTRING(longitude, 5, 10)) AS longitude,
        CONCAT(SUBSTRING(latitude, 1, 2), ',', SUBSTRING(latitude, 3, 10)) AS latitude,
        bedroomcnt AS bedrooms,
        bathroomcnt AS bathrooms,
        calculatedfinishedsquarefeet AS square_feet,
        lotsizesquarefeet AS lot_size,
        poolcnt AS has_pool,
        CONCAT('0', fips) AS fips_code,
        (2017 - yearbuilt) AS age,
        taxvaluedollarcnt AS assessed_value,
        taxamount AS tax_amount
    FROM properties_2017 AS p
        JOIN predictions_2017 AS pred USING (parcelid)
        JOIN propertylandusetype AS ptype USING (propertylandusetypeid)
    WHERE ptype.propertylandusedesc LIKE '%%Single%%'
        AND pred.transactiondate LIKE '2017%%';
  • Create acquire.py with a user-defined function get_zillow_data() to gather data from mySQL

    def get_zillow_data():
        '''Return zillow data, reading from the local CSV cache if it exists,
        otherwise querying mySQL and caching the result.'''
        if os.path.isfile('zillow.csv'):
            df = pd.read_csv('zillow.csv', index_col=0)
        else:
            df = new_zillow_data()
            df.to_csv('zillow.csv')
        return df
  • Import acquire.py

  • Test acquire function

  • Call the function and store the table as a dataframe

    df = acquire.get_zillow_data()
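The `new_zillow_data()` helper that `get_zillow_data()` falls back to is not shown; a minimal sketch of what it might look like, assuming env.py supplies `user`, `host`, and `password` (all names below are hypothetical, not the repo's actual implementation):

```python
import pandas as pd

def get_db_url(db, user, host, password):
    """Build a SQLAlchemy-style MySQL connection string from env.py credentials."""
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

def new_zillow_data(query, user='user', host='localhost', password='password'):
    """Run the acquisition query against the zillow database, returning a dataframe."""
    url = get_db_url('zillow', user, host, password)
    return pd.read_sql(query, url)
```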

2️⃣ Data Preparation

Data Cleaning
  • Missing values:

    • Null values in the has_pool column are replaced with 0
      df.has_pool = df.has_pool.replace(np.nan, 0)
    • Other null values are dropped
      df = df.dropna()
  • Data types: float is converted to int datatype

    df['fips_code'] = df['fips_code'].astype(int)
    df['age'] = df['age'].astype(int)
  • Data mapping

    • Created a new county column with the county name corresponding to fips_code

    • Created new bins bedrooms_size and bathrooms_size for bedrooms and bathrooms

         df['bedrooms_size'] = pd.cut(df.bedrooms, bins = [0,2,4,6],
                        labels = ['small', 'medium', 'large'])
         df['bathrooms_size'] = pd.cut(df.bathrooms, bins = [0,2.5,4.5,6.5],
                        labels = ['small', 'medium', 'large'])
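The fips_code-to-county mapping could be sketched as below (6037, 6059, and 6111 are the standard California FIPS codes for these three counties; the exact code in prepare.py may differ):

```python
import pandas as pd

# Standard California FIPS codes for the three counties in this dataset
county_map = {6037: 'Los Angeles', 6059: 'Orange', 6111: 'Ventura'}

df = pd.DataFrame({'fips_code': [6037, 6059, 6111, 6037]})
df['county'] = df.fips_code.map(county_map)
```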
      
  • Dummy variables:

    • Created dummy variables for categorical feature county, bedrooms_size, bathrooms_size
    • Concatenated all dummy variables onto original dataframe
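A sketch of the dummy-variable step on a toy frame (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'county': ['Ventura', 'Orange', 'Los Angeles'],
                   'bedrooms_size': ['small', 'medium', 'large']})

# One dummy column per category level, concatenated back onto the frame
dummies = pd.get_dummies(df[['county', 'bedrooms_size']])
df = pd.concat([df, dummies], axis=1)
```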
  • Outliers

    • General rule for handling outliers:

      • Upper bound: Q3 + 1.5 * IQR
      • Lower bound: Q1 - 1.5 * IQR

      Note: each feature has minor adjustment based on data distribution

    • Outliers for each feature are dropped

      df = df[df.bedrooms <= 6]
      df = df[df.bedrooms >= 1]
      
      df = df[df.bathrooms <= 6.5]
      df = df[df.bathrooms >= 0.5]
      
      df = df[df.square_feet <= 7982]
      df = df[df.square_feet >= 493]
      
      df = df[df.lot_size <= 152597]
      df = df[df.lot_size >= 787]
      
      df = df[df.assessed_value <= 2520956]
      df = df[df.assessed_value >= 45366]
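The IQR rule above can be expressed as a small helper (a sketch; the hard-coded cutoffs in the repo reflect the per-feature adjustments mentioned in the note):

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return (lower, upper) outlier bounds for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

s = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(s)
kept = s[(s >= lower) & (s <= upper)]  # the extreme value 100 is dropped
```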
  • Create function prep_zillow to clean and prepare data with steps above

  • Import prepare.py

  • Test prepare function

  • Call the function, and store the cleaned data in the form of dataframe

Data Splitting
  • Create function split() to split data into train, validate, test

  • Test split function

  • Check the size of each dataset

    train.shape, validate.shape, test.shape
  • Call the function, and store the 3 data samples as separate dataframes

    train, validate, test = prepare.split(df)
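A common two-stage implementation of `split()` looks like the sketch below (the exact proportions and random_state in prepare.py may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split(df):
    """Split into ~56% train, ~24% validate, 20% test."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=123)
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=123)
    return train, validate, test

df = pd.DataFrame({'x': range(100)})
train, validate, test = split(df)
```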

3️⃣ Exploratory Analysis

  • Ask questions to identify the key features associated with property assessed value

  • Explore each feature's correlation with assessed value

  • Use visualizations to better understand the relationships between features
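One quick way to explore each feature's correlation with the target, shown on a toy frame standing in for the zillow data:

```python
import pandas as pd

# Illustrative values only, not the real dataset
df = pd.DataFrame({
    'square_feet': [800, 1200, 2000, 2600],
    'age': [60, 45, 20, 5],
    'assessed_value': [150_000, 240_000, 410_000, 520_000],
})

# Pearson correlation of every numeric feature with the target, strongest first
corr = df.corr()['assessed_value'].drop('assessed_value').sort_values(ascending=False)
```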

4️⃣ Statistical Testing & Modeling

  • Conduct T-Test for categorical variable vs. numerical variable

  • Conduct Chi^2 Test for categorical variable vs. categorical variable

  • Conclude hypothesis and address the initial questions
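Both tests are available in scipy; a sketch on synthetic data (the group values and contingency table below are made up for illustration):

```python
import numpy as np
from scipy import stats

alpha = 0.05
rng = np.random.default_rng(0)

# T-test (numerical vs. binary categorical): compare assessed value with / without pool
with_pool = rng.normal(500_000, 50_000, 200)
without_pool = rng.normal(450_000, 50_000, 200)
t, p = stats.ttest_ind(with_pool, without_pool, equal_var=False)

# Chi^2 test (categorical vs. categorical): independence of an observed 2x2 table
observed = np.array([[30, 10], [15, 45]])
chi2, p2, dof, expected = stats.chi2_contingency(observed)
```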

5️⃣ Modeling Evaluation

  • Create a multiple regression model and use Recursive Feature Elimination (RFE) to select features

  • Find the number of features that generates the best performance (evaluated using Root Mean Squared Error, RMSE)

  • Generate a polynomial model: fit and transform the train dataset into polynomial features

  • Find the degree that generates the best performing model (evaluated using RMSE)

  • Create lasso-lars model object, fit the model to our training data, and use the model to make predictions

  • Create generalized linear model (TweedieRegressor) and fit train dataset

  • Pick the top 3 models among all the models and evaluate performance on validate dataset

  • Store model performance on train and validate in model_performance.csv

  • Pick the best-performing model (lowest RMSE) and evaluate it on the test dataset
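The RFE-plus-RMSE loop above can be sketched with scikit-learn on synthetic data (a stand-in, not the project's actual features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the scaled zillow features
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10, random_state=42)

# Keep the 3 strongest features, then fit and score with RMSE on train
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
X_sel = rfe.fit_transform(X, y)
model = LinearRegression().fit(X_sel, y)
rmse = np.sqrt(mean_squared_error(y, model.predict(X_sel)))
```

In practice this would be repeated for each candidate feature count (and for each polynomial degree), keeping whichever setting yields the lowest validate RMSE.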

🔁 Steps to Reproduce

  • You will need an env.py file that contains the hostname, username, and password for the mySQL server that hosts the zillow database. Store that env file locally in the repository.
  • Clone my repo (including the imports.py, acquire.py, prepare.py)
  • Confirm .gitignore is hiding your env.py file
  • Libraries used are pandas, matplotlib, seaborn, plotly, sklearn, scipy
  • Follow instructions in zillow_eda workbook and README file
  • Good to run final report 😸

🔑 Key Findings

▪️ The top 5 drivers of property assessed value are:

  • square feet

  • size of bedrooms and bathrooms

  • lot size

  • county

  • has pool or not

▪️ Square feet has a strong positive relationship with property tax assessed value

▪️ Property tax assessed value is dependent on the number of bedrooms and bathrooms.

▪️ The mean property tax assessed values of Los Angeles, Orange, and Ventura counties are not all equal.

  • Los Angeles county has the lowest property tax assessed value on average
  • Orange county has the highest property tax assessed value on average

▪️ Properties with a pool have a higher property tax assessed value on average than properties without a pool.

▪️ The age of the property (2017 - year_built) has a negative medium correlation with property tax assessed value.

▪️ The final machine learning model (polynomial features, degree 3) is expected to predict housing prices within $202,015 of the actual value on average on future unseen data.

🔆 Recommendations

▪️ To better understand the relationship between features and the target variable, we need more information on the properties' location.

▪️ To improve the model's accuracy, we need more precise latitude and longitude data to pinpoint each property.

🔜 Next Steps

▪️ Collect more geographic data on the property (e.g. local schools, surrounding properties, distance from downtown, city population, etc.)

▪️ Develop machine learning models with higher accuracy (lower RMSE) using this additional data, and make better predictions.

▪️ Collect data on previous years to analyze the general trend of each area, and determine what features drive the housing prices the most.
