SanikaVT / Real-Estate

Machine learning Regression problem with easy understandable solutions

Home Page:https://github.com/SanikaVT/Real-Estate_CaseStudy

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

2.Real Estate(Regression Problem)

1. Business Problem

1.1 Problem Context

Our client is a large Real Estate Investment Trust (REIT).

  • They invest in houses, apartments, and condos(complex of buildings) within a small county in New York state.
  • As part of their business, they try to predict the fair transaction price of a property before it's sold.
  • They do so to calibrate their internal pricing models and keep a pulse on the market.

1.2 Problem Statement

The REIT has hired us to find a data-driven approach to valuing properties.

  • They currently have an untapped dataset of transaction prices for previous properties on the market.
  • The data was collected in 2016.
  • Our task is to build a real-estate pricing model using that dataset.
  • If we can build a model to predict transaction prices with an average error of under US Dollars 70,000, then our client will be very satisfied with the our resultant model.

1.3 Business Objectives and Constraints

  • Deliverable: Trained model file
  • Win condition: Avg. prediction error < $70,000
  • Model Interpretability will be useful
  • No latency requirement

2. Machine Learning Problem

2.1 Data Overview

For this project:

  1. The dataset has 1883 observations in the county where the REIT operates.
  2. Each observation is for the transaction of one property only.
  3. Each transaction was between $200,000 and $800,000.

Target Variable

  • 'tx_price' - Transaction price in USD

Features of the data:

Public records:

  • 'tx_year' - Year the transaction took place
  • 'property_tax' - Monthly property tax
  • 'insurance' - Cost of monthly homeowner's insurance

Property characteristics:

  • 'beds' - Number of bedrooms
  • 'baths' - Number of bathrooms
  • 'sqft' - Total floor area in squared feet
  • 'lot_size' - Total outside area in squared feet
  • 'year_built' - Year property was built
  • 'active_life' - Number of gyms, yoga studios, and sports venues within 1 mile
  • 'basement' - Does the property have a basement?
  • 'exterior_walls' - The material used for constructing walls of the house
  • 'roof' - The material used for constructing the roof

Location convenience scores:

  • 'restaurants' - Number of restaurants within 1 mile
  • 'groceries' - Number of grocery stores within 1 mile
  • 'nightlife' - Number of nightlife venues within 1 mile
  • 'cafes' - Number of cafes within 1 mile
  • 'shopping' - Number of stores within 1 mile
  • 'arts_entertainment' - Number of arts and entertainment venues within 1 mile
  • 'beauty_spas' - Number of beauty and spa locations within 1 mile
  • 'active_life' - Number of gyms, yoga studios, and sports venues within 1 mile

Neighborhood demographics:

  • 'median_age' - Median age of the neighborhood
  • 'married' - Percent of neighborhood who are married
  • 'college_grad' - Percent of neighborhood who graduated college

Schools:

  • 'num_schools' - Number of public schools within district
  • 'median_school' - Median score of the public schools within district, on the range 1 - 10

2.2 Mapping business problem to ML problem

2.2.1 Type of Machine Learning Problem

It is a regression problem, where given the above set of features, we need to predict the transaction price of the house.

2.2.2 Performance Metric (KPI)

Since it is a regression problem, we will use the following regression metrics:

  • Root Mean Squared Error (RMSE)
  • R-squared
  • Mean Absolute Error

3.Feature Engineering

3.1.Features Created:

two_and_two:
  • An indicator variable to flag properties with 2 beds and 2 baths and name it 'two_and_two'.
old_properties:
  • People might also not take much interest in old properties.
  • so, this feature gives properties built before 1980.
tax_and _insurance:
  • Both tax and insurance are monthly quantities property holder needs to pay monthly
during_recession:
  • A new feature called 'during_recession' is created to indicate if a transaction falls between 2010 and 2013.
propery_age:
  • property_age' denotes the age of the property when it was sold and not how old it is today, since we want to predict the price at the time when the property is sold.
school_score:
  • A school score feature is created as num_schools * median_school
One hot Encoding:
  • Machine learning algorithms cannot directly handle categorical features. Specifically, they cannot handle text values.
  • Therefore, we need to create dummy variables for our categorical features.
  • Dummy variables are a set of binary (0 or 1) features that each represent a single class from a categorical feature.

3.2.Features Removed:

  • Features such as property_tax, year_built, tx_year, insurance are removed.

About

Machine learning Regression problem with easy understandable solutions

https://github.com/SanikaVT/Real-Estate_CaseStudy


Languages

Language:Jupyter Notebook 100.0%