anewt225 / mod_4_project

Flatiron Mod 4 Time-Series Real Estate Project

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Real Estate Consulting Proposal

Flatiron Mod 4 Project

Overview

This project, a consulting proposal for a real estate investment firm, uses Time Series Analysis to present the top 5 zip codes for the firm to invest in in the US.

ReadMe Navigation

  1. Repository Navigation

  2. Business Understanding

  3. Data Understanding

  4. Predictive Analysis

    1. Model Performance
    2. Prediction
  5. Conclusion

    1. Recommendations
    2. Areas for Growth
  6. Project Info


Repository Navigation

Business Understanding

The business context for this project is relatively straightforward - select 5 top zip codes for investment within the United States, based on a dataset given from Zillow.

GOAL: Predict top 5 zip code in United States for real estate firm to invest in.

To further clarify this goal, we use a risk-adjusted Return on Investment (ROI) as our metric for "top":

  • ROI: Predicted Price Increase / Initial Investment
  • Risk-adjustment: After some exploratory data analysis (EDA), we define risk-adjusted by limiting the range of zip codes to those without large ranges of error in the model prediction.

Data Understanding

The dataset for this project came from Zillow, with the following data from 1996 through 2018:

  • RegionID (Zip Code)
  • RegionName (Zillow-defined code)
  • City
  • State
  • Metro (Metropolitan Area)
  • CountyName
  • Size Rank (Ordinal City size categories)
  • Monthly Median Prices

An overview of the dataset:

import pandas as pd
df = pd.read_csv('data/raw/zillow_data.csv')
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
RegionID RegionName City State Metro CountyName SizeRank 1996-04 1996-05 1996-06 ... 2017-07 2017-08 2017-09 2017-10 2017-11 2017-12 2018-01 2018-02 2018-03 2018-04
0 84654 60657 Chicago IL Chicago Cook 1 334200.0 335400.0 336500.0 ... 1005500 1007500 1007800 1009600 1013300 1018700 1024400 1030700 1033800 1030600
1 90668 75070 McKinney TX Dallas-Fort Worth Collin 2 235700.0 236900.0 236700.0 ... 308000 310000 312500 314100 315000 316600 318100 319600 321100 321800
2 91982 77494 Katy TX Houston Harris 3 210400.0 212200.0 212200.0 ... 321000 320600 320200 320400 320800 321200 321200 323000 326900 329900
3 84616 60614 Chicago IL Chicago Cook 4 498100.0 500900.0 503100.0 ... 1289800 1287700 1287400 1291500 1296600 1299000 1302700 1306400 1308500 1307000
4 93144 79936 El Paso TX El Paso El Paso 5 77300.0 77300.0 77300.0 ... 119100 119400 120000 120300 120300 120300 120300 120500 121000 121500

5 rows × 272 columns

The nationwide median price looked like this:

Nationwide Median

Data Analysis

  • Given the monthly price measurements, we used Time Series Analysis to gain a better understanding for the patterns in the data that would impact our predictions.

  • Additionally, we restrict our model construction to data after the 2008 recession, since the variation due to the anomaly is already known, and our model will perform better without having to account for such a huge fluctuation.

  • Finally, given computational resource restrictions, we limit our dataset to the 60 zip codes with the largest ROI in a 5-year rolling window.

Predictive Analysis

Time Series Analysis

By using time-series analysis, we can use time-linked historical data to discover which areas have the highest ROI, which is a time-delta linked indicator. This particular method enables median price value prediction.

In order to use this model, we must eliminate noise in the data in order to make accurate predictions, based on the assumptions of the model:

  1. Stationarity - Ensures that the distribution of data does not change over time

  2. Seasonality - Adjusting for regular fluctuations based on a fixed interval (e.g., higher prices in spring, lower prices in fall)

  3. Autocorrelation - Adjusts for covariance of time-series with itself, based on its variables.

  4. Trend - Adjusts for long-term trends such as overall increase, overall decrease.

The goal of these limitations is to remove the time-based impacts upon the median price to discover any potential underlying patterns in the data.

Performance Metric

Model Selection

Conclusions

Recommendations:

  1. Higher Median Investment

  2. Lower Median Investment

Areas for Growth:

Include Adjacent Data Resources

With additional time, we would incorporate other ways to measure "top 5" and ROI, including usin other sources like:

  • Zillow Rent Data
  • Competitor Median Prices
  • AirBnB rental increases/returns

Improve Model

The model was significantly limited due to time and resource constraints. With more of each, we could:

  • Fit model to all zip codes, not just limited set
  • Consider grouping zip codes and areas into better tiers
  • Train model on nationwide data for the ability to "drill-down" more precisely

Extend Time Frame for Analysis

With more data, we could predict longer term trends instead of just the limited 5-year period we selected.

Project Info

Contributors: Alexander Newton, Jake Miller Brooks

Languages : Python

Tools/IDE : Git, Command Line (Windows), Anaconda, Jupyter Notebook / Jupyter Lab, Google Slides

Libraries : numpy, pandas, matplotlib, seaborn, scikit-learn, statsmodels

Duration : August 2020 Last Update: 08.24.2020

About

Flatiron Mod 4 Time-Series Real Estate Project


Languages

Language:Jupyter Notebook 99.9%Language:Python 0.1%