amac-lfc / airbnb

Real Estate Price Prediction using Linear Regression and XGBoost. Geo-Spatial analysis using OSMnx & OpenRouteService and visualization with Folium.

Airbnb Price Prediction Project (Chicago)

Table of Contents

  1. Foreword
  2. Installation
  3. File Descriptions
  4. Results
  5. Resources

Foreword

This project is part of the James Rocco Research Scholarship provided by Lake Forest College and was carried out under the supervision of Prof. Arthur Bousquet. The main idea is based on an article by Graciela Carrillo published on Towards Data Science.

Installation

Using Anaconda, create a new environment from environment.yml:

conda env create --file environment.yml

File Descriptions (Follow in order)

To conveniently read all the notebooks follow this link.

Notebooks/:

  1. EDA.ipynb (Exploratory Data Analysis) - A brief overview and analysis of raw data
  2. kepler_map.ipynb - Visualization of the whole dataset using Kepler.gl
  3. data_preprocessing.ipynb - Preprocessing the data for future uses (outlier detection, feature selection, handling missing data, etc.)
  4. regressions.ipynb - Development of initial price prediction models
  5. cta_mapping.ipynb - Visualization of the output of geo_loc.py using Folium maps (map of routes to CTA stations within the radius and shortest-path detection)
  6. model.ipynb - Final model for price prediction that compares the results on datasets with and without the newly produced variables

Scripts:

  1. geo_loc.py - A Python script for geospatial analysis that creates 5 new variables using libraries such as OSMnx and OpenRouteService:
    • Restaurants - Number of restaurants within a 1000-meter radius
    • Cafes - Number of cafes in the radius
    • Bars - Number of bars in the radius
    • CTA - Number of CTA (Chicago Subway) stations in the radius
    • time_to_cta_minutes - Time in minutes to the nearest CTA station (can be out of the radius)
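The radius-based counts above can be illustrated without any network access. The sketch below is not the code in geo_loc.py, which relies on OSMnx and OpenRouteService. It assumes the amenity coordinates have already been fetched and swaps in a plain haversine (great-circle) distance check. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within_radius(listing, places, radius_m=1000):
    """Count the places lying within radius_m meters of the listing."""
    lat, lon = listing
    return sum(
        1 for plat, plon in places
        if haversine_m(lat, lon, plat, plon) <= radius_m
    )


# Hypothetical listing and restaurant coordinates (not taken from the dataset)
listing = (41.8800, -87.6300)
restaurants = [
    (41.8810, -87.6310),  # roughly 140 m away
    (41.8850, -87.6300),  # roughly 560 m away
    (41.9000, -87.6300),  # roughly 2.2 km away, outside the radius
]

print(count_within_radius(listing, restaurants))  # → 2
```

Straight-line distance is only an approximation of "within the radius"; walking time to the nearest station, by contrast, needs an actual routing service.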

Project Description and Results

The main goal of this project is to build a model that predicts the price of a listing from its independent variables (features). The data for both the dependent variable (price) and the independent variables is available through Insideairbnb.com. To get a general understanding of the data used for this project, take a look at the map below, where the data is projected onto a map of Chicago. Listings (i.e. rows in the dataset) are grouped into hexagons whose height represents the number of listings and whose color represents the price range.

Kepler

The accuracy, i.e. how well the model performs, is measured by R^2, a metric commonly used for regression models that represents the proportion of the variance in the dependent variable that is explained by the independent variables. To further improve the accuracy and add originality to the project, 5 new variables are created by analyzing the area surrounding each listing: counting chosen types of locations within a fixed radius and calculating the walking time to the nearest subway station.
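As a minimal illustration of the metric, R^2 can be computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. The function name and the sample prices below are hypothetical and are not taken from the notebooks.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Hypothetical nightly prices
prices = [100.0, 150.0, 200.0, 250.0]

perfect = list(prices)    # a model that predicts every price exactly
baseline = [175.0] * 4    # a model that always predicts the mean price

print(r_squared(prices, perfect))   # → 1.0
print(r_squared(prices, baseline))  # → 0.0
```

A score of 0 therefore means "no better than always predicting the mean price", and 1 means a perfect fit. scikit-learn's `r2_score` computes the same quantity; whether the notebooks call it directly is not stated here.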

Since locations and routes can be visualized, the map below shows the routes to all subway stations within 1000 m of a listing (the circle), with the shortest route colored green. Red dots represent subway stations that lie outside the radius.
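The styling logic of such a map can be sketched as follows. This assumes the walking durations have already been obtained from a routing service and are given in seconds. The station names, the durations, the `in_radius` flag, and the blue color for the remaining routes are hypothetical; only green for the shortest route and red for stations outside the radius come from the description above.

```python
def style_stations(stations):
    """Map each station name to (color, walking time in minutes or None)."""
    in_range = [s for s in stations if s["in_radius"]]
    # The shortest route is chosen only among stations inside the radius
    shortest = min(in_range, key=lambda s: s["duration_s"]) if in_range else None

    styled = {}
    for s in stations:
        if not s["in_radius"]:
            styled[s["name"]] = ("red", None)  # drawn as a dot, no route
        else:
            color = "green" if s is shortest else "blue"
            styled[s["name"]] = (color, round(s["duration_s"] / 60, 1))
    return styled


# Hypothetical stations around one listing
stations = [
    {"name": "Station A", "duration_s": 420, "in_radius": True},
    {"name": "Station B", "duration_s": 660, "in_radius": True},
    {"name": "Station C", "duration_s": 1500, "in_radius": False},
]

for name, (color, minutes) in style_stations(stations).items():
    print(name, color, minutes)
```

The resulting colors could then be passed to whatever drawing calls the mapping notebook uses.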

Map

Here is the list of variables used to predict the price of a listing:

Numerical variables:

  • Accommodates - Number of people a listing can accommodate
  • Bathrooms - Number of bathrooms
  • Minimum_nights - Minimum amount of nights a listing should be booked for
  • Maximum_nights - Maximum amount of nights a listing can be booked for
  • Availability_30 - Number of days a listing is available in the next 30 days
  • Number_of_reviews - Total number of reviews
  • Number_of_reviews_ltm - Number of reviews within the last twelve months
  • Restaurants, Bars, Cafes, Universities - Number of places of specified type within 1000 meters from the listing (4 different variables)
  • Time_to_cta_minutes - Time in minutes it takes to walk to the nearest subway (in Chicago, CTA) station, whether or not it lies within the 1000-meter radius

Categorical variables:

  • Neighbourhood_cleansed - name of the neighborhood a listing is located in
  • Property_type - type of property a listing is located in (e.g. Apartment, Condominium, House, etc.)
  • Bed_type - type of bed present in a listing
  • Cancellation_policy - type of cancellation policy chosen by the host

As the file descriptions suggest, the model was built step by step. To understand the model and the thought process behind it, read through the notebooks in order.

To achieve the best possible result, I tried various models. Their R^2 scores are:

| Model | Train R^2 (no new variables) | Test R^2 (no new variables) | Train R^2 (with new variables) | Test R^2 (with new variables) |
|---|---|---|---|---|
| Linear Regression | 0.4298 | 0.4607 | 0.4387 | 0.411 |
| Lasso Regression | 0.4294 | 0.462 | 0.4375 | 0.4163 |
| Ridge Regression | 0.4296 | 0.4615 | 0.4384 | 0.4129 |
| Lasso Regression with Polynomial Features | 0.4918 | 0.4867 | 0.5036 | 0.4503 |
| Ridge Regression with Polynomial Features | 0.5143 | 0.4925 | 0.5224 | 0.453 |
| XGBoost | 0.6428 | 0.5391 | 0.6742 | 0.5445 |
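To see at a glance how the new variables changed the test-set scores, the reported test R^2 values can be compared directly. The snippet below is only arithmetic on the numbers reported above. The variable names are hypothetical.

```python
models = [
    "Linear Regression",
    "Lasso Regression",
    "Ridge Regression",
    "Lasso Regression with Polynomial Features",
    "Ridge Regression with Polynomial Features",
    "XGBoost",
]

# Test R^2 values copied from the results above
test_r2_without = [0.4607, 0.462, 0.4615, 0.4867, 0.4925, 0.5391]
test_r2_with = [0.411, 0.4163, 0.4129, 0.4503, 0.453, 0.5445]

# Change in test R^2 after adding the new variables (positive = improvement)
deltas = {
    model: round(with_new - without_new, 4)
    for model, without_new, with_new in zip(models, test_r2_without, test_r2_with)
}
improved = [model for model, delta in deltas.items() if delta > 0]

for model, delta in deltas.items():
    print(f"{model}: {delta:+.4f}")
print("Improved on the test set:", improved)
```

On these numbers, only XGBoost's test R^2 rises (by 0.0054) when the new variables are added. The five linear variants all drop by roughly 0.04 to 0.05 on the test set, even though their train R^2 rises slightly.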

A good way to see how the models differ with and without the new variables is to look at the feature importances computed by XGBoost. Categorical variables are excluded to avoid overcrowding the plots.

As we can see in the bottom plot, the new features have high importance (higher than some of the initial features).

no_new new

Resources

Package Docs

Shapely

Point in Polygon

OSMnx

OSMnx Example Notebooks

openrouteservice-py

Folium

KeplerGL for Jupyter Notebook

DataCamp

Pandas Foundations

Unsupervised Learning 

Supervised Learning 

Youtube Videos

Ridge Regression

Lasso Regression

Machine Learning Tutorial Playlist

Theory behind PCA

Complete Machine Learning Course by Andrew NG

Medium Articles

Airbnb Price Prediction Using Linear Regression (Scikit-Learn and StatsModels)

Ridge and Lasso Regression: L1 and L2 Regularization

Predicting Airbnb prices with machine learning and location data

Exploring Airbnb prices in London: which factors influence price?

How to calculate Travel time for any location in the world

Find and plot your optimal path using OSM, Plotly and NetworkX in Python

Loading Data from OpenStreetMap with Python and the Overpass API

How to Create Eye-Catching Maps With Python and Kepler.gl

Measuring pedestrian accessibility
