The code was developed using the Anaconda distribution of Python, version 3. Python libraries used are numpy, pandas, datetime, matplotlib, seaborn, sklearn, scipy, statsmodels, random, PIL, requests, collections, and pickle.
In this project, I used open source Chicago Airbnb data (http://insideairbnb.com/get-the-data.html) to answer 4 business questions:
- Q1: How does listing information (description words, price per person per night) differ among neighborhoods?
- Q2: Is there a general upward trend in both new Airbnb listings and total Airbnb visitors to Chicago?
- Q3: What are the busiest times of a year to visit Chicago? By how much do prices spike?
- Q4: What are the factors that explain the listing price the most?
These questions were answered using statistics, regression, and visualization.
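As an illustration of the kind of feature behind Q1, here is a minimal sketch of computing price per person per night and comparing it across neighborhoods. The toy data, the column names (`neighbourhood`, `price`, `accommodates`), and the helper `price_per_person` are assumptions made for this example and are not taken from the notebooks.

```python
import pandas as pd

# Toy stand-in for a listings table; values and column names are assumed.
listings = pd.DataFrame({
    "neighbourhood": ["Loop", "Loop", "Lake View", "Lake View"],
    "price": ["$200.00", "$100.00", "$90.00", "$1,200.00"],
    "accommodates": [4, 2, 3, 6],
})


def price_per_person(df):
    """Return nightly price divided by guest capacity (hypothetical helper)."""
    # Strip the currency symbol and thousands separator before converting.
    nightly = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    return nightly / df["accommodates"]


listings["price_per_person"] = price_per_person(listings)

# The median is less sensitive to a few very expensive listings than the mean.
median_by_area = listings.groupby("neighbourhood")["price_per_person"].median()
print(median_by_area)
```

Grouping on a different column, such as a calendar month, would give a similar summary for the seasonal question in Q3.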
There are 4 notebooks available here to showcase work related to the above questions.
- explore_part1.ipynb: load data and design new features based on the available dataset
- explore_part2.ipynb: exploratory data analysis to answer Q1-Q3
- model_part1.ipynb: prepare data for regression analysis, including the handling of categorical variables and missing data
- model_part2.ipynb: train regression model and use the trained model to answer Q4
Markdown cells in each notebook were used to assist in walking through the thought process for individual steps.
The main findings are summarized in the post available here.
Please see the Airbnb data portal for the dataset's licensing. Other than that, feel free to play with the code here.