ekapope / Predicting-condominium-price-using-data-from-webscraping

Scrape Bangkok condominium listing from hipflat.com and compare ML performance

Predicting condominium price using data from web scraping

1. Data set and explanation of the web scraping process

This project uses the Selenium library to first collect the URLs of all condominiums listed on the https://www.hipflat.com/ website, and then extracts the information from each page using the BeautifulSoup package. Hipflat is one of the biggest property listing websites in Thailand. This project focuses on condominium listings in Bangkok, both new and resale. Refer to the links below for the Python scripts.

001_Retrieve_all_urls.py

002_Scrape_info_for_each_condo.py
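As a rough illustration of the extraction step, the sketch below parses a small hard-coded HTML snippet with BeautifulSoup. The markup, class names and field names are invented for this example and are not taken from hipflat.com or from the scripts above. In a real run the HTML would come from Selenium (for example `driver.page_source`) rather than from a string.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one condo page. The real hipflat.com
# markup differs; these tags and class names are assumptions.
SAMPLE_HTML = """
<div class="condo">
  <h1 class="condo-name">Example Residence</h1>
  <span class="price-sqm">THB 120,000 / sqm</span>
  <ul class="facilities">
    <li>Swimming pool</li>
    <li>Gym</li>
  </ul>
</div>
"""


def parse_condo(html):
    """Extract a few illustrative fields from one condo page into a dict."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("h1", class_="condo-name")
    price = soup.find("span", class_="price-sqm")
    facility_list = soup.find("ul", class_="facilities")
    facilities = []
    if facility_list is not None:
        facilities = [li.get_text(strip=True) for li in facility_list.find_all("li")]
    return {
        "name": name.get_text(strip=True) if name else None,
        "price_sqm_raw": price.get_text(strip=True) if price else None,
        "facilities": facilities,
    }


record = parse_condo(SAMPLE_HTML)
print(record)
```

Returning `None` for missing elements keeps the scraper from crashing on pages that lack a field, and leaves the gaps to be handled in the cleaning step.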

2. Pre-processing and data cleaning

Check for NAs and the data type of each column. Clean each column using regex, convert numbers stored as strings to numeric types, impute missing values, and convert lists of strings into columns. Refer to the link below.

003_data_cleaning_pred_current_price.py
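The cleaning steps described above might look roughly like the following pandas sketch. The column names, sample values and the choice of median imputation are assumptions made for illustration; the actual script may handle each column differently.

```python
import pandas as pd

# Toy records with hypothetical column names; the real scraped table
# has many more columns and different labels.
raw = pd.DataFrame({
    "price_sqm": ["THB 120,000 / sqm", "THB 95,500 / sqm", None],
    "floors": ["32 floors", None, "8 floors"],
    "facilities": [["Pool", "Gym"], ["Gym"], []],
})

clean = pd.DataFrame(index=raw.index)

# Regex step: strip everything except digits, then convert to numeric.
clean["price_sqm"] = pd.to_numeric(
    raw["price_sqm"].str.replace(r"[^\d]", "", regex=True), errors="coerce"
)

# Regex step: pull the first run of digits out of a free-text field.
clean["floors"] = pd.to_numeric(
    raw["floors"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

# Impute missing feature values with the column median (one possible choice).
clean["floors"] = clean["floors"].fillna(clean["floors"].median())

# Convert lists of strings into one 0/1 indicator column per item.
dummies = raw["facilities"].str.join("|").str.get_dummies(sep="|")
clean = clean.join(dummies.add_prefix("has_"))

print(clean)
```

In this sketch the missing target value in `price_sqm` is left as NaN rather than imputed; such rows would normally be dropped before training.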

3. Data scaling, hyperparameter tuning and ML

Robust Scaler is used in the pipeline before the data is passed to the ML models. It works similarly to the Min-Max scaler, but it centers each feature on its median and scales by the interquartile range rather than the min-max range, so it is robust to outliers.

004_ML_pred_current_price.py
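To make the scaling step concrete, the short sketch below compares a manual median/IQR computation with scikit-learn's `RobustScaler` at its default settings. The numbers are made up and include one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up values for a single feature, with one extreme outlier.
x = np.array([[80.0], [90.0], [100.0], [110.0], [1000.0]])

# Manual computation: subtract the median, divide by the IQR (Q3 - Q1).
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

# RobustScaler with default settings applies the same transformation.
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the median and quartiles barely move when the outlier changes, the four ordinary values keep a sensible spread, whereas a Min-Max scaler would squeeze them into a narrow band near zero.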

Three machine learning algorithms were used in the project:

  1. Ridge
  2. RandomForestRegressor
  3. GradientBoostingRegressor
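A comparison of these three models could be set up roughly as in the sketch below. It runs on synthetic data from `make_regression`, not on the condo data, and the use of `GridSearchCV`, the parameter grids and the R^2 metric are assumptions for illustration rather than the project's actual configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the cleaned condo table.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grids; the project's real search spaces may differ.
candidates = {
    "Ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "RandomForestRegressor": (
        RandomForestRegressor(random_state=0),
        {"model__n_estimators": [50, 100]},
    ),
    "GradientBoostingRegressor": (
        GradientBoostingRegressor(random_state=0),
        {"model__learning_rate": [0.05, 0.1]},
    ),
}

scores = {}
for name, (model, grid) in candidates.items():
    # Scale inside the pipeline so each CV fold is scaled on its own training part.
    pipe = Pipeline([("scaler", RobustScaler()), ("model", model)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="r2")
    search.fit(X_train, y_train)
    scores[name] = search.score(X_test, y_test)

for name, r2 in scores.items():
    print(f"{name}: test R^2 = {r2:.3f}")
```

The scores printed by this sketch come from synthetic, linearly generated data, so they say nothing about how the models rank on the condominium data.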

The results for the three models are shown below.

Result table

Scatter plot of the Gradient Boosting Regressor results

Summary and suggestions for future improvement

Although this dataset is quite small with many features, and we can only predict the price per square meter for each condo, this study can still help buyers, resellers, agents and even developers estimate a 'fair price' as a starting point based on current actual market data.

In the web scraping step, we should acquire all listings available for each condo, not only the average price per sqm. This should substantially increase the number of records, and it would be very useful for estimating the price of each individual room in the future.

We dropped the names of public transport, supermarkets, restaurants, schools and hospitals from the base table before feeding data to the models. Finer feature engineering and variable selection could help improve predictive performance in the future.

Finally, we scraped some quarterly historical prices but did not use them in this project because of reliability issues in the data, which require more detailed verification and cleaning. This historical data could be really useful for visualizing the trends for each condo or area (which areas are growing rapidly, and which are reaching a plateau or declining).

For a detailed explanation, please refer to the PDF report.
