eKeiran / US_Home_Price_Factors_Data_Analysis

US Home Price Factors Data Analysis

GOAL: Find publicly available data for the key factors that influence US home prices nationally, then build a data science model that explains how these factors affected home prices over the last 20 years.

Final Result Summary: Ridge Regression, Random Forest Regression, and Gradient Boosting Regression models were trained and evaluated.

Of the three models, Random Forest most accurately explains how the key factors affected home prices: it has the lowest Mean Absolute Error (4.577), and its R-squared score (0.999) is tied with Gradient Boosting for the highest.

I. DATA COLLECTION: Source data comes from FRED (Federal Reserve Economic Data), https://fred.stlouisfed.org/.

II. DATA CLEANING: DataCleaning.ipynb

The data cleaning notebook does the following:

  1. Function: process_and_save_csv loads a CSV, formats it, resamples it (if needed), filters it by date range, renames its columns, and saves the cleaned file to the "CleanData" directory.
  2. Directory: Ensures the "CleanData" directory exists before anything is saved.
  3. Processing: Iterates through the list of datasets and applies the function to each one, so every dataset is cleaned and saved the same way.
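
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: only the name process_and_save_csv and the "CleanData" directory come from this README. The parameter names, the "DATE"/"VALUE" column names, the monthly resampling rule, and the demo file are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd


def process_and_save_csv(src_path, date_col, value_col, new_name,
                         start, end, resample_rule=None, out_dir="CleanData"):
    """Load one raw CSV, standardize it, and save it under out_dir.

    The signature is hypothetical; the notebook's real one is not shown here.
    """
    # Load and format: parse the date column and use it as a sorted index.
    df = pd.read_csv(src_path, parse_dates=[date_col])
    df = df.set_index(date_col).sort_index()[[value_col]]

    # Resample only when a rule is given (e.g. weekly data to monthly means).
    if resample_rule is not None:
        df = df.resample(resample_rule).mean()

    # Filter by date range and rename the value column.
    df = df.loc[start:end].rename(columns={value_col: new_name})
    df.index.name = "DATE"

    # Ensure the output directory exists, then save.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    out_path = out / f"{new_name}.csv"
    df.to_csv(out_path)
    return out_path


# Demo with a small made-up weekly series written to a temporary folder.
tmp = Path(tempfile.mkdtemp())
raw_file = tmp / "raw_rate.csv"
pd.DataFrame({
    "DATE": pd.date_range("2020-01-01", periods=10, freq="W"),
    "VALUE": range(10),
}).to_csv(raw_file, index=False)

# A list of dataset specs, each cleaned with the same function.
datasets = [
    {"src_path": raw_file, "date_col": "DATE", "value_col": "VALUE",
     "new_name": "mortgage_rate", "start": "2020-01-01", "end": "2020-02-29",
     "resample_rule": "MS", "out_dir": tmp / "CleanData"},
]
saved_paths = [process_and_save_csv(**spec) for spec in datasets]
cleaned = pd.read_csv(saved_paths[0])
```

Driving the function from a list of specs keeps each dataset's quirks (column names, frequency) in data rather than in code, which is one way to get the standardized output the list describes.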

III. EXPLORATORY DATA ANALYSIS: EDA.ipynb

The Exploratory Data Analysis (EDA) involved:

  1. Data Visualization: Using seaborn and matplotlib to create visualizations like histograms, scatter plots, and correlation heatmaps.
  2. Correlation Analysis: Analyzing the correlation between different features and the target variable using correlation matrices and pair plots.
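
As a rough illustration of the correlation analysis, the sketch below builds a correlation matrix with pandas and ranks features by the strength of their correlation with the target. The data is synthetic and the column names (home_price_index, mortgage_rate, median_income, noise) are placeholders, not the notebook's actual features.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the cleaned FRED series (assumption).
rng = np.random.default_rng(0)
n = 240  # 20 years of monthly observations
trend = np.linspace(0, 1, n)
df = pd.DataFrame({
    "home_price_index": 100 + 120 * trend + rng.normal(0, 2, n),
    "mortgage_rate": 7 - 3 * trend + rng.normal(0, 0.3, n),
    "median_income": 40 + 30 * trend + rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
})

target = "home_price_index"

# Pearson correlation between every pair of columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
target_corr = corr[target].drop(target)
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Passing corr to seaborn's heatmap function (for example sns.heatmap(corr, annot=True)) would render the same matrix as the kind of heatmap mentioned above.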

IV. MODEL TRAINING: ModelTraining.ipynb

The model training process included:

  1. Splitting the data into training and testing sets.
  2. Feature selection using Recursive Feature Elimination (RFE) with different regression models.
  3. Training and evaluating Ridge Regression, Random Forest Regression, and Gradient Boosting Regression models.
  4. Calculating Mean Absolute Error (MAE) and R-squared (R2) for each model.
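
The four steps above might look something like the following with scikit-learn. This is a sketch under stated assumptions, not the notebook's code: the data is synthetic, and the test size, hyperparameters, random seeds, and the choice of keeping three features in RFE are all illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target (assumption): only the first three
# columns actually influence y.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = 50 + 10 * X[:, 0] - 6 * X[:, 1] + 3 * X[:, 2] + rng.normal(scale=1.0, size=300)

# 1. Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest Regression": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting Regression": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    # 2. Recursive Feature Elimination wrapped around each regressor.
    selector = RFE(estimator=model, n_features_to_select=3)
    # 3. Fitting the selector also trains the model on the kept features.
    selector.fit(X_train, y_train)
    y_pred = selector.predict(X_test)
    # 4. Score on the held-out test set.
    results[name] = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred),
        "selected": selector.support_,
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:.3f}, R2={scores['R2']:.3f}")
```

RFE works here because each of the three estimators exposes either coef_ or feature_importances_, which it uses to rank and drop the weakest features one at a time.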

V. FINAL RESULT:

The trained models yielded the following results:

Ridge Regression:

Mean Absolute Error (MAE): 🟥 13.859

R-squared (R2): 🟧 0.996

Random Forest Regression:

Mean Absolute Error (MAE): 🟩 4.577

R-squared (R2): 🟩 0.999

Gradient Boosting Regression:

Mean Absolute Error (MAE): 🟧 5.888

R-squared (R2): 🟩 0.999

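The Random Forest Regression and Gradient Boosting Regression models performed better than Ridge Regression, with lower MAE values (4.577 and 5.888 versus 13.859) and slightly higher R2 scores (0.999 versus 0.996). Random Forest had the lowest MAE of the three. R2 scores this close to 1 indicate that both tree-based models fit US home prices very closely.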
