GOAL: Find available data for key factors that influence US home prices nationally, then build a data science model that explains how these factors have impacted home prices over the last 20 years.
Final Result Summary: Trained and evaluated Ridge Regression, Random Forest Regression, and Gradient Boosting Regression models.
Of the three models, Random Forest best explains how the key factors impacted home prices, achieving the highest R-squared (R²) score and the lowest Mean Absolute Error (MAE).
I. DATA COLLECTION: sourced primarily from FRED [https://fred.stlouisfed.org/], with urban population from the World Bank.
- S&P/Case-Shiller Home Price Index - https://fred.stlouisfed.org/series/CSUSHPISA
- Consumer Price Index (CPI) - https://fred.stlouisfed.org/series/CPIAUCSL
- Construction Price Index - https://fred.stlouisfed.org/series/WPUSI012011
- Employment rate - https://fred.stlouisfed.org/series/LREM64TTUSM156S
- Housing Subsidies (Federal) - https://fred.stlouisfed.org/series/L312051A027NBEA
- Interest rates - https://fred.stlouisfed.org/series/FEDFUNDS
- Per Capita GDP - https://fred.stlouisfed.org/series/A939RX0Q048SBEA
- Percent urban population - https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS?end=2021&locations=US&start=2001
- Real Median Household Income - https://fred.stlouisfed.org/series/MEHOINUSA672N
- Total households - https://fred.stlouisfed.org/series/TTLHH
- Unemployment rate - https://fred.stlouisfed.org/series/UNRATE
- Working population - https://fred.stlouisfed.org/series/LFWA64TTUSM647S
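The FRED series above can be gathered programmatically. Below is a minimal sketch that maps each series ID from the links above to its description and builds FRED's standard CSV-export URL; the dictionary keys come directly from this section, while the download snippet in the trailing comment assumes pandas and a network connection.

```python
# Series IDs taken from the FRED links listed above.
FRED_SERIES = {
    "CSUSHPISA": "S&P/Case-Shiller Home Price Index",
    "CPIAUCSL": "Consumer Price Index (CPI)",
    "WPUSI012011": "Construction Price Index",
    "LREM64TTUSM156S": "Employment rate",
    "L312051A027NBEA": "Housing subsidies (federal)",
    "FEDFUNDS": "Interest rates",
    "A939RX0Q048SBEA": "Per capita GDP",
    "MEHOINUSA672N": "Real median household income",
    "TTLHH": "Total households",
    "UNRATE": "Unemployment rate",
    "LFWA64TTUSM647S": "Working-age population",
}

def fred_csv_url(series_id: str) -> str:
    """Return the direct CSV-export URL for a FRED series."""
    return f"https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"

# Example (requires network; the date column name in the export may vary):
# import pandas as pd
# df = pd.read_csv(fred_csv_url("CSUSHPISA"))
```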
The data cleaning function does the following:
- Function: process_and_save_csv loads, formats, resamples (if needed), filters by date range, renames columns, and saves the cleaned CSV to "CleanData".
- Directory: ensures the "CleanData" directory exists.
- Processing: iterates through a list of datasets, applying the function to each for standardized cleaning and saving.
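The steps above can be sketched as follows, assuming pandas and a "DATE" column in each raw CSV; the parameter names, default date range, and monthly resampling rule are illustrative assumptions, not the original implementation.

```python
import os
import pandas as pd

def process_and_save_csv(in_path, out_name, value_col,
                         start="2003-01-01", end="2023-12-31",
                         resample=None, out_dir="CleanData"):
    """Load a raw CSV, standardize it, and save the cleaned copy to CleanData.

    Assumes the raw file has a date column named 'DATE'; the rest mirrors
    the steps listed above (resample, filter by date range, rename, save).
    """
    os.makedirs(out_dir, exist_ok=True)            # ensure CleanData exists
    df = pd.read_csv(in_path, parse_dates=["DATE"])
    df = df.set_index("DATE").sort_index()
    if resample:                                   # e.g. "MS" to align quarterly/annual series
        df = df.resample(resample).ffill()
    df = df.loc[start:end]                         # filter by date range
    df = df.rename(columns={df.columns[0]: value_col})
    out_path = os.path.join(out_dir, out_name)
    df.to_csv(out_path)
    return out_path

# Iterate over the datasets for standardized cleaning and saving:
# datasets = [("raw/UNRATE.csv", "unemployment.csv", "Unemployment_Rate"), ...]
# for in_path, out_name, col in datasets:
#     process_and_save_csv(in_path, out_name, col)
```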
The Exploratory Data Analysis (EDA) involved:
- Data Visualization: Using seaborn and matplotlib to create visualizations like histograms, scatter plots, and correlation heatmaps.
- Correlation Analysis: Analyzing the correlation between different features and the target variable using correlation matrices and pair plots.
The model training process included:
- Splitting the data into training and testing sets.
- Feature selection using Recursive Feature Elimination (RFE) with different regression models.
- Training and evaluating Ridge Regression, Random Forest Regression, and Gradient Boosting Regression models.
- Calculating Mean Absolute Error (MAE) and R-squared (R2) for each model.
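The training pipeline above can be sketched with scikit-learn as follows. Using Ridge as the RFE selector and the default hyperparameters are assumptions; the original feature counts and settings are not specified in this section.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

def train_and_evaluate(X, y, n_features=5, seed=42):
    """Split, select features with RFE, then train and score the three models."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    # RFE needs an estimator exposing coef_ or feature_importances_;
    # Ridge is used here as the selector (an assumption).
    rfe = RFE(Ridge(), n_features_to_select=n_features).fit(X_tr, y_tr)
    X_tr, X_te = rfe.transform(X_tr), rfe.transform(X_te)

    models = {
        "Ridge": Ridge(),
        "Random Forest": RandomForestRegressor(random_state=seed),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {"MAE": mean_absolute_error(y_te, pred),
                         "R2": r2_score(y_te, pred)}
    return results
```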
The trained models yielded the following results:
- Ridge Regression: Mean Absolute Error (MAE) 🟥 13.859, R-squared (R²) 🟧 0.996
- Random Forest Regression: Mean Absolute Error (MAE) 🟩 4.577, R-squared (R²) 🟩 0.999
- Gradient Boosting Regression: Mean Absolute Error (MAE) 🟧 5.888, R-squared (R²) 🟩 0.999