Fernandofinyellow / Site-Energy-Intensity-Prediction-Project

In this repo, I have shared on energy intensity prediction project to measure site EUI (Energy usage intensity) in a given year.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Site-Energy-Intensity-Prediction-Project

In this repo, I have worked on energy intensity prediction project to measure site EUI (Energy usage intensity) in a given year.

Description:

According to a report issued by the International Energy Agency (IEA), the lifecycle of buildings from construction to demolition was responsible for 37% of global energy-related and process-related CO2 emissions in 2020. Yet it is possible to drastically reduce the energy consumption of buildings by a combination of easy-to-implement fixes and state-of-the-art strategies.

The dataset consists of building characteristics, weather data for the location of the building, as well as the energy usage for the building, and the given year, measured as Site Energy Usage Intensity (Site EUI). Each row in the data corresponds to a single building observed in a given year.

Source of dataset & data dictionary - Click Here

Problem Statement:

You are provided with two datasets: (1) the train_dataset where the observed values of the Site EUI for each row are provided and (2) the x_test dataset the observed values of the Site EUI for each row are removed and provided separately in y_test. Your task is to predict the Site EUI for each row (using the complete training dataset), given the characteristics of the building and the weather data for the location of the building. Use the test sets for validation and testing. The target variable is site_eui for the predictive analytics problem.

Evaluation Metric: Root Mean Squared Error (RMSE)

Web application:

Streamlit App

Published article on Analytics Vidhya:

Article link: Click Here

Web app screenshots:

image

image

image

Tasks and techniques used:

1. Data Understanding:

  • EDA to visualize numerical features relations with target variable
  • Heatmap to visualize multicollinear variables
  • Calculated Variance Inflation factors(VIF) using statsmdodels library
  • Used scatteplot for bivariate analysis

Results:

  • 6 Columns are having missing values
  • Energy intensity means usage of electricity and heat (The more the use, more the intensity of energy and CO2 emmissions per building)
  • Bulding characteristics and energy usage over a period of year is given (Every Month's Min, Max and Mean)
  • Samples are from various of states of USA
  • We need to check relations between site conditions and usage of energy

2. Data Preparation:

  • Handled missing values using fillna() method with mean and mode of features
  • Calculated Variation inflation factors(VIFs) to handle multicollinear columns and remove most correlated columns
  • Feature engineering to aggregate the existing columns to reduce dimentionality of dataset
  • Calculated VIFs using statamodels library

Results:

  • Reduced columns from 56 to 17 to build baseline model using XGB and RF baseline

3. Modeling and evaluation

  • Built base machine learning models using XGBoostand RandomForest Regressors
  • Modeling with functional programming in python
  • Evaluation metric - Root Mean Squared Error (RMSE)
  • Hyperparameter tuning using gridserchCV

Results:

  • Baseline score on test dataset using XGBoost model was 49 and 50 for RF model
  • Tuned model brought RMSE score to 46

4. Explainable AI using SHAP

  • Used Shaply values to learn interaction of inputs with the outputs
  • Plotted features shap values charts

Results:

  • From summary plots it was found energy_star_rating and floor_area feature were the most imporatant to predict model output.

Acknowledgement: TMLC Academy

References:

  1. 11 Dimensionality reduction techniques

About

In this repo, I have shared on energy intensity prediction project to measure site EUI (Energy usage intensity) in a given year.


Languages

Language:Jupyter Notebook 100.0%