isa96 / emission-prediction

Overview

      Emissions are among the most pressing concerns of our time. High emissions have many effects we can feel, such as air pollution and climate change. As emissions from human activities increase, they build up in the atmosphere and warm the climate, leading to many other changes around the world: in the atmosphere, on land, and in the oceans (Climate Change Indicator). This is a serious problem for all of us.

      One of the factors that produces a lot of emissions is transportation, so this notebook predicts emissions from vehicles to help reduce the number of high-emission vehicles.

Exploratory Data Analysis

  1. Use data.info() to inspect each column; the dataset has 73585 rows and 12 columns.
  2. Use data.isnull().sum() to check for null or missing values in the dataset.
  3. Only a few columns are needed: Engine Size, Cylinders, Fuel Type, Fuel Consumption City, Fuel Consumption Highway (Hwy), and CO2 Emissions(g/km). The rest are removed using data = data.drop([columns], axis=1).
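The three steps above can be sketched on a tiny stand-in DataFrame. The exact column names and the extra "Make" column are assumptions for illustration; the real dataset's headers may differ.

```python
import pandas as pd

# Tiny stand-in for the real dataset. Column names are assumed from the
# list above; "Make" stands in for any column that is not needed.
data = pd.DataFrame({
    "Make": ["A", "B", "C"],
    "Engine Size": [2.0, 3.5, None],
    "Cylinders": [4, 6, 4],
    "Fuel Type": ["X", "Z", "D"],
    "Fuel Consumption City": [9.9, 12.1, 8.0],
    "Fuel Consumption Hwy": [6.7, 8.7, 6.0],
    "CO2 Emissions(g/km)": [196, 250, 190],
})

data.info()                    # dtypes and non-null counts per column
missing = data.isnull().sum()  # number of missing values per column

# Keep only the columns used for modeling and drop everything else.
keep = [
    "Engine Size", "Cylinders", "Fuel Type",
    "Fuel Consumption City", "Fuel Consumption Hwy", "CO2 Emissions(g/km)",
]
unused = [col for col in data.columns if col not in keep]
data = data.drop(unused, axis=1)
```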

image

  1. To make the fuel types readable, we replace the single-letter codes with the actual fuel type names

      from: image

      to: image
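A minimal sketch of this replacement, assuming the letter codes follow the common convention for fuel consumption datasets (X, Z, D, E, N). The notebook's actual mapping is shown only in the images, so the dictionary below is an assumption.

```python
import pandas as pd

# Assumed code-to-name mapping; not confirmed by the notebook.
FUEL_NAMES = {
    "X": "Regular Gasoline",
    "Z": "Premium Gasoline",
    "D": "Diesel",
    "E": "Ethanol (E85)",
    "N": "Natural Gas",
}

data = pd.DataFrame({"Fuel Type": ["X", "Z", "D", "E", "N", "X"]})

# Replace each letter code with its readable name.
data["Fuel Type"] = data["Fuel Type"].map(FUEL_NAMES)
```

Any code missing from the dictionary becomes NaN after map(), so a follow-up data["Fuel Type"].isnull().sum() is a quick way to confirm the mapping is complete.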

  1. Visualize the number of vehicles for each fuel type

image

      This shows that regular gasoline is the fuel type used by the most vehicles, and natural gas is used by the fewest
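The counts behind such a chart can be computed with value_counts(). The rows below are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Made-up sample; the real notebook uses the full dataset.
data = pd.DataFrame({
    "Fuel Type": ["Regular Gasoline"] * 3
                 + ["Premium Gasoline"] * 2
                 + ["Natural Gas"],
})

counts = data["Fuel Type"].value_counts()  # vehicles per fuel type
most_common = counts.idxmax()              # fuel type with the highest count
least_common = counts.idxmin()             # fuel type with the lowest count
```

With matplotlib installed, counts.plot(kind="bar") would draw the bar chart from the same series.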

  1. Then we plot the relationships between the numerical columns using scatter plots

image
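Alongside the scatter plots, a correlation matrix gives the same picture in numbers. This is a sketch on made-up values with assumed column names, not the notebook's own output.

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "Engine Size": [1.5, 2.0, 3.0, 4.0, 5.0],
    "Cylinders": [4, 4, 6, 6, 8],
    "Fuel Consumption City": [7.5, 9.0, 11.5, 13.0, 16.0],
    "CO2 Emissions(g/km)": [150, 190, 250, 300, 360],
    "Fuel Type": ["X", "X", "Z", "Z", "D"],
})

# Correlation is only defined for numeric columns, so select those first.
numeric = data.select_dtypes(include="number")
corr = numeric.corr()  # pairwise Pearson correlation

# Correlation of every numeric feature with the target column.
target_corr = corr["CO2 Emissions(g/km)"].drop("CO2 Emissions(g/km)")
```

With matplotlib installed, data.plot.scatter(x="Engine Size", y="CO2 Emissions(g/km)") would draw one of the scatter plots.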

Data Preprocessing

  1. Because the fuel type column contains non-numerical values, we encode it into numerical values using pd.get_dummies()

image
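A small sketch of what pd.get_dummies() does to the fuel type column, using made-up rows and an assumed column name:

```python
import pandas as pd

data = pd.DataFrame({
    "Engine Size": [2.0, 3.5, 2.0],
    "Fuel Type": ["Regular Gasoline", "Diesel", "Regular Gasoline"],
})

# One-hot encode the categorical column: each fuel type becomes its own
# indicator column named "<column>_<value>".
encoded = pd.get_dummies(data, columns=["Fuel Type"])
```

The original "Fuel Type" column is replaced by one indicator column per distinct value, while the numeric columns are left unchanged.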

  1. Define x and y: x contains all the independent variables, and y contains the label (the dependent variable).
  2. Split x and y into x_train, x_test, y_train, y_test using train_test_split. In this case I use 80% for train_size and 20% for test_size.
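The two steps above can be sketched as follows. The data is made up, the target column name is assumed, and random_state is added here only to make the split reproducible; the notebook may not set it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 10 rows so the 80/20 split is easy to check.
data = pd.DataFrame({
    "Engine Size": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0],
    "Cylinders": [4, 4, 4, 6, 6, 6, 8, 8, 8, 12],
    "CO2 Emissions(g/km)": [150, 180, 200, 230, 250, 280, 300, 330, 350, 400],
})

x = data.drop(columns=["CO2 Emissions(g/km)"])  # independent variables
y = data["CO2 Emissions(g/km)"]                 # label / dependent variable

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=42  # random_state is an assumption
)
```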

Modeling

      For modeling I use 4 models:

  1. Linear Regression with estimator LinearRegression(fit_intercept=False, n_jobs=30)
  2. Ridge Regression with estimator Ridge(alpha=2.0, solver='svd')
  3. Random Forest Regression with estimator RandomForestRegressor(max_depth=50, max_features=None, min_samples_split=8)
  4. Neural Network with layers like this:

image
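A sketch of fitting the three scikit-learn estimators listed above with the stated hyperparameters. The training data is synthetic, and random_state is an added assumption for reproducibility. The neural network is left out because its layers are shown only in the image.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in for x_train / y_train (three numeric features).
rng = np.random.default_rng(0)
x_train = rng.uniform(1.0, 6.0, size=(200, 3))
y_train = (40.0 * x_train[:, 0] + 10.0 * x_train[:, 1]
           + 5.0 * x_train[:, 2] + rng.normal(0.0, 1.0, size=200))

# Estimators with the hyperparameters given in the list above.
models = {
    "linear": LinearRegression(fit_intercept=False, n_jobs=30),
    "ridge": Ridge(alpha=2.0, solver="svd"),
    "forest": RandomForestRegressor(
        max_depth=50, max_features=None, min_samples_split=8,
        random_state=0,  # assumption: not stated in the README
    ),
}

for model in models.values():
    model.fit(x_train, y_train)

# Predictions for the first five rows from each fitted model.
predictions = {name: m.predict(x_train[:5]) for name, m in models.items()}
```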

Result

      For each model I got the following accuracy and MAE:

image

      The best MAE and accuracy go to Random Forest Regression. I also saved my models: as pickle files, and as js for the TensorFlow (deep learning) model
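A minimal sketch of evaluating a model and saving it with pickle, using synthetic data. For a scikit-learn regressor, score() returns R²; whether the "accuracy" reported above is that value is an assumption. The TensorFlow export is not shown.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data, split 80/20 by position.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 6.0, size=(200, 3))
y = 40.0 * x[:, 0] + 10.0 * x[:, 1] + rng.normal(0.0, 1.0, size=200)
x_train, x_test = x[:160], x[160:]
y_train, y_test = y[:160], y[160:]

model = RandomForestRegressor(
    max_depth=50, max_features=None, min_samples_split=8, random_state=0
).fit(x_train, y_train)

mae = mean_absolute_error(y_test, model.predict(x_test))  # lower is better
r2 = model.score(x_test, y_test)                          # R^2 on the test set

# Serialize the fitted model and load it back.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

To write the model to disk instead, pickle.dump(model, f) with a file opened in "wb" mode does the same thing; the file name is up to you.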
