
California-Housing-Price-Prediction--Regression

1. Workflow

Housing Data ---> Data Preprocessing ---> Train Test Split ---> Model Fitting ---> Prediction

2.1. What is Linear Regression?

Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things:

(1) Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?

(2) Which variables in particular are significant predictors of the outcome variable, and in what way do they impact the outcome variable, as indicated by the magnitude and sign of the beta estimates?

These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. The simplest form of the regression equation with one dependent and one independent variable is defined by the formula y = c + b*x,

where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.
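As a quick illustration, here is a minimal sketch of fitting y = c + b*x with scikit-learn; the numbers below are made up for the example and are not from the housing data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data for illustration: y depends roughly linearly on x
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # independent variable
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])           # dependent variable

model = LinearRegression()
model.fit(x, y)

# c (constant) and b (regression coefficient) from y = c + b*x
print("c =", model.intercept_)
print("b =", model.coef_[0])
print("prediction for x=6:", model.predict([[6.0]])[0])
```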

2.2. What is Decision Tree Regression?

Decision tree regression observes the features of an object and trains a tree-structured model to predict future data, producing meaningful continuous output. Continuous output means that the output is not discrete, i.e., it is not represented by just a discrete, known set of numbers or values.

Discrete output example: A weather prediction model that predicts whether or not there will be rain on a particular day.

Continuous output example: A profit prediction model that states the probable profit that can be generated from the sale of a product.
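A minimal sketch of a decision tree producing continuous output; the toy values are invented for illustration:

```python
from sklearn.tree import DecisionTreeRegressor

# Toy continuous-output data, made up for this example
X = [[1.0], [2.0], [3.0], [4.0]]
y = [10.0, 20.0, 35.0, 50.0]

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)
print(tree_reg.predict([[2.5]]))  # a continuous value, not a class label
```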

2.3. What is Random Forest Regression?

A Random Forest is an ensemble technique capable of performing both regression and classification tasks using multiple decision trees and a technique called Bootstrap Aggregation, commonly known as bagging. The basic idea is to combine multiple decision trees in determining the final output rather than relying on individual decision trees. Random Forest has multiple decision trees as base learning models: we randomly perform row sampling and feature sampling from the dataset, forming a sample dataset for every model. This part is called Bootstrap. We approach the Random Forest regression technique like any other machine learning technique (a minimal sketch follows the list):

1. Design a specific question or problem and determine the source of the required data.
2. Make sure the data is in an accessible format; otherwise, convert it to the required format.
3. Note all noticeable anomalies and missing data points that may need handling.
4. Create a machine learning model.
5. Set the baseline model that you want to achieve.
6. Train the machine learning model.
7. Gain insight into the model with test data.
8. Compare the performance metrics of the test data and the predicted data from the model.
9. If it does not meet your expectations, try improving your model, updating your data, or using another data modelling technique.
10. Interpret the results you have gained and report accordingly.
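Below is a minimal Random Forest regression sketch on synthetic data (the project itself fits the housing set); the make_regression data and the hyperparameter values are illustrative choices, not values from this notebook:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# n_estimators = number of decision trees; bootstrap=True enables row sampling
# and max_features controls feature sampling (the "Bootstrap" part above)
forest_reg = RandomForestRegressor(
    n_estimators=100, bootstrap=True, max_features=1.0, random_state=42
)
forest_reg.fit(X, y)
print(forest_reg.predict(X[:3]))  # aggregated (averaged) output of all trees
```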

3. Importing Libraries
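The exact import cell lives in the notebook; this is a plausible sketch of the libraries a project like this needs, assumed rather than copied from the notebook:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
```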

4. Loading Dataset

""" About data: No Column Non-Null Count Dtype ---- ------ --------------- ------- 0 longitude 20640 non-null float64 1 latitude 20640 non-null float64 2 housing_median_age 20640 non-null float64 3 total_rooms 20640 non-null float64 4 total_bedrooms 20433 non-null float64 5 population 20640 non-null float64 6 households 20640 non-null float64 7 median_income 20640 non-null float64 8 median_house_value 20640 non-null float64 9 ocean_proximity 20640 non-null object """

Visualization of Dataset
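A sketch of the kind of plots used here, assuming the housing DataFrame from the loading step above; the longitude/latitude scatter is a common view of this dataset:

```python
import matplotlib.pyplot as plt

# Histogram of every numeric feature to eyeball distributions and outliers
housing.hist(bins=50, figsize=(15, 10))
plt.show()

# Geographic scatter: point size tracks population, color tracks house value
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True)
plt.legend()
plt.show()
```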

Preparing Data for Regression

What is Stratified Shuffle Split?

Stratified sampling aims at splitting a data set so that each split is similar with respect to something. In a classification setting, it is often chosen to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.

We already used train test split, so why Stratified Shuffle Split? What is the difference between Stratified Shuffle Split and train test split?

Train Test Split: It is used for classification or regression problems with any supervised learning algorithm. The procedure involves taking a dataset and dividing it into two subsets (a training dataset and a testing dataset).

Stratified Shuffle Split: Using StratifiedShuffleSplit, the proportion of class labels is distributed almost evenly between the train and test datasets.

Why are we using Stratified Shuffle Split? When we split our data into train data and test data, we want the training data to have approximately the same distribution of values as the original data set did.

Syntax:

from sklearn.model_selection import StratifiedShuffleSplit
sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None)
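A usage sketch, continuing from the loading sketch above; binning median_income into an income_cat column is an assumed (but common) choice of stratification key for this dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Assumption: stratify on a binned copy of median_income so the income
# distribution looks the same in the train and test splits
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]  # assumes a default RangeIndex
    strat_test_set = housing.loc[test_index]
```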

What is Simple Imputer?

SimpleImputer is a scikit-learn class which is helpful in handling the missing data in the predictive model dataset. It replaces the NaN values with a specified placeholder.
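A sketch of imputing the missing total_bedrooms values, continuing from the sketches above; the strategy value here is illustrative (the Conclusion below reports most_frequent as the best strategy for this project):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# total_bedrooms has 207 missing values (20640 - 20433 in the summary above).
# strategy can be "mean", "median", "most_frequent", or "constant"
imputer = SimpleImputer(strategy="median")

housing_num = housing.drop("ocean_proximity", axis=1)  # numeric columns only
filled = imputer.fit_transform(housing_num)            # NaNs replaced
housing_num = pd.DataFrame(filled, columns=housing_num.columns,
                           index=housing_num.index)
```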

Scaling Data

What is Standard Scaler?

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance.

Unit variance means dividing all the values by the standard deviation.

StandardScaler makes the mean of the distribution 0.

For normally distributed data, about 68% of the values will lie between -1 and 1.

What does scaler transform do?

The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value of 0 and a standard deviation of 1. In the case of multivariate data, this is done feature-wise (in other words, independently for each column of the data).
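A small demonstration with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

scaled = StandardScaler().fit_transform(data)  # (x - mean) / std, per column

print(scaled.mean(axis=0))  # ~[0. 0.]: each feature now has mean 0
print(scaled.std(axis=0))   # ~[1. 1.]: and unit variance
```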

What is difference between normalization and standardization ?

Normalization typically means rescaling the values into a range of [0, 1].

Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1.
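A side-by-side sketch on made-up numbers, using MinMaxScaler for normalization:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [5.0], [10.0]])

print(MinMaxScaler().fit_transform(data).ravel())    # normalization: into [0, 1]
print(StandardScaler().fit_transform(data).ravel())  # standardization: mean 0, std 1
```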

Summary:

So, here we had housing data for California. First we visualized the data, then we looked for missing or null values and replaced or removed them using various techniques. We then split our data into training data and testing data, visualized the training data to verify a proper distribution, trained the models, and calculated the RMSE mean, standard deviation, etc. And we were done!
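One common way to obtain the RMSE mean and standard deviation mentioned above is k-fold cross-validation; a sketch in which housing_prepared and housing_labels are assumed names for the preprocessed feature matrix and target from the steps above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# housing_prepared / housing_labels are assumed to come from the
# preprocessing sketches above
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # scores are negative MSE, so flip the sign

print("Mean:", rmse_scores.mean())
print("Std:", rmse_scores.std())
```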

Here we just used Linear Regression, Decision Tree Regression, Random Forest Regression, and Support Vector Regression. You can use any regression model; check out the sklearn documentation for more information.

Note: Please get familiar with the following, because without them we cannot perform regression on this dataset (a short sketch of the categorical encoders follows the list):

- train_test_split
- Stratified Shuffle Split
- Simple Imputer
- Ordinal Encoding
- One Hot Encoding
- Standard Scaler
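A sketch of the two categorical encoders from the list, applied to ocean_proximity and assuming the housing DataFrame from earlier:

```python
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

housing_cat = housing[["ocean_proximity"]]  # the one categorical column

# Ordinal Encoding: each category becomes an integer (implies an order)
print(OrdinalEncoder().fit_transform(housing_cat)[:5])

# One Hot Encoding: one binary column per category, with no implied order;
# the ISLAND column mentioned in the Conclusion comes from this step
print(OneHotEncoder().fit_transform(housing_cat).toarray()[:5])
```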

Conclusion

The best imputer strategy is most_frequent and apparently almost all features are useful (15 out of 16). The last one (ISLAND) seems to just add some noise.

Overall, the model performs well.

If you liked my work, do upvote. Hope it's useful!

