AdroitAnandAI / Hand-coded-SGD-vs-Sklearn-OLS-vs-Batch-Gradient-Descent-Analysis

Performance comparison of self-coded Stochastic Gradient Descent vs Sklearn OLS vs Batch Gradient Descent

Stochastic Gradient Descent vs Batch GD vs Sklearn’s OLS

Objective

  1. To implement Stochastic Gradient Descent (SGD) from scratch, following the gradient descent logic, to minimize the cost and find the best fit.

  2. Compare and analyse the difference in outcome between the self-implemented SGD and sklearn’s Ordinary Least Squares (OLS) implementation, using graphical plots.

  3. Implement Batch Gradient Descent and compare the outcome, both in timing and in results.

At a glance

  1. Stochastic Gradient Descent (SGD) is implemented and the cost is reported every 100 iterations. It has been tested for different batch sizes and iteration counts to find the difference in RMSE, depicted graphically using scatter plots. The formulas used in the SGD implementation are given in the report below.

  2. Sklearn’s Ordinary Least Squares (OLS) is run on the same dataset, and timing and error evaluation has been done for a head-to-head comparison. The Batch Gradient Descent algorithm is also implemented for comparison.

  3. The timing of all four methods (Batch Gradient Descent, Stochastic GD, low-k SGD and Sklearn’s OLS) has been compared. The PDF of errors is plotted with kdeplot to identify how far each distribution deviates from the actual target-value distribution. The summary of results and the conclusion are provided at the end of the report.

Data Source:

Boston Dataset from Sklearn Datasets.
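
A minimal sketch of how the data can be loaded, assuming a scikit-learn version earlier than 1.2 (load_boston was deprecated and later removed from the library):

```python
# Load the Boston house-price data bundled with older scikit-learn releases.
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston.data, boston.target   # X: (506, 13) features, y: MEDV in $1000's
print(boston.feature_names)
```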

Dataset Details on Boston House Prices

Characteristics:

  • Number of Instances: 506
  • Number of Attributes: 13 numeric/categorical predictive
  • Median Value (attribute 14) is usually the target

Attribute Information (in order):

  • CRIM per capita crime rate by town
  • ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS proportion of non-retail business acres per town
  • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX nitric oxides concentration (parts per 10 million)
  • RM average number of rooms per dwelling
  • AGE proportion of owner-occupied units built prior to 1940
  • DIS weighted distances to five Boston employment centres
  • RAD index of accessibility to radial highways
  • TAX full-value property-tax rate per $10,000
  • PTRATIO pupil-teacher ratio by town
  • B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in $1000's

Missing Attribute Values: None

Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of the UCI ML housing dataset: http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter. The Boston house-price data has been used in many machine learning papers that address regression problems.

References

  • Belsley, Kuh & Welsch, 'Regression Diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980.
  • Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
  • many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Hand Coding of Stochastic Gradient Descent

The mean-squared-error cost is computed at every iteration, and the derivative of the cost with respect to the weights w and the bias b is calculated at each step.

Each parameter is then updated by subtracting its gradient scaled by the learning rate, i.e. params = params - learning_rate * params_gradient. A sketch of this logic is given below.
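
A minimal sketch of the mini-batch SGD logic described above (not the exact notebook code; the function name and hyperparameter defaults are illustrative):

```python
import numpy as np

def sgd_linear_regression(X, y, k=10, lr=0.01, iters=1000, seed=0):
    """Mini-batch SGD for linear regression; k is the batch size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for it in range(1, iters + 1):
        idx = rng.choice(n, size=k, replace=False)   # random mini-batch
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b - yb
        # Gradients of the MSE cost w.r.t. w and b on this mini-batch
        grad_w = (2.0 / k) * (Xb.T @ err)
        grad_b = (2.0 / k) * err.sum()
        # Update rule: params = params - learning_rate * params_gradient
        w -= lr * grad_w
        b -= lr * grad_b
        if it % 100 == 0:
            cost = np.mean((Xb @ w + b - yb) ** 2)
            print(f"Cost of iteration #{it} = {round(cost, 2)}")
    return w, b
```

Batch Gradient Descent corresponds to the same update rule with k set to the full training-set size (339 points here), so every iteration uses all data points.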

Low K, Stochastic Gradient Descent: Cost Analysis
Cost of iteration #100 = 72.81
Cost of iteration #200 = 59.0
Cost of iteration #300 = 159.28
Cost of iteration #400 = 17.85
Cost of iteration #500 = 1934.25
Cost of iteration #600 = 30.67
Cost of iteration #700 = 394.6
Cost of iteration #800 = 575.87
Cost of iteration #900 = 11.97
Cost of iteration #1000 = 98.75
Cost of iteration #1100 = 190.68
Cost of iteration #1200 = 13.14
Cost of iteration #1300 = 7.5
Cost of iteration #1400 = 85.12
Cost of iteration #1500 = 49.15
Cost of iteration #1600 = 54.78
Cost of iteration #1700 = 10.78
Cost of iteration #1800 = 10.68
Cost of iteration #1900 = 30.55
Cost of iteration #2000 = 123.18

Stochastic Gradient Descent: Cost Analysis
Cost of iteration #100 = 14.76
Cost of iteration #200 = 52.03
Cost of iteration #300 = 232.23
Cost of iteration #400 = 31.24
Cost of iteration #500 = 15.02
Cost of iteration #600 = 4.38
Cost of iteration #700 = 24.46
Cost of iteration #800 = 22.23
Cost of iteration #900 = 9.26
Cost of iteration #1000 = 7.12

Batch Gradient Descent: Cost Analysis
Iteration #100 Cost = 19.69
Iteration #200 Cost = 19.56
Iteration #300 Cost = 19.55
Iteration #400 Cost = 19.55
Iteration #500 Cost = 19.55
Iteration #600 Cost = 19.55
Iteration #700 Cost = 19.55
Iteration #800 Cost = 19.55
Iteration #900 Cost = 19.55
Iteration #1000 Cost = 19.55

RMSE of Low K SGD = 11.1
RMSE of SGD = 2.67
RMSE of GD = 4.42

Percentage change in Weight Vectors from GD to SGD = 0.2%
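
A sketch of how these evaluation numbers can be computed; X_test, y_test and the learned parameters (w_sgd, b_sgd, w_gd) are assumed to come from the steps above, and the relative L2 norm is one possible way to express the weight-vector change:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the SGD fit on held-out data
rmse_sgd = np.sqrt(mean_squared_error(y_test, X_test @ w_sgd + b_sgd))
print(f"RMSE of SGD = {round(rmse_sgd, 2)}")

# Relative change between the Batch GD and SGD weight vectors, in percent
pct_change = 100 * np.linalg.norm(w_sgd - w_gd) / np.linalg.norm(w_gd)
print(f"Percentage change in Weight Vectors from GD to SGD = {round(pct_change, 1)}%")
```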

(Figures: hc1, hc2, hc3 - scatter plots of the hand-coded SGD and Batch GD results)

Linear Regression using Sklearn’s OLS

RMSE = 5.31

(Figure: ols - Sklearn OLS results)
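
A minimal sketch of the OLS baseline, assuming the usual train/test split of the Boston data into X_train, X_test, y_train, y_test:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

ols = LinearRegression()          # ordinary least squares fit
ols.fit(X_train, y_train)
rmse_ols = np.sqrt(mean_squared_error(y_test, ols.predict(X_test)))
print("RMSE =", round(rmse_ols, 2))
```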

Timing Comparison of SGD, Batch GD, Sklearn & Low K SGD

Time Taken by Low K SGD is 1.68 seconds when k = 5
Time Taken by SGD is 1.75 seconds when k = 10
Time Taken by Batch GD is 1.96 seconds when k = 339
Time Taken by Sklearn OLS is 0.0 seconds
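
A sketch of how each run can be timed; sgd_linear_regression refers to the sketch earlier in this report:

```python
import time

start = time.time()
w_sgd, b_sgd = sgd_linear_regression(X_train, y_train, k=10, lr=0.01, iters=1000)
print(f"Time Taken by SGD is {round(time.time() - start, 2)} seconds when k = 10")
```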

Error Comparison of SGD, Batch GD & Sklearn’s OLS

(Figures: pdf_ols, pdf_gd, pdf_pv - kdeplots of the error and predicted-value distributions)
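
A sketch of the error-distribution comparison with seaborn's kdeplot; y_test and the fitted OLS model are assumed from the earlier sketches:

```python
import matplotlib.pyplot as plt
import seaborn as sns

errors_ols = y_test - ols.predict(X_test)   # residuals of the OLS fit
sns.kdeplot(errors_ols, label="OLS errors")
sns.kdeplot(y_test, label="Actual MEDV")
plt.legend()
plt.show()
```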

Summary

(Figure: summary of results)

Conclusion

  1. The scatter plots of the Stochastic Gradient Descent and Batch Gradient Descent predictions show a very similar pattern, and their RMSE values are almost the same. Hence, the stochastic variant of gradient descent yields a decent approximation of Batch GD, which uses all data points in each iteration.

  2. The RMSE of Stochastic Gradient Descent is the lowest among the compared algorithms. The RMSE fluctuates a little because the algorithm is inherently stochastic, but the consistently low values signify that the method is working fine.

  3. RMSE of SGD < Batch GD < Sklearn’s OLS < Low K SGD. A low batch size increases the error significantly.

  4. Sklearn’s OLS is by far the fastest, but its RMSE is higher. There is also a significant reduction in time when we run Stochastic GD instead of Batch GD.

  5. The scatter plot of Low K SGD is more perturbed than the SGD scatter plot. The SGD plot is more linear, which signifies less deviation/error.

  6. When k is low (we have taken k = 5), the minimized MSE is found to be high. But when we increase the number of iterations, the minimum cost moves towards the optimum. Hence, a lower k requires more iterations.

  7. The PDF of errors of Sklearn’s OLS is centered around 0. From the plot, there are noticeably more errors on the negative side; to improve the solution, we have to reduce the errors on that side.

  8. The PDF of predicted values is centered around 20. As the error PDF lies well to the left of the predicted-value PDF, the percentage of errors is acceptable.

  9. The PDF of errors of Batch Gradient Descent is similar to that of Sklearn’s OLS, so the error in fit should be almost the same. However, the error PDF of the SGD implementation is shifted well to the negative side, hence its errors are larger.

  10. The error-distribution kdeplot of the SGD implementation would move closer to Sklearn’s as the batch size (k) increases. As we take more points in each iteration, the approximation error reduces, though it takes more time.
