scatterplot-matrix ozone variance-decompotion-proportion vif rmse alternating-conditional-expectation qq-plot shapiro-wilk breusch-pagan ridge-regression principal-components-regression aic mallows-cp stepwise-selection scree-plot box-cox auto-arima prediction multiple-time-series r

regressionProjectIITK

This repository contains a group project for the course MTH416A: Regression Analysis, carried out during the 2021-2022 academic session (even semester) at IIT Kanpur.

Project Title:

Ozone concentration and meteorology in the LA Basin, 1976 - A Regression Study [Report] [Presentation]

Project Guide

Prof. Sharmishtha Mitra, Department of Mathematics and Statistics, IIT Kanpur

Project Members:

Project Outline

Setup	Topic
	1. Introduction
	2. Data Description
	3. Exploratory Data Analysis
Parametric	4. Multicollinearity
	Detection: Eigen-decomposition Proportion, Variance Inflation Factor. Remedy: Variable Drop (Model A), Ridge Regression (Model B), Principal Components Regression (Model C).
	5. Variable Selection
	Selection Methods: Best Subset Selection, Mallows's $C_p$, Adjusted $R^2$, AIC vs. p Plot, Scree Plot and Validation Plot.
	6. Heteroscedasticity of Errors
	Detection: Breusch-Pagan Test. Remedy: Box-Cox Transformation.
	7. Normality of Errors
	Detection: Q-Q Plot, Shapiro-Wilk Test.
	8. Autocorrelation
	Detection: $\epsilon_t$ vs. $\epsilon_{t-1}$ Plot, Durbin-Watson Test. Remedy: ARIMA Fitting.
	9. Prediction
Nonparametric	10. Alternating Conditional Expectation (ACE): Optimal Transformations Plot
	11. Final Model Fit and Predictions
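
The detection steps in the outline can be illustrated with a minimal base-R sketch. It is not the project's code: it assumes R's built-in `airquality` data as a stand-in for the LA Basin ozone data, computes the variance inflation factors by hand, and uses the studentized (Koenker) variant of the Breusch-Pagan statistic, which may differ from the variant used in the report. All variable names are illustrative.

```r
# Stand-in data: R's built-in airquality set (New York, 1973), used only so the
# sketch runs; it is NOT the project's LA Basin ozone data.
dat <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
predictors <- c("Solar.R", "Wind", "Temp")
fit <- lm(reformulate(predictors, response = "Ozone"), data = dat)

# Multicollinearity: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the remaining predictors.
vif <- sapply(predictors, function(p) {
  aux <- lm(reformulate(setdiff(predictors, p), response = p), data = dat)
  1 / (1 - summary(aux)$r.squared)
})

# Heteroscedasticity: studentized (Koenker) form of the Breusch-Pagan test.
# The statistic is n * R^2 from regressing the squared residuals on the
# predictors, compared with a chi-squared distribution.
dat$e2 <- residuals(fit)^2
bp_aux <- lm(reformulate(predictors, response = "e2"), data = dat)
bp_stat <- nrow(dat) * summary(bp_aux)$r.squared
bp_p <- pchisq(bp_stat, df = length(predictors), lower.tail = FALSE)

# Normality of errors: Shapiro-Wilk test on the residuals.
sw <- shapiro.test(residuals(fit))

# Autocorrelation: Durbin-Watson statistic, computed directly from the
# residuals taken in row order (assumed here to be time order).
e <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)

print(round(vif, 2))
print(c(BP = bp_stat, BP_p = bp_p, SW_p = sw$p.value, DW = dw))
```

A VIF well above 1 points to collinearity for that predictor, small p-values from the Breusch-Pagan and Shapiro-Wilk tests argue against constant variance and normality respectively, and a Durbin-Watson value far from 2 suggests first-order autocorrelation.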

Summary of Fitted Models:

Model Type	Model Name	$R^2$	RMSE
Parametric	Model 0	0.6986	4.2745
	Model A	0.7662	0.8272
	Model B	0.7202	0.8830
	Model C	0.7077	1.2565
Nonparametric	ACE	0.8271	0.3132
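
As a rough illustration of how the two columns in this table can be obtained, the sketch below computes $R^2$ and RMSE from observed and fitted values. It makes several assumptions: the built-in `airquality` data stands in for the project's data, the metrics are computed in-sample, and RMSE is taken as the square root of the mean squared residual. This README does not state which of these conventions the report follows, and `fit_metrics` is a hypothetical helper name.

```r
# Stand-in data and model; NOT the project's LA Basin data or any of its models.
dat <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = dat)

# Hypothetical helper: R^2 and RMSE from observed and fitted values.
fit_metrics <- function(observed, fitted_values) {
  resid <- observed - fitted_values
  sse <- sum(resid^2)                        # residual sum of squares
  sst <- sum((observed - mean(observed))^2)  # total sum of squares
  c(R2 = 1 - sse / sst, RMSE = sqrt(mean(resid^2)))
}

m <- fit_metrics(dat$Ozone, fitted(fit))
print(round(m, 4))
```

RMSE is expressed in the units of the response the model was fitted to, so a value computed on a transformed response (for example after a Box-Cox transformation) is on a different scale from one computed on the original response.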

Conclusions:

Among the parametric models, Model A has the highest $R^2$ and the lowest RMSE.
Models A, B and C all improve on the baseline Model 0, which supports the corrections applied for multicollinearity, heteroscedasticity and autocorrelation, together with variable selection.
Simple nonparametric models are generally preferred when prediction is the goal. Here, the ACE model transforms the response and the predictors so as to maximize $R^2$ and, as expected, it attains the highest $R^2$ and the lowest RMSE among all the models.
Among the models considered here, the ACE model is therefore the best choice, both for prediction and for explaining ozone concentration through the meteorological variables in the ozone dataset.
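
To make the idea behind ACE concrete, here is a simplified base-R sketch of the alternating scheme described by Breiman and Friedman (1985): smooth each predictor against the partial residual, then smooth the response against the sum of the predictor transformations, and repeat. It is only a sketch under stated assumptions: it uses `stats::supsmu` as the smoother, runs a fixed number of iterations instead of checking convergence, and uses `airquality` as stand-in data. The function names `smooth_on` and `ace_sketch` are hypothetical, and this is not the implementation used in the project.

```r
# Smooth `target` against `x` with Friedman's super smoother and return the
# fitted values in the original observation order.
smooth_on <- function(x, target) {
  s <- supsmu(x, target)                 # fitted curve at sorted unique x
  approx(s$x, s$y, xout = x, rule = 2)$y
}

# Simplified ACE: finds theta(y) and phi_j(x_j) that reduce
# mean((theta(y) - sum_j phi_j(x_j))^2) subject to var(theta) = 1.
ace_sketch <- function(X, y, n_iter = 20) {
  X <- as.matrix(X)
  standardize <- function(v) {
    v <- v - mean(v)
    v / sqrt(mean(v^2))
  }
  theta <- standardize(y)
  phi <- matrix(0, nrow(X), ncol(X), dimnames = list(NULL, colnames(X)))
  for (iter in seq_len(n_iter)) {
    # Backfitting pass: update each predictor transformation in turn.
    for (j in seq_len(ncol(X))) {
      partial <- theta - rowSums(phi[, -j, drop = FALSE])
      f <- smooth_on(X[, j], partial)
      phi[, j] <- f - mean(f)
    }
    # Update the response transformation and rescale to unit variance.
    theta <- standardize(smooth_on(y, rowSums(phi)))
  }
  list(theta = theta, phi = phi,
       r2 = 1 - mean((theta - rowSums(phi))^2))
}

# Stand-in data; NOT the project's LA Basin ozone data.
dat <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
res <- ace_sketch(dat[, c("Solar.R", "Wind", "Temp")], dat$Ozone)
print(res$r2)
```

Plotting `res$theta` against `dat$Ozone`, and each column of `res$phi` against its predictor, gives the kind of optimal transformations plot listed in the outline. For real analyses an established implementation, such as `ace` in the `acepack` package, is the safer choice.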

References:

Breiman, L.; Friedman, J. H. (1985). "Estimating Optimal Transformations for Multiple Regression and Correlation". Journal of the American Statistical Association. 80 (391): 580–598.
Jolliffe, I. T. (1982). "A Note on the Use of Principal Components in Regression". Journal of the Royal Statistical Society, Series C. 31 (3): 300–303. doi:10.2307/2348005. JSTOR 2348005.
Park, S. H. (1981). "Collinearity and Optimal Restrictions on Regression Parameters for Estimating Responses". Technometrics. 23 (3): 289–295. doi:10.2307/1267793.
Wilkinson, L.; Dallal, G. E. (1981). "Tests of Significance in Forward Selection Regression with an F-to-Enter Stopping Rule". Technometrics. 23: 377–380.
Akaike, H. (1973). "Information theory and an extension of the maximum likelihood principle". In Petrov, B. N.; Csáki, F. (eds.). 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 2-8, 1971. Budapest: Akadémiai Kiadó. pp. 267–281. Republished in Kotz, S.; Johnson, N. L. (eds.) (1992). Breakthroughs in Statistics, I. Springer-Verlag. pp. 610–624.
Akaike, H. (1974). "A new look at the statistical model identification". IEEE Transactions on Automatic Control. 19 (6): 716–723. doi:10.1109/TAC.1974.1100705. MR 0423716.
Shapiro, S. S.; Wilk, M. B. (1965). "An analysis of variance test for normality (complete samples)". Biometrika. 52 (3–4): 591–611. doi:10.1093/biomet/52.3-4.591. JSTOR 2333709. MR 0205384. p. 593.
Breusch, T. S.; Pagan, A. R. (1979). "A Simple Test for Heteroskedasticity and Random Coefficient Variation". Econometrica. 47 (5): 1287–1294. doi:10.2307/1911963. JSTOR 1911963. MR 0545960.
Box, G. E. P.; Cox, D. R. (1964). "An analysis of transformations". Journal of the Royal Statistical Society, Series B. 26 (2): 211–252. JSTOR 2984418. MR 0192611.
Durbin, J.; Watson, G. S. (1950). "Testing for Serial Correlation in Least Squares Regression, I". Biometrika. 37 (3–4): 409–428. doi:10.1093/biomet/37.3-4.409. JSTOR 2332391.
Durbin, J.; Watson, G. S. (1951). "Testing for Serial Correlation in Least Squares Regression, II". Biometrika. 38 (1–2): 159–179. doi:10.1093/biomet/38.1-2.159. JSTOR 2332325.
Faraway, J. J. (2004). Linear Models with R (1st ed.). Chapman and Hall/CRC. https://doi.org/10.4324/9780203507278
Hoerl, A. E.; Kennard, R. W.; Baldwin, K. F. (1975). "Ridge regression: Some simulations". Communications in Statistics - Theory and Methods. 4 (2): 105–123.

Languages

R 100.0%