Project for STATS 503 (Multivariate Regression) at the University of Michigan, where we were tasked with analyzing a public dataset using statistical modeling techniques in R.
The goal is to determine which features are most likely to lead to a sale, using the Online Shoppers Purchasing Intention dataset from UCI's ML archive. This README summarizes the major steps and insights; the finer details are in models.md. An RPubs version can be found here.
Many of the modeling techniques are taken from the following packages:
- e1071 (for Naive Bayes and SVM)
- class (for KNN)
Many of the columns have to be converted to categorical variables (factors in R).
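As a minimal sketch of that conversion, the snippet below turns a set of columns into factors. The toy data frame and the choice of which columns count as categorical are assumptions for illustration; the actual list used in this project is in models.md.

```r
# Toy stand-in for the dataset (made-up rows); the real data would come from read.csv().
shoppers <- data.frame(
  OperatingSystems = c(1, 2, 2, 3),
  Browser          = c(1, 1, 2, 4),
  Region           = c(1, 3, 1, 9),
  TrafficType      = c(2, 2, 3, 1),
  Month            = c("Feb", "Mar", "Nov", "Nov"),
  VisitorType      = c("Returning_Visitor", "New_Visitor",
                       "Returning_Visitor", "Other"),
  Weekend          = c(FALSE, TRUE, FALSE, FALSE),
  Revenue          = c(FALSE, FALSE, TRUE, FALSE),
  PageValues       = c(0, 0, 12.5, 0)
)

# Assumed set of columns to treat as categorical (hypothetical selection).
categorical_cols <- c("OperatingSystems", "Browser", "Region", "TrafficType",
                      "Month", "VisitorType", "Weekend", "Revenue")

# Convert each listed column to a factor; numeric columns such as PageValues stay numeric.
shoppers[categorical_cols] <- lapply(shoppers[categorical_cols], factor)

str(shoppers)
```

Integer-coded columns such as Browser or Region would otherwise be treated as ordered numeric quantities by most model-fitting functions, which is rarely what is intended.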
It's important that, when doing a 70-30 split on the data, the training and testing sets are balanced.
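One reading of "balance" is that both sets should keep the same proportion of purchasing and non-purchasing sessions. The sketch below does a stratified 70-30 split in base R under that assumption; the data is synthetic, and the seed and class proportions are arbitrary choices, not values from the project.

```r
set.seed(503)  # arbitrary seed, only for reproducibility of this sketch

# Synthetic stand-in data: 850 non-purchases and 150 purchases (made-up proportions).
n <- 1000
shoppers <- data.frame(
  PageValues = rexp(n),
  Revenue    = factor(rep(c(FALSE, TRUE), times = c(850, 150)))
)

# Within each Revenue class, sample 70% of the row indices for training.
train_idx <- unlist(lapply(split(seq_len(n), shoppers$Revenue), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train <- shoppers[train_idx, ]
test  <- shoppers[-train_idx, ]

# Class proportions should match across the two sets.
prop.table(table(train$Revenue))
prop.table(table(test$Revenue))
```

Sampling within each class, rather than over all rows at once, guarantees that the minority class is represented at the same rate in both sets.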
- Logistic Regression
- Reduced Logistic Regression
- Naive Bayes
- LDA
- SVM
- Random Forest
- Adaboost
| Model | Training Error | Testing Error |
|---|---|---|
| Naive Bayes | 0.1887601 | 0.1837838 |
| Logistic Regression | 0.1165701 | 0.1132739 |
| KNN | 0.1093859 | 0.1091892 |
| SVM | 0.0936269 | 0.1027027 |
| Random Forest | 0.0000000 | 0.0918919 |
| Adaboost | 0.0085747 | 0.1035135 |
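The errors above are presumably misclassification rates, that is, the fraction of sessions whose predicted Revenue differs from the true value. A minimal sketch of that calculation follows; the helper name `misclass_error` and the example labels are hypothetical and not taken from the project code.

```r
# Misclassification error: fraction of predictions that differ from the truth.
# (Hypothetical helper, not part of the project code.)
misclass_error <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(actual != predicted)
}

# Made-up labels purely for illustration.
actual    <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

table(actual, predicted)           # confusion matrix
misclass_error(actual, predicted)  # 2 of 5 wrong, i.e. 0.4
```

Applying the same function to predictions on the training set and on the held-out set gives the two columns of the table.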
PageValues is an incredibly important feature in this dataset for predicting Revenue. The biggest problem is that PageValues is computed by dividing the revenue of a shopping instance by the number of views a page gets. This inherently encodes revenue into the predictor and does not really reflect the volitional behavior of the shopper. In fact, PageValues alone does a good job of classifying Revenue, achieving about 10% error on the testing set.
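To illustrate what a single-feature check like this could look like, the sketch below fits a logistic regression on PageValues alone and reports its test error. The data is synthetic and built on the assumption that purchasing sessions tend to have nonzero PageValues, so the resulting error will not match the roughly 10% reported above; the classifier actually used for that figure is described in models.md.

```r
set.seed(503)  # arbitrary seed

# Synthetic sessions (assumption): purchases usually have positive PageValues,
# non-purchases are mostly zero with occasional positive values.
n <- 2000
revenue <- rbinom(n, 1, 0.15)
page_values <- ifelse(revenue == 1,
                      rexp(n, rate = 1 / 25),
                      rbinom(n, 1, 0.1) * rexp(n, rate = 1 / 10))
sessions <- data.frame(PageValues = page_values,
                       Revenue    = factor(revenue == 1))

# Simple 70-30 split.
train_rows <- sample.int(n, size = round(0.7 * n))
train <- sessions[train_rows, ]
test  <- sessions[-train_rows, ]

# Logistic regression with PageValues as the only predictor.
fit  <- glm(Revenue ~ PageValues, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Classify as a purchase when the predicted probability exceeds 0.5.
pred <- prob > 0.5
test_error <- mean(pred != (test$Revenue == "TRUE"))
test_error
```

If one feature alone approaches the error of the full models, that is a sign the feature may be leaking the outcome rather than describing shopper behavior.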