We analyze the Wine Quality Dataset [1]. The main challenges are:
- multicollinearity of predictors,
- no obvious linear relationship between predictors and target,
- imbalanced multiclass target.
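As a rough illustration of how two of these challenges can be checked, the sketch below computes a Pearson correlation between two predictors (a first hint of multicollinearity) and the ratio between the largest and smallest class counts (a simple measure of imbalance). The feature names and all values are invented for the example and are not taken from the dataset; the repository's own analysis may use different diagnostics.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def imbalance_ratio(labels):
    """Count of the most frequent class divided by the count of the rarest one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy values, invented for illustration only (hypothetical features and labels).
feature_a = [0.990, 0.992, 0.994, 0.996, 0.998]
feature_b = [1.0, 4.0, 8.0, 12.0, 16.0]
quality = [5, 6, 6, 6, 6, 6, 7, 7, 3, 9]

r = pearson(feature_a, feature_b)
ratio = imbalance_ratio(quality)
print(f"correlation = {r:.3f}, imbalance ratio = {ratio:.1f}")
```

A correlation close to 1 or -1 between two predictors suggests they carry largely redundant information, and a large imbalance ratio suggests that plain accuracy will be a misleading metric.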
- Create a new conda environment.
- Install the dependencies:
pip install -r requirements.txt
- Install nb_conda to use your conda env in Jupyter Notebook.
Use the Jupyter notebooks. Grid search runs in a separate Python script for computation speed. Helper functions for evaluation in the notebooks are provided in evaluate.py and functional.py.
Predict wine quality for white wines as is (from quality 3 to quality 9), treating the target as either continuous or categorical. Best classifiers:
- Random Forest
- XGBoost with all features.
ROC and PR curves for multiclass
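ROC analysis is defined for binary problems, so a common way to extend it to several quality classes is one-vs-rest: each class in turn is treated as the positive class. The sketch below computes a per-class ROC AUC and their macro average in plain Python. It is an illustration under assumptions, not the code behind the curves here: the function names, class list, and scores are hypothetical, and only the ROC side (not the PR curves) is covered.

```python
def one_vs_rest(labels, positive_class):
    """Binary view of a multiclass target: 1 for positive_class, 0 otherwise."""
    return [1 if y == positive_class else 0 for y in labels]


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_auc_per_class(labels, proba, classes):
    """proba[i][k] is the predicted score of sample i for classes[k]."""
    return {
        c: roc_auc(one_vs_rest(labels, c), [row[k] for row in proba])
        for k, c in enumerate(classes)
    }


# Toy labels and scores, invented for illustration only.
classes = [5, 6, 7]
labels = [5, 5, 6, 6, 7, 7]
proba = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.4, 0.4, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

per_class = ovr_auc_per_class(labels, proba, classes)
macro_auc = sum(per_class.values()) / len(per_class)
print(per_class, round(macro_auc, 4))
```

With an imbalanced target, the macro average weights every class equally, so rare quality levels influence the summary as much as common ones.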
Predict wine quality as a binary target. We use the threshold t=7 to binarize the wine quality. Best classifiers:
- Random Forest
- XGBoost used with SMOTE resampling technique.
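The binarization step can be sketched as follows. The text does not say on which side of the threshold a score equal to t falls, so the example assumes that quality >= t is the positive class; the function name `binarize_quality` is hypothetical.

```python
def binarize_quality(qualities, threshold):
    """Map each score to 1 if it is >= threshold, else 0.

    The >= convention is an assumption; the source only states the threshold.
    """
    return [1 if q >= threshold else 0 for q in qualities]


# Toy scores covering the 3 to 9 quality range.
scores = [3, 5, 6, 7, 8, 9]

labels_t7 = binarize_quality(scores, 7)
labels_t8 = binarize_quality(scores, 8)
print(labels_t7, labels_t8)
```

Raising the threshold from 7 to 8 shrinks the positive class, which makes the binary problem more imbalanced.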
Macro F1 score results for different models and datasets
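Macro F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. A minimal plain-Python sketch is shown below, with invented labels. It is meant to illustrate the metric, not to reproduce the repository's evaluation code in evaluate.py.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of a single class, treating cls as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy imbalanced example: 8 negatives, 2 positives (invented values).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro F1 = {score:.4f}")
```

In this toy case accuracy looks high because most samples are negatives, while macro F1 is pulled down by the weak score on the positive class. This is why macro F1 is a reasonable choice for an imbalanced target.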
Predict wine quality as a binary target. We use the threshold t=8 to binarize the wine quality. Best classifiers:
- SVM
- MLP used with SMOTE resampling technique.
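SMOTE balances a dataset by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point at a random position on the segment between them. The sketch below is a simplified plain-Python version of that idea, written only to show the mechanism. It is not the implementation used in this project, and the function name `smote_sketch`, its parameters, and the toy points are hypothetical.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (invented values).
minority_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
new_points = smote_sketch(minority_points, n_new=4)
print(new_points)
```

Resampling of this kind is normally applied to the training split only, so that the validation and test sets keep the original class distribution.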
Macro F1 score results for different models and datasets
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.