
forest-research

Methods for increasing generalization ability based on different ways of building ensembles
For more details, you can read the thesis here or open the files Thesis.pdf and Presentation.pdf

Abstract

The aim of this project is the development and study of a new ensemble method based on decision trees that are maximally distant from each other. Below, the method presented in this project is compared with two well-known ensemble models: Random Forest and Adaptive Boosting.

Error decompositions

Two factors influence ensemble quality: the quality of each of the ensemble's estimators, and the "difference" (diversity) between those estimators. The correctness of this statement is demonstrated by several error decompositions, which can be found in [1].
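One classical instance is the ambiguity decomposition of Krogh and Vedelsby, covered in [1]: for a convex combination of regressors, the squared error of the ensemble equals the weighted average error of its members minus their weighted spread around the ensemble, so the ensemble is strictly better than its average member whenever the members disagree.

```latex
% Ambiguity decomposition for an ensemble \bar{f}(x) = \sum_i w_i f_i(x),
% with weights w_i >= 0, \sum_i w_i = 1:
\bigl(\bar{f}(x) - y(x)\bigr)^2
  = \underbrace{\sum_i w_i \bigl(f_i(x) - y(x)\bigr)^2}_{\text{average member error}}
  - \underbrace{\sum_i w_i \bigl(f_i(x) - \bar{f}(x)\bigr)^2}_{\text{ambiguity (diversity)}}
```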

How the method works

1. $y(x)$ is the true label of object $x$.

2. $K$ is the number of classes.

3. $Node$ is the set of objects placed in the current node, for which a feature and a threshold are being selected.

4. $b_M$ is the tree built at step $M$.

5. $Leaf(x)$ is the set of objects placed in the same leaf node as object $x$.

6. $a_M$ is the ensemble built at step $M$ from the trees $b_1, \dots, b_M$.

7. $\alpha$ is a coefficient controlling the influence of the previously built trees.

Below is the general formula for building a decision tree, a greedy search over candidate splits $s$ of $Node$:

$$s^{*} = \arg\min_{s} \left[ \frac{|L_s|}{|Node|}\, H(L_s) + \frac{|R_s|}{|Node|}\, H(R_s) \right]$$

where $s$ ranges over (feature, threshold) pairs splitting $Node$ into child nodes $L_s$ and $R_s$, and $H$ is the impurity functional.
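A minimal NumPy sketch of this greedy search (the names `entropy` and `best_split` are illustrative, not the project's C++ implementation; here $H$ is plain Shannon entropy):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of the empirical class distribution in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X: np.ndarray, y: np.ndarray):
    """Exhaustively search for the (feature, threshold) pair minimizing
    the size-weighted impurity of the two child nodes."""
    n, n_features = X.shape
    best_j, best_t, best_score = None, None, np.inf
    for j in range(n_features):
        for t in np.unique(X[:, j])[:-1]:  # the max value would leave an empty child
            mask = X[:, j] <= t
            score = (mask.sum() * entropy(y[mask])
                     + (~mask).sum() * entropy(y[~mask])) / n
            if score < best_score:
                best_j, best_t, best_score = j, float(t), score
    return best_j, best_t, best_score
```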

Below is the formula that defines $H(s)$ specifically for the method considered in this project:

(the formula is given in Thesis.pdf)

The general idea is to build diverse trees by using the ensemble built at the previous step: each new tree is chosen to maximize the entropy of that ensemble's predictions while minimizing the entropy of the true labels.
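A sketch of one plausible reading of this idea, assuming $H(s)$ is the entropy of the true labels in a child node minus $\alpha$ times the entropy of the previous ensemble's predictions on the same node (the exact formula is in Thesis.pdf; `h_split` and its arguments are hypothetical names, and `entropy` is as in the previous sketch):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def h_split(y_true: np.ndarray, y_prev: np.ndarray, alpha: float) -> float:
    """Hypothetical node criterion: prefer child nodes whose true labels
    are pure (low entropy) but on which the previous ensemble a_{M-1}
    spreads its votes y_prev across classes (high entropy)."""
    return entropy(y_true) - alpha * entropy(y_prev)
```

Plugged into the greedy search above in place of plain entropy, such a criterion would steer each new tree toward splits that are informative about the labels yet different from what the already-built trees agree on.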

Experiments

In the experiments below, the method implemented in this project is compared with Random Forest [2] and Adaptive Boosting [3], as well as with combinations of different pairs of these methods. All datasets can be found in the UCI Machine Learning Repository [4]. Each step of an experiment (the x axis) is the creation of one new tree for each algorithm involved in the comparison.
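For the baselines, this per-step comparison can be approximated with scikit-learn, assuming [2] and [3] refer to its RandomForestClassifier and AdaBoostClassifier (the project's own method lives in the C++ code, so only the baselines' side is sketched here):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

def staged_accuracies(X_tr, y_tr, X_te, y_te, n_steps=100, seed=0):
    """Test accuracy after each step, one new tree per step per model."""
    # Random Forest: grow the forest one tree at a time via warm_start.
    rf = RandomForestClassifier(n_estimators=0, warm_start=True, random_state=seed)
    rf_acc = []
    for m in range(1, n_steps + 1):
        rf.n_estimators = m
        rf.fit(X_tr, y_tr)
        rf_acc.append(accuracy_score(y_te, rf.predict(X_te)))
    # AdaBoost: staged_predict yields predictions after each boosting round.
    ada = AdaBoostClassifier(n_estimators=n_steps, random_state=seed).fit(X_tr, y_tr)
    ada_acc = [accuracy_score(y_te, p) for p in ada.staged_predict(X_te)]
    return np.array(rf_acc), np.array(ada_acc)
```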

Datasets

Each dataset was randomly split into 2 equal parts 5 times. For each split, one part was used for training and the other for testing, and then vice versa. The resulting 10 quality measures were averaged. The result at each step of each algorithm is shown in the figures below.
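This is the classical 5x2 protocol; a minimal sketch of reproducing it with scikit-learn (`model_factory` is an illustrative callable returning a fresh, unfitted classifier):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

def five_by_two_score(model_factory, X, y, seed=0):
    """5 random 50/50 splits; each half is used once for training and
    once for testing, and the 10 resulting scores are averaged."""
    rkf = RepeatedKFold(n_splits=2, n_repeats=5, random_state=seed)
    scores = []
    for train_idx, test_idx in rkf.split(X):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```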

| Classification task | Objects | Features | Classes |
| --- | --- | --- | --- |
| Optical Recognition of Handwritten Digits Data Set | 5620 | 64 | 10 |
| Credit scoring | 1000 | 24 | 2 |
| Glass Identification Data Set | 214 | 9 | 6 |
| Connectionist Bench (Sonar, Mines vs. Rocks) Data Set | 208 | 60 | 2 |
| Vehicle silhouettes | 846 | 18 | 4 |

Optical Recognition of Handwritten Digits Data Set

Accuracy

(plots: accuracy vs. number of trees)

Credit scoring

Accuracy

(plots: accuracy vs. number of trees)

ROC-AUC

(plots: ROC-AUC vs. number of trees)

Glass Identification Data Set

Accuracy

(plots: accuracy vs. number of trees)

Connectionist Bench (Sonar, Mines vs. Rocks) Data Set

Accuracy

(plots: accuracy vs. number of trees)

ROC-AUC

(plots: ROC-AUC vs. number of trees)

Vehicle silhouettes

Accuracy

(plots: accuracy vs. number of trees)

Literature references

[1] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.
[2] Random Forest Classifier.
[3] AdaBoost Classifier.
[4] UCI Machine Learning Repository.
