
forest-research

Methods for increasing generalization ability based on different ways of building ensembles
For more details, you can read the thesis here or open the files Thesis.pdf and Presentation.pdf

Abstract

The aim of this project is the development and study of a new ensemble method based on decision trees that are maximally distant from each other. Below, the method presented in this project is compared with two well-known ensemble models: Random Forest and Adaptive Boosting.

Error decompositions

Two factors influence ensemble quality: the quality of each of the ensemble's estimators, and the "difference" (diversity) between those estimators. The correctness of this statement is demonstrated by several error decompositions, which can be found in [1].
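One classical instance is the ambiguity decomposition of Krogh and Vedelsby, covered in [1]: for a convex combination of regressors, the squared error of the ensemble equals the weighted average error of its members minus their weighted spread around the ensemble, so the ensemble is strictly better than its average member whenever the members disagree.

```latex
% Ambiguity decomposition for an ensemble \bar{f}(x) = \sum_i w_i f_i(x),
% with weights w_i >= 0, \sum_i w_i = 1:
\bigl(\bar{f}(x) - y(x)\bigr)^2
  = \underbrace{\sum_i w_i \bigl(f_i(x) - y(x)\bigr)^2}_{\text{average member error}}
  - \underbrace{\sum_i w_i \bigl(f_i(x) - \bar{f}(x)\bigr)^2}_{\text{ambiguity (diversity)}}
```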

How the method works

1. $y(x)$ is the true label of object $x$.

2. $K$ is the number of classes.

3. $Node$ is the set of objects placed in the current node, for which a feature and a threshold are being selected.

4. $b_M$ is the tree built at step $M$.

5. $Leaf(x)$ is the set of objects placed in the same leaf node as object $x$.

6. $a_M$ is the ensemble built at step $M$ from the trees $b_1, \dots, b_M$.

7. $\alpha$ is a coefficient controlling the influence of the previously built trees.

Below is the general formula for building a decision tree, a greedy search over candidate splits $s$ of $Node$:

$$s^{*} = \arg\min_{s} \left[ \frac{|L_s|}{|Node|}\, H(L_s) + \frac{|R_s|}{|Node|}\, H(R_s) \right]$$

where $s$ ranges over (feature, threshold) pairs splitting $Node$ into child nodes $L_s$ and $R_s$, and $H$ is the impurity functional.
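A minimal NumPy sketch of this greedy search (the names `entropy` and `best_split` are illustrative, not the project's C++ implementation; here $H$ is plain Shannon entropy):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of the empirical class distribution in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X: np.ndarray, y: np.ndarray):
    """Exhaustively search for the (feature, threshold) pair minimizing
    the size-weighted impurity of the two child nodes."""
    n, n_features = X.shape
    best_j, best_t, best_score = None, None, np.inf
    for j in range(n_features):
        for t in np.unique(X[:, j])[:-1]:  # the max value would leave an empty child
            mask = X[:, j] <= t
            score = (mask.sum() * entropy(y[mask])
                     + (~mask).sum() * entropy(y[~mask])) / n
            if score < best_score:
                best_j, best_t, best_score = j, float(t), score
    return best_j, best_t, best_score
```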

Below is the formula that defines $H(s)$ specifically for the method considered in this project:

(the formula is given in Thesis.pdf)

The general idea is to build diverse trees by using the ensemble built at the previous step: each new tree is chosen to maximize the entropy of that ensemble's predictions while minimizing the entropy of the true labels.
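A sketch of one plausible reading of this idea, assuming $H(s)$ is the entropy of the true labels in a child node minus $\alpha$ times the entropy of the previous ensemble's predictions on the same node (the exact formula is in Thesis.pdf; `h_split` and its arguments are hypothetical names, and `entropy` is as in the previous sketch):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def h_split(y_true: np.ndarray, y_prev: np.ndarray, alpha: float) -> float:
    """Hypothetical node criterion: prefer child nodes whose true labels
    are pure (low entropy) but on which the previous ensemble a_{M-1}
    spreads its votes y_prev across classes (high entropy)."""
    return entropy(y_true) - alpha * entropy(y_prev)
```

Plugged into the greedy search above in place of plain entropy, such a criterion would steer each new tree toward splits that are informative about the labels yet different from what the already-built trees agree on.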

Experiments

In the experiments below, the method implemented in this project is compared with Random Forest [2] and Adaptive Boosting [3], as well as with combinations of different pairs of these methods. All datasets can be found in the UCI Machine Learning Repository [4]. Each step of an experiment (the x axis) is the creation of one new tree for each algorithm involved in the comparison.
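For the baselines, this per-step comparison can be approximated with scikit-learn, assuming [2] and [3] refer to its RandomForestClassifier and AdaBoostClassifier (the project's own method lives in the C++ code, so only the baselines' side is sketched here):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

def staged_accuracies(X_tr, y_tr, X_te, y_te, n_steps=100, seed=0):
    """Test accuracy after each step, one new tree per step per model."""
    # Random Forest: grow the forest one tree at a time via warm_start.
    rf = RandomForestClassifier(n_estimators=0, warm_start=True, random_state=seed)
    rf_acc = []
    for m in range(1, n_steps + 1):
        rf.n_estimators = m
        rf.fit(X_tr, y_tr)
        rf_acc.append(accuracy_score(y_te, rf.predict(X_te)))
    # AdaBoost: staged_predict yields predictions after each boosting round.
    ada = AdaBoostClassifier(n_estimators=n_steps, random_state=seed).fit(X_tr, y_tr)
    ada_acc = [accuracy_score(y_te, p) for p in ada.staged_predict(X_te)]
    return np.array(rf_acc), np.array(ada_acc)
```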

Datasets

Each dataset was randomly split into 2 equal parts 5 times. For each split, one part was used for training and the other for testing, and then vice versa. The resulting 10 quality measures were averaged. The result at each step of each algorithm is shown in the figures below.
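This is the classical 5x2 protocol; a minimal sketch of reproducing it with scikit-learn (`model_factory` is an illustrative callable returning a fresh, unfitted classifier):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

def five_by_two_score(model_factory, X, y, seed=0):
    """5 random 50/50 splits; each half is used once for training and
    once for testing, and the 10 resulting scores are averaged."""
    rkf = RepeatedKFold(n_splits=2, n_repeats=5, random_state=seed)
    scores = []
    for train_idx, test_idx in rkf.split(X):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```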

| Classification task | Objects | Features | Classes |
| --- | --- | --- | --- |
| Optical Recognition of Handwritten Digits Data Set | 5620 | 64 | 10 |
| Credit scoring | 1000 | 24 | 2 |
| Glass Identification Data Set | 214 | 9 | 6 |
| Connectionist Bench (Sonar, Mines vs. Rocks) Data Set | 208 | 60 | 2 |
| Vehicle silhouettes | 846 | 18 | 4 |

Optical Recognition of Handwritten Digits Data Set

Accuracy

(plots: accuracy vs. number of trees)

Credit scoring

Accuracy

(plots: accuracy vs. number of trees)

ROC-AUC

(plots: ROC-AUC vs. number of trees)

Glass Identification Data Set

Accuracy

(plots: accuracy vs. number of trees)

Connectionist Bench (Sonar, Mines vs. Rocks) Data Set

Accuracy

(plots: accuracy vs. number of trees)

ROC-AUC

(plots: ROC-AUC vs. number of trees)

Vehicle silhouettes

Accuracy

(plots: accuracy vs. number of trees)

Literature references

[1] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.
[2] Random Forest Classifier.
[3] AdaBoost Classifier.
[4] UCI Machine Learning Repository.
