This repository collects implementations of popular decision-tree pruning algorithms (continuously updated):
REP: Reduced Error Pruning (finished, for ID3)
MEP: Minimum Error Pruning (finished, for C4.5)
PEP: Pessimistic Error Pruning (finished, for C4.5)
EBP: Error Based Pruning (finished, for C4.5)
CVP: Critical Value Pruning (finished, for C4.5)
CCP: Cost Complexity Pruning (finished, for CART)
ECP: Error Complexity Pruning (finished, for CART)
Tips:
The command to delete all .pyc files:
find . -name "*.pyc" | xargs rm -f
Environment
Ubuntu Linux 16.04-Amd64
Python 2.7.12
Note that:
① Except for ID3's REP, the algorithms MEP, EBP, PEP and CVP operate on the model generated from:
Quinlan-C4.5-Release8_and_python_interface_for_EBP/Src/quinlan-src/
which comes from
http://www.rulequest.com/Personal/c4.5r8.tar.gz
written by Ross Quinlan.
The model from Quinlan's implementation is a C model; it is transformed into a "Python-compliant" model afterwards.
② Datasets with unknown values are NOT supported, because different papers (and different people) handle unknown values in different ways.
③ The "C4.5-Release decision tree model" produced by Quinlan's implementation at the above link is a C model, not a Python model, so it must be transformed before running the pruning algorithms.
For further details about the transformation, please keep reading below.
④ If your dataset is very small, you will get a very small model, and you will NOT get a "Simplified Tree" (pruned model) from Quinlan's implementation. This means "pruned model" = "unpruned model"; when this happens, please copy the content of "result/unprune.txt" into "result/prune.txt".
----------------REP---Operation method(start)---------------------------------
For REP(Reduced Error Pruning) of ID3:
cd ID3-REP-post_prune-Python-draw
python jianzhi.py
Attention:
Continuous features and features with unknown values are both currently NOT supported.
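The core of REP can be sketched in a few lines of Python: prune bottom-up, replacing a subtree with its majority-class leaf whenever that does not increase the error on a held-out validation set. The tree layout and function names below are illustrative, not the identifiers actually used in jianzhi.py:

```python
def majority_class(rows):
    """Most frequent class label among the rows (label is the last column)."""
    labels = [r[-1] for r in rows]
    return max(set(labels), key=labels.count)

def classify(tree, row, features):
    """Walk the tree; internal nodes are dicts, leaves are plain labels."""
    while isinstance(tree, dict):
        value = row[features.index(tree["feature"])]
        tree = tree["children"].get(value)  # unseen value -> None (counts as error)
    return tree

def error(tree, rows, features):
    """Number of validation rows the (sub)tree misclassifies."""
    return sum(1 for r in rows if classify(tree, r, features) != r[-1])

def rep_prune(tree, rows, features):
    """Bottom-up REP: replace a subtree with its majority leaf whenever
    the validation error does not increase."""
    if not isinstance(tree, dict) or not rows:
        return tree
    idx = features.index(tree["feature"])
    for value in list(tree["children"]):
        subset = [r for r in rows if r[idx] == value]
        tree["children"][value] = rep_prune(tree["children"][value], subset, features)
    leaf = majority_class(rows)
    if error(leaf, rows, features) <= error(tree, rows, features):
        return leaf
    return tree
```

Because the decision is made purely on validation error, REP needs the training data to be split into a growing set and a pruning set beforehand.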
----------------REP---Operation method(end)---------------------------------
----------------EBP--Operation method(start)----------------------------------
Ross Quinlan has already implemented EBP in C, so the following is just a Python interface.
For EBP (Error Based Pruning), the procedure is:
For any dataset you choose, you need to prepare two files, for example:
crx.names
crx.data
and put them under the path:
Quinlan-C4.5-Release8_and_python_interface_for_EBP/Src/quinlan-src/
cd Quinlan-C4.5-Release8_and_python_interface_for_EBP/Src
python shell_execute.py crx > result.txt    (please change "crx" here when you use other datasets)
python result_get.py    (transforms the C model into a Python model)
python predict.py
This is a Python interface for the C code at:
http://www.rulequest.com/Personal/c4.5r8.tar.gz
It is used to obtain the unpruned model and the EBP-pruned model from Quinlan's implementation.
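For reference, EBP's pruning decision is usually described via an upper confidence limit on a leaf's true error rate: a subtree is replaced by a leaf when the leaf's estimated errors do not exceed the sum of the subtree leaves' estimates. The sketch below uses the normal-approximation bound commonly given in textbooks (z ≈ 0.674 corresponds to C4.5's default confidence factor CF = 0.25); Quinlan's C code handles small counts differently, so treat this only as an approximation:

```python
import math

def upper_error_bound(errors, n, z=0.674):
    """Textbook normal-approximation upper confidence limit on the true
    error rate of a leaf that misclassifies `errors` of its `n` training
    cases. z = 0.674 approximates the one-sided bound for CF = 0.25."""
    f = errors / float(n)  # observed error rate
    return (f + z * z / (2 * n)
            + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)
```

The bound is pessimistic for small n, which is what makes pruning of tiny, overfit subtrees attractive: a leaf with 0 errors out of 10 cases still gets a nonzero estimated error rate.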
----------------EBP--Operation method(end)---------------------------------
----------------PEP--Operation method(start)---------------------------------
For PEP (Pessimistic Error Pruning):
1. Download the dataset from:
https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
2. Sort it by the final column in ascending order, keep the first 200 rows, and save them as abalone_parts.data (this step just makes later visualization easier).
3.
cp abalone_parts.data abalone.names /decision_tree/Quinlan-C4.5-Release8_and_python_interface_for_EBP/Src/quinlan-src/
python shell_execute.py abalone > result.txt
python result_get.py    (transforms the C model into a Python model)
4. Get the model from the output of "python result_get.py",
paste it into top.py, then:
cp abalone_parts.data abalone.names decision_tree/PEP-finish/
5.
python top.py
If you do not want to change the dataset,
you can skip the first 4 steps and run step 5 only.
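For reference, PEP's pruning rule can be stated in a few lines: apply the 1/2 continuity correction to both the node's error (as if it were a leaf) and the subtree's error, and prune when the node's corrected error is within one standard error of the subtree's. The function below is a sketch of that rule (names are mine, not those in top.py):

```python
import math

def pep_should_prune(node_errors, subtree_errors, n_leaves, n_samples):
    """Pessimistic Error Pruning decision at one internal node.
    node_errors:    training errors if the node is collapsed to a leaf
    subtree_errors: total training errors over the subtree's leaves
    n_leaves:       number of leaves in the subtree
    n_samples:      training samples reaching the node"""
    e_node = node_errors + 0.5                # continuity-corrected leaf error
    e_sub = subtree_errors + n_leaves / 2.0   # corrected subtree error
    se = math.sqrt(e_sub * (n_samples - e_sub) / n_samples)  # its standard error
    return e_node <= e_sub + se
```

Unlike REP, this needs no separate pruning set: everything is estimated from the training data, which is why PEP works top-down in a single pass.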
----------------PEP--Operation method(end)---------------------------------
----------------MEP--Operation method(start)---------------------------------
1. cp abalone_parts.data abalone.names /decision_tree/Quinlan-C4.5-Release8_and_python_interface_for_EBP/Src/quinlan-src/
python shell_execute.py abalone > result.txt
python result_get.py    (transforms the C model into a Python model)
2. Get the model from the output of "python result_get.py"
and paste it into MEP_topmodule_down_to_top.py
3. python MEP_topmodule_down_to_top.py
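For reference, the quantity MEP minimizes is the expected-error estimate of Niblett and Bratko: E = (n − n_c + k − 1) / (n + k), where n is the number of examples at the node, n_c the count of its majority class, and k the number of classes. A minimal sketch (variable names are mine, not those in MEP_topmodule_down_to_top.py):

```python
def mep_expected_error(n, n_majority, k):
    """Niblett-Bratko expected error of a node labeled with its majority
    class, under the Laplace-style prior over k classes:
        E = (n - n_c + k - 1) / (n + k)"""
    return (n - n_majority + k - 1.0) / (n + k)
```

A node is pruned when its own estimate is no larger than the weighted average of its children's estimates, working bottom-up.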
----------------MEP--Operation method(end)---------------------------------
----------------CVP--Operation method(start)---------------------------------
1. cp abalone_parts.data abalone.names /decision_tree/Quinlan-C4.5-Release8_and_python_interface_for_EBP/Src/quinlan-src/
python shell_execute.py abalone > result.txt
python result_get.py    (transforms the C model into a Python model)
2. Get the model from the output of "python result_get.py"
and paste it into CVP_top.py
3. Search for "critical_value" in CVP_top.py and change it to the value you want.
4. python CVP_top.py
Of course, you can skip the first 3 steps if you just want to see its performance on the default datasets (abalone_parts and credit-a).
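The idea behind CVP can be sketched as follows: working bottom-up, prune an internal node when the value of its selection criterion (e.g. information gain) falls below the chosen critical value, but keep any node that still dominates a surviving subtree. The dict layout ({"gain", "majority", "children"}) is illustrative, not the structure used in CVP_top.py:

```python
def cvp_prune(tree, critical_value):
    """Critical Value Pruning sketch. Internal nodes are dicts with the
    split criterion stored under "gain"; leaves are plain class labels.
    A node is collapsed to its majority class only if its gain is below
    the critical value AND all of its children have become leaves."""
    if not isinstance(tree, dict):
        return tree
    tree["children"] = {v: cvp_prune(c, critical_value)
                        for v, c in tree["children"].items()}
    all_leaves = all(not isinstance(c, dict) for c in tree["children"].values())
    if tree["gain"] < critical_value and all_leaves:
        return tree["majority"]
    return tree
```

Raising the critical value prunes more aggressively; in the original method one would grow a sequence of trees over increasing critical values and pick the best on held-out data.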
----------------CVP--Operation method(end)---------------------------------
----------------CCP--Operation method(some relevant information-start)---------------------------------
Attention: So far there are 3 implementations of CCP on CART by others, but all of them have defects.
https://github.com/Rudo-erek/decision-tree/tree/master/data
It does NOT work, and it can NOT deal with continuous attributes. Its computation of R(t) is wrong: it uses Gini to compute R(t).
https://github.com/Jasper-Dong/Decision-Tree
It works, but it can NOT deal with continuous attributes.
https://triangleinequality.wordpress.com/2013/09/01/decision-trees-part-3-pruning-your-tree/
It works, but the author modified the original CCP, which results in no candidate trees to select from,
and therefore no cross-validation to select the best pruned tree:
he sets a fixed alpha before running this "modified CCP".
All three of the above implementations are stored and annotated in the folder "several_wrong_implementions_CCP".
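For reference, the weakest-link step of the original CCP computes, for every internal node, the alpha at which collapsing that node becomes worthwhile. Crucially, R(t) is the resubstitution misclassification cost (misclassified counts divided by the training-set size), not Gini, which is exactly the mistake noted above. A minimal sketch:

```python
def ccp_alpha(node_misclassified, subtree_misclassified, n_subtree_leaves, n_total):
    """Cost-complexity alpha of one internal node (Breiman et al.):
        alpha = (R(t) - R(T_t)) / (|leaves(T_t)| - 1)
    R(t):   misclassification cost of the node taken as a leaf
    R(T_t): misclassification cost of the subtree rooted at the node
    Both are misclassified-sample counts divided by the total training size."""
    r_node = node_misclassified / float(n_total)
    r_subtree = subtree_misclassified / float(n_total)
    return (r_node - r_subtree) / (n_subtree_leaves - 1)
```

Repeatedly pruning the node with the smallest alpha yields the nested sequence of candidate trees that cross-validation then chooses from; a fixed alpha short-circuits that sequence, which is the third link's defect.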
Note1:
This repository's CCP implementation works on sklearn's CART classification model.
Here is the relevant issue on scikit-learn's official GitHub:
https://github.com/scikit-learn/scikit-learn/issues/6557
They are trying to prune sklearn's CART model with Cython (a faster Python with C), which is still unavailable. My CCP implementation on sklearn's CART model is pure Python.
Note2:
CCP can also be applied to C4.5 (not supported in this repository); the article applying CCP to C4.5 is "Simplifying Decision Trees", Section 2.1, by J.R. Quinlan.
----------------CCP--Operation method(some relevant information-end)---------------------------------
----------------CCP--Operation method(start)---------------------------------
Attention again: datasets with unknown values are NOT supported.
The main flow of implementing CCP on sklearn's model is as follows:
1. Transform the sklearn model into a JSON model.
(I'm not a contributor to sklearn, so the sklearn model can NOT be pruned directly; it needs this transformation.)
2. Perform CCP on the JSON model.
3. Get the best JSON model from the tree sets produced by CCP, and synchronize the original sklearn model with this best JSON model.
(We only synchronize the "tree shape" between the sklearn model and the JSON-style model, which is very helpful for drawing the CCP-pruned JSON model with graphviz.)
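Step 1 above can be sketched with sklearn's public tree_ arrays, where children_left == -1 marks a leaf. The dict keys below are illustrative, not the schema actually used by sklearn_CCP_top.py:

```python
from sklearn.tree import DecisionTreeClassifier

def sklearn_tree_to_dict(model, node=0):
    """Walk sklearn's internal flat arrays and emit a plain nested dict
    ("JSON model") that a pure-Python pruner can freely edit."""
    t = model.tree_
    if t.children_left[node] == -1:            # -1 means "no child" (a leaf)
        return {"leaf": True,
                "value": t.value[node][0].tolist()}  # per-class statistics
    return {"leaf": False,
            "feature": int(t.feature[node]),
            "threshold": float(t.threshold[node]),
            "left": sklearn_tree_to_dict(model, t.children_left[node]),
            "right": sklearn_tree_to_dict(model, t.children_right[node])}
```

Going back the other way (step 3) cannot rebuild the sklearn arrays, which is why only the tree shape is synchronized, as noted above.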
The steps to run CCP on sklearn's model are as follows:
1. Delete all the files in the folder "visualization".
2. Make sure your dataset has no unknown values; otherwise you must pre-process it first.
If you don't pre-process the unknown values, strange errors will occur.
3. cd decision_tree/sklearn_cart_CCP/sklearnCART2json_CCP/
and change "name_path" and "data_path" in sklearn_CCP_top.py
4. python sklearn_CCP_top.py
This produces the best CCP JSON model and the CCP-pruned precision.
5. Enjoy the pictures of all the tree sets and the "final best tree" in the folder "visualization".
Note:
1.
Don't set "max_depth" too large,
or you'll have to wait a very long time;
if you insist on running with a very large "max_depth", graphviz may NOT be able to draw the pictures in the folder "visualization".
2.
UCI datasets that have been tested:
credit-a
abalone
----------------CCP--Operation method(end)---------------------------------
-------------------ECP-Operation method(start)-----------------------
1. Change your dataset path in the file sklearn_ECP_TOP.py.
2. Set b_SE=True in sklearn_ECP_TOP.py if you want the SE rule to select the best pruned tree.
3. Run python sklearn_ECP_TOP.py in the path decision_tree/sklearn_cart-regression_ECP-finish/.
4. Enjoy the results in the folder "visualization".
UCI datasets that have been tested:
housing(boston)
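The b_SE=True option follows the 1-SE rule from the CART book: rather than the candidate tree with the minimum cross-validated error, pick the smallest tree whose error is within one standard error of that minimum. A sketch, where each candidate is a (n_leaves, per-fold errors) pair (this layout is illustrative, not sklearn_ECP_TOP.py's actual data structures):

```python
import math

def one_se_select(candidates):
    """candidates: list of (n_leaves, fold_errors) pairs, one per pruned
    candidate tree; fold_errors are that tree's cross-validation errors.
    Returns the smallest tree whose mean error is within one standard
    error of the best mean error."""
    def mean(xs):
        return sum(xs) / len(xs)
    def std_err(xs):  # sample standard deviation divided by sqrt(k folds)
        m = mean(xs)
        var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return math.sqrt(var) / math.sqrt(len(xs))
    best = min(candidates, key=lambda c: mean(c[1]))
    threshold = mean(best[1]) + std_err(best[1])
    eligible = [c for c in candidates if mean(c[1]) <= threshold]
    return min(eligible, key=lambda c: c[0])  # fewest leaves among eligible
```

The rule trades a statistically insignificant amount of accuracy for a markedly smaller, more interpretable tree, which is usually what one wants from pruning a regression CART.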
-------------------ECP-Operation method(end)-----------------------
----------------------------------------
You may also be interested in the inventors and history of these pruning algorithms, and you may want to compare unpruned and pruned performance.
I have collected this information at the following link:
https://blog.csdn.net/appleyuchi/article/details/83692381
Please don't hesitate to contact appleyuchi@foxmail.com if you have any questions.