azmfaridee / mothur

This is GSoC2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mother's main codebase.

Home Page:https://github.com/mothur/mothur

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Week 7: Finish the Implementation of Random Forest Algorithm

azmfaridee opened this issue · comments

Related Issues: #3, #14, #15

As the per the initial proposal, code the remaining part of the Random Forest algorithm and finish the implementations. Most of this weeks task week be wrapping up the works that was done in Week 5 and Week 6. Since we are now able to create decision trees from the bootstrapped data, it's now time to aggregate the results and build the ensemble classifier feature selector. Specific functions of interests are:

calculateTreeErrorRate()
calculateAttributeImportance()
randomlyShuffleAttribute()
updateOutOfBagEstimate()
calculateForestWideAttributeImportance()
calculateForestWideErrorRate()

End of Week Deliverable

  • A finished code for the feature selection method that can work with microbial ecology data.

Right now the most pressing issue is the handing of the sparse data that we have, a lot of features are zero filled, while others are just same data repeated over. This features need to be avoided, but this might also hurt the decision tree. I'm currently investigating ways to find the tradeoff between them

Here are two sample tree generated for inpatient.final.an.0.03.subsample.avg.matrix file. The file contains 187 training samples and 845 features (OTUs).

Tree 1

root [ gen: 0 ] ( 536 < X2 )
     leftChild [ gen: 1 ] ( 42 < X22 )
          leftChild [ gen: 2 ] ( 9 < X31 )
               leftChild [ gen: 3 ] ( 2 < X36 )
                    leftChild [ gen: 4 ] ( 8 < X33 )
                         leftChild [ gen: 5 ] ( 3 < X37 )
                              leftChild [ gen: 6 ] ( 1 < X2 )
                                   leftChild [ gen: 7 ] ( classified to: 0, samples: 13 )
                                   rightChild [ gen: 7 ] ( 131 < X7 )
                                        leftChild [ gen: 8 ] ( 2 < X24 )
                                             leftChild [ gen: 9 ] ( 1 < X3 )
                                                  leftChild [ gen: 10 ] ( classified to: 0, samples: 2 )
                                                  rightChild [ gen: 10 ] ( 1056 < X1 )
                                                       leftChild [ gen: 11 ] ( classified to: 1, samples: 18 )
                                                       rightChild [ gen: 11 ] ( classified to: 0, samples: 1 )
                                             rightChild [ gen: 9 ] ( 302 < X1 )
                                                  leftChild [ gen: 10 ] ( classified to: 1, samples: 3 )
                                                  rightChild [ gen: 10 ] ( classified to: 0, samples: 6 )
                                        rightChild [ gen: 8 ] ( classified to: 0, samples: 6 )
                              rightChild [ gen: 6 ] ( classified to: 1, samples: 10 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 14 )
                    rightChild [ gen: 4 ] ( 61 < X27 )
                         leftChild [ gen: 5 ] ( 997 < X7 )
                              leftChild [ gen: 6 ] ( classified to: 0, samples: 25 )
                              rightChild [ gen: 6 ] ( classified to: 1, samples: 1 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 3 )
               rightChild [ gen: 3 ] ( 27 < X31 )
                    leftChild [ gen: 4 ] ( classified to: 1, samples: 21 )
                    rightChild [ gen: 4 ] ( 9 < X22 )
                         leftChild [ gen: 5 ] ( 52 < X34 )
                              leftChild [ gen: 6 ] ( classified to: 1, samples: 8 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 2 )
                         rightChild [ gen: 5 ] ( classified to: 0, samples: 3 )
          rightChild [ gen: 2 ] ( 1 < X21 )
               leftChild [ gen: 3 ] ( classified to: 1, samples: 4 )
               rightChild [ gen: 3 ] ( classified to: 0, samples: 29 )
     rightChild [ gen: 1 ] ( classified to: 0, samples: 18 )

Tree 2

root [ gen: 0 ] ( 492 < X2 )
     leftChild [ gen: 1 ] ( 1 < X22 )
          leftChild [ gen: 2 ] ( 10 < X3 )
               leftChild [ gen: 3 ] ( 249 < X2 )
                    leftChild [ gen: 4 ] ( 43 < X27 )
                         leftChild [ gen: 5 ] ( classified to: 1, samples: 31 )
                         rightChild [ gen: 5 ] ( 676 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 1, samples: 4 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 1 )
                    rightChild [ gen: 4 ] ( 286 < X2 )
                         leftChild [ gen: 5 ] ( classified to: 0, samples: 2 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 4 )
               rightChild [ gen: 3 ] ( 2 < X2 )
                    leftChild [ gen: 4 ] ( classified to: 0, samples: 9 )
                    rightChild [ gen: 4 ] ( 2 < X33 )
                         leftChild [ gen: 5 ] ( 583 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 0, samples: 4 )
                              rightChild [ gen: 6 ] ( 852 < X1 )
                                   leftChild [ gen: 7 ] ( classified to: 1, samples: 4 )
                                   rightChild [ gen: 7 ] ( classified to: 0, samples: 1 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 10 )
          rightChild [ gen: 2 ] ( 223 < X7 )
               leftChild [ gen: 3 ] ( 13 < X33 )
                    leftChild [ gen: 4 ] ( 1 < X7 )
                         leftChild [ gen: 5 ] ( 797 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 1, samples: 7 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 1 )
                         rightChild [ gen: 5 ] ( 23 < X22 )
                              leftChild [ gen: 6 ] ( 3 < X33 )
                                   leftChild [ gen: 7 ] ( 1 < X24 )
                                        leftChild [ gen: 8 ] ( classified to: 0, samples: 6 )
                                        rightChild [ gen: 8 ] ( 105 < X24 )
                                             leftChild [ gen: 9 ] ( 117 < X3 )
                                                  leftChild [ gen: 10 ] ( classified to: 1, samples: 8 )
                                                  rightChild [ gen: 10 ] ( classified to: 0, samples: 1 )
                                             rightChild [ gen: 9 ] ( classified to: 0, samples: 4 )
                                   rightChild [ gen: 7 ] ( 657 < X1 )
                                        leftChild [ gen: 8 ] ( classified to: 0, samples: 11 )
                                        rightChild [ gen: 8 ] ( classified to: 1, samples: 1 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 31 )
                    rightChild [ gen: 4 ] ( 196 < X2 )
                         leftChild [ gen: 5 ] ( classified to: 1, samples: 7 )
                         rightChild [ gen: 5 ] ( 243 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 0, samples: 1 )
                              rightChild [ gen: 6 ] ( classified to: 1, samples: 1 )
               rightChild [ gen: 3 ] ( 141 < X21 )
                    leftChild [ gen: 4 ] ( classified to: 1, samples: 12 )
                    rightChild [ gen: 4 ] ( 1 < X3 )
                         leftChild [ gen: 5 ] ( classified to: 0, samples: 1 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 1 )
     rightChild [ gen: 1 ] ( 17 < X27 )
          leftChild [ gen: 2 ] ( 963 < X2 )
               leftChild [ gen: 3 ] ( classified to: 0, samples: 11 )
               rightChild [ gen: 3 ] ( 976 < X2 )
                    leftChild [ gen: 4 ] ( classified to: 1, samples: 2 )
                    rightChild [ gen: 4 ] ( classified to: 0, samples: 10 )
          rightChild [ gen: 2 ] ( classified to: 1, samples: 1 )

I'm trying to verify the validity of these generated trees.

This calculation is wrong, the entropy values that it is calculating is wrong, as none of these split provides total classification, yet one of the split entropies is listed at 0. Need to fix this.

tryIndex 36
getMinEntropyOfFeature()
featureOutputPair [[3, 0], [3, 1], [3, 0], [3, 1], [3, 1], [3, 1], [3, 1], [3, 0], [3, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [5, 0], [5, 0], [6, 0], [6, 0], [7, 0], [9, 0], [9, 0], [11, 0], [13, 0], [13, 0], [13, 0], [13, 0], [45, 0], [45, 0]]
splitPoints: [1, 2, 3, 7]
getBestSplitAndMinEntropy()
upperEntropyOfSplit: 0.0 lowerEntropyOfSplit: 0.676941869781
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 1.0 lowerEntropyOfSplit: 0.605186576633
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.918295834054 lowerEntropyOfSplit: 0.619382194679
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.863120568567 lowerEntropyOfSplit: 0.0
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
entropies: [0.676941869780886, 0.6051865766334206, 0.6193821946787638, 0.0]

This calculation is wrong, the entropy values that it is calculating is wrong, as none of these split provides total classification, yet one of the split entropies is listed at 0. Need to fix this.

tryIndex 36
getMinEntropyOfFeature()
featureOutputPair [[3, 0], [3, 1], [3, 0], [3, 1], [3, 1], [3, 1], [3, 1], [3, 0], [3, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [5, 0], [5, 0], [6, 0], [6, 0], [7, 0], [9, 0], [9, 0], [11, 0], [13, 0], [13, 0], [13, 0], [13, 0], [45, 0], [45, 0]]
splitPoints: [1, 2, 3, 7]
getBestSplitAndMinEntropy()
upperEntropyOfSplit: 0.0 lowerEntropyOfSplit: 0.676941869781
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 1.0 lowerEntropyOfSplit: 0.605186576633
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.918295834054 lowerEntropyOfSplit: 0.619382194679
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.863120568567 lowerEntropyOfSplit: 0.0
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
entropies: [0.676941869780886, 0.6051865766334206, 0.6193821946787638, 0.0]

The implementation is wrong, because the split points are calculated prior and these entropies are calculated at that split point, but the value of split point is calculated later, and if some of the values are same that belong to different split points, then the algorithm generates wrong answer. The implementation is wrong, because similar values would belong to a single split point in the first place.

The implementation is wrong, because the split points are calculated prior and these entropies are calculated at that split point, but the value of split point is calculated later, and if some of the values are same that belong to different split points, then the algorithm generates wrong answer. The implementation is wrong, because similar values would belong to a single split point in the first place.

This issue has been solved in the last commit.

Here is an sample tree generated from the updated code that has an error rate of 26.15%. i.e. it can successfully classify roughly 74% of the cases.

root [ gen: 0 ] ( 3 < X74 )
    leftChild [ gen: 1 ] ( 1 < X207 )
        leftChild [ gen: 2 ] ( 16 < X117 )
            leftChild [ gen: 3 ] ( 13 < X33 )
                leftChild [ gen: 4 ] ( 2 < X89 )
                    leftChild [ gen: 5 ] ( 3 < X677 )
                        leftChild [ gen: 6 ] ( 22 < X11 )
                            leftChild [ gen: 7 ] ( 1 < X9 )
                                leftChild [ gen: 8 ] ( 492 < X2 )
                                    leftChild [ gen: 9 ] ( 14 < X273 )
                                        leftChild [ gen: 10 ] ( 63 < X41 )
                                            leftChild [ gen: 11 ] ( 10 < X10 )
                                                leftChild [ gen: 12 ] ( 1 < X154 )
                                                    leftChild [ gen: 13 ] ( 1 < X409 )
                                                        leftChild [ gen: 14 ] ( classified to: 0, samples: 30 )
                                                        rightChild [ gen: 14 ] ( classified to: 1, samples: 1 )
                                                    rightChild [ gen: 13 ] ( 2 < X22 )
                                                        leftChild [ gen: 14 ] ( classified to: 1, samples: 1 )
                                                        rightChild [ gen: 14 ] ( classified to: 0, samples: 5 )
                                                rightChild [ gen: 12 ] ( classified to: 1, samples: 2 )
                                            rightChild [ gen: 11 ] ( classified to: 1, samples: 1 )
                                        rightChild [ gen: 10 ] ( classified to: 1, samples: 1 )
                                    rightChild [ gen: 9 ] ( classified to: 0, samples: 16 )
                                rightChild [ gen: 8 ] ( 24 < X100 )
                                    leftChild [ gen: 9 ] ( 1 < X672 )
                                        leftChild [ gen: 10 ] ( 3 < X144 )
                                            leftChild [ gen: 11 ] ( 845 < X1 )
                                                leftChild [ gen: 12 ] ( classified to: 1, samples: 33 )
                                                rightChild [ gen: 12 ] ( 1 < X24 )
                                                    leftChild [ gen: 13 ] ( classified to: 1, samples: 5 )
                                                    rightChild [ gen: 13 ] ( classified to: 0, samples: 1 )
                                            rightChild [ gen: 11 ] ( 2 < X31 )
                                                leftChild [ gen: 12 ] ( 60 < X9 )
                                                    leftChild [ gen: 13 ] ( classified to: 0, samples: 1 )
                                                    rightChild [ gen: 13 ] ( classified to: 1, samples: 1 )
                                                rightChild [ gen: 12 ] ( classified to: 0, samples: 1 )
                                        rightChild [ gen: 10 ] ( classified to: 0, samples: 1 )
                                    rightChild [ gen: 9 ] ( classified to: 0, samples: 2 )
                            rightChild [ gen: 7 ] ( classified to: 1, samples: 5 )
                        rightChild [ gen: 6 ] ( classified to: 1, samples: 4 )
                    rightChild [ gen: 5 ] ( 1 < X27 )
                        leftChild [ gen: 6 ] ( classified to: 1, samples: 2 )
                        rightChild [ gen: 6 ] ( classified to: 0, samples: 10 )
                rightChild [ gen: 4 ] ( 29 < X338 )
                    leftChild [ gen: 5 ] ( 6 < X44 )
                        leftChild [ gen: 6 ] ( 30 < X9 )
                            leftChild [ gen: 7 ] ( 245 < X24 )
                                leftChild [ gen: 8 ] ( 43 < X27 )
                                    leftChild [ gen: 9 ] ( classified to: 1, samples: 14 )
                                    rightChild [ gen: 9 ] ( classified to: 0, samples: 1 )
                                rightChild [ gen: 8 ] ( classified to: 0, samples: 1 )
                            rightChild [ gen: 7 ] ( classified to: 1, samples: 9 )
                        rightChild [ gen: 6 ] ( 36 < X34 )
                            leftChild [ gen: 7 ] ( classified to: 0, samples: 2 )
                            rightChild [ gen: 7 ] ( classified to: 1, samples: 1 )
                    rightChild [ gen: 5 ] ( classified to: 0, samples: 1 )
            rightChild [ gen: 3 ] ( classified to: 1, samples: 8 )
        rightChild [ gen: 2 ] ( classified to: 1, samples: 3 )
    rightChild [ gen: 1 ] ( 1 < X2 )
        leftChild [ gen: 2 ] ( 7 < X31 )
            leftChild [ gen: 3 ] ( classified to: 0, samples: 6 )
            rightChild [ gen: 3 ] ( 1 < X29 )
                leftChild [ gen: 4 ] ( classified to: 1, samples: 4 )
                rightChild [ gen: 4 ] ( classified to: 0, samples: 1 )
        rightChild [ gen: 2 ] ( classified to: 0, samples: 13 )
treeErrorRate: 0.261538461538