imbs-hl / timbR

Tree interpretation methods based on ranger

inconsistency in measure_distances(rf , "splitting variables")

s81320 opened this issue:

I attached a script (I had to append .txt to the .R file name) that shows what I think might be inconsistent.

For the iris data (with 4 features / independent variables), whenever one tree uses all 4 variables for splitting and another uses 3, the dissimilarity should be 1/4. With the current code we sometimes get 0.25 and sometimes 0.

For trees that each use 3 split variables, but not the same 3, the dissimilarity should be 2/4. The current code returns 0 or 0.25, but never 0.5.
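
To make the expected values concrete, here is a minimal base R sketch. It assumes the dissimilarity is the share of variables used for splitting in exactly one of the two trees, which is consistent with the 1/4 and 2/4 above; `split_var_dissimilarity` is a hypothetical helper for illustration, not timbR's implementation.

```r
# Hypothetical helper (not part of timbR): share of variables that are
# used for splitting in exactly one of the two trees.
split_var_dissimilarity <- function(vars_a, vars_b, n_vars) {
  only_in_one <- union(setdiff(vars_a, vars_b), setdiff(vars_b, vars_a))
  length(only_in_one) / n_vars
}

# The four iris predictors.
all_vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

tree_a <- all_vars        # splits on all 4 variables
tree_b <- all_vars[-2]    # splits on 3 variables (no Sepal.Width)
tree_c <- all_vars[-1]    # splits on a different set of 3 (no Sepal.Length)

d_ab <- split_var_dissimilarity(tree_a, tree_b, length(all_vars))  # expected 1/4
d_bc <- split_var_dissimilarity(tree_b, tree_c, length(all_vars))  # expected 2/4
```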

error_timbR.R.txt

Thank you very much for bringing this up. This issue is related to an inconsistency in ranger itself.
If you extract the splitVarIDs by calling the function treeInfo, the splitting variables are numbered from 0 onwards, while leaves have a splitVarID of NA. If you access them via rf$forest$splitVarIds, leaves also get 0. So 0 is used twice: once for the first variable and once for leaves.
I have now changed the code to get the splitVarIDs from treeInfo, which should solve the problem.
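
The collision can be sketched with made-up ID vectors (these are illustrative values, not real ranger output, and `used_vars` is a hypothetical helper). When leaves are coded as 0, a tree that never splits on variable 0 still appears to use it, so it looks identical to a tree that uses all four variables; with NA for leaves the difference is visible again.

```r
# Hypothetical helper: the set of variable IDs a tree splits on, ignoring NA.
used_vars <- function(ids) sort(unique(ids[!is.na(ids)]))

# Tree 1 splits on variables 0, 1, 2, 3 (4 split nodes, 5 leaves).
# Tree 2 splits on variables 1, 2, 3 only (3 split nodes, 4 leaves).

# Leaves coded as 0, as described for the IDs stored in rf$forest.
full_forest  <- c(0, 1, 2, 3, 0, 0, 0, 0, 0)
three_forest <- c(1, 2, 3, 0, 0, 0, 0)

# Leaves coded as NA, as described for treeInfo.
full_treeinfo  <- c(0, 1, 2, 3, NA, NA, NA, NA, NA)
three_treeinfo <- c(1, 2, 3, NA, NA, NA, NA)

# With 0 for leaves, both trees seem to use the same variables,
# which would yield a dissimilarity of 0 instead of 1/4.
same_with_zero <- identical(used_vars(full_forest), used_vars(three_forest))

# With NA for leaves, exactly one variable differs between the trees.
n_diff_with_na <- length(setdiff(used_vars(full_treeinfo),
                                 used_vars(three_treeinfo)))
```

This would also explain why the result was only sometimes wrong: under this reading, the miscount occurs only when one of the trees does not split on variable 0.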