Far0n / xgbfi

XGBoost Feature Interactions & Importance

Documentation

DeluxeAnalyst opened this issue · comments

Far0n,

Would you be able to write up a short little documentation on how to run this? I have downloaded Visual Studio Express, and loaded the project, but I am unfamiliar with what to do next. I tried clicking on the project and selecting "Start", some stuff runs but the output says it exited with a certain code and nothing else.

How do you tell the program where the dump file is? That might be my problem, but I couldn't figure it out.

Thanks

commented

I will update this repo soon with detailed instructions, (hopefully) proper documentation, and a demo.

Quick Guide:
a) Within Visual Studio: Select "Release" within the dropdown next to the undo/redo buttons in the menu bar
b) Click BUILD -> Build Solution
c) Right-Click @ project XgbFeatureInteractions in the Solution Explorer and choose "Open Folder in File Explorer"
d) Navigate to bin/Release
e) Open "XgbFeatureInteractions.exe.config" with a text editor and edit the settings in the "XgbFeatureInteractions.Properties.Settings" section
f) Run XgbFeatureInteractions.exe

Thanks! I was able to run it using these instructions. Is there a way to have it calculate more than 0- and 1-depth interactions? I tried searching through some of the packages, and I saw where 0-depth was being calculated, but didn't see any reusable code that could easily be converted to calculate 2-depths, 3-depths, etc.

Thanks a lot for your work on this, it's pretty amazing.

commented

Thank you :)

There is a setting called "MaxInteractionDepth" in the "XgbFeatureInteractions.exe.config". If you set it to "-1", xgbfi collects interactions up to the full tree depth.
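For anyone who would rather script this change than edit the file by hand, here is a minimal Python sketch. It assumes the config uses the usual .NET layout of `<setting name="..."><value>...</value></setting>` elements; the embedded sample is a hypothetical excerpt, not the real file, and `set_setting` is a helper name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real file may contain more settings and sections.
SAMPLE_CONFIG = """<configuration>
  <applicationSettings>
    <XgbFeatureInteractions.Properties.Settings>
      <setting name="MaxInteractionDepth" serializeAs="String">
        <value>1</value>
      </setting>
    </XgbFeatureInteractions.Properties.Settings>
  </applicationSettings>
</configuration>
"""


def set_setting(config_xml: str, name: str, value: str) -> str:
    """Return config_xml with the <value> of the named <setting> replaced."""
    root = ET.fromstring(config_xml)
    for setting in root.iter("setting"):
        if setting.get("name") == name:
            setting.find("value").text = value
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"setting {name!r} not found")


# -1 means "collect interactions up to tree depth", as described above.
updated = set_setting(SAMPLE_CONFIG, "MaxInteractionDepth", "-1")
print(updated)
```

To apply it to the real file, read the file's text, pass it through `set_setting`, and write the result back next to the executable.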

Awesome, that worked great.

One last thing that would really put the icing on the cake for me: is there a way for you to have it add up the raw leaf values as well? Each leaf value is the log-odds estimate for that specific ending node within its tree. When doing a logistic regression, the final predicted probability is calculated by adding up the ending leaf value across all trees for a given record, which gives the total log-odds estimate for the record, and then converting that total into a probability.
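The arithmetic described here can be sketched in a few lines of Python. The leaf values are made up, and `base_margin` is an illustrative parameter standing in for whatever constant offset the model adds before the logistic transform; treat it as an assumption rather than a statement about XGBoost internals.

```python
import math


def probability_from_leaves(leaf_values, base_margin=0.0):
    """Sum per-tree leaf values (log-odds) and map the total to a probability."""
    log_odds = base_margin + sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical leaf values: the leaf one record lands in for each of three trees.
leaves = [0.4, -0.1, 0.25]
p = probability_from_leaves(leaves)
print(f"total log-odds = {sum(leaves):.2f}, probability = {p:.3f}")
```

A total log-odds of zero maps to a probability of 0.5, so positive sums push the prediction above one half and negative sums push it below.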

I would love to be able to see the sum of the leaf values for each n-way interaction. I think it would only be applicable for the max InteractionDepth of each tree.

The end result of this would almost be like having regression coefficients. With this X-Y-Z variable interaction, the total log-odds estimate over the whole model would be some value, and the probability could be calculated for it. This would show that when those variables interact with each other, the net result is some positive or negative impact on the predicted probability over the whole model. This would help me in explaining to my customers how certain variables and variable interactions are affecting the response variable.

Let me know what you think.

commented

That sounds like a really nice idea! I'll implement this. I'm not sure yet, but given the cover it should be possible to prune the trees right away to get the log-odds for interactions of all depths.

I don't know if I got your last paragraph right: you want to compare p(X-Y-Z) with p(model) to judge the influence?

Great! I will look forward to it, I think it will be a great addition, even better if you think you can get it for interactions of all depths.

What I mean by the last paragraph: let's say there are three variables (X, Y, Z) and there are two trees in the model that have an interaction between those three variables. By adding together the leaf values, there is a total model predicted probability of that interaction. This would give some indication of how the interaction between these three variables impacts the response variable, whether positive or negative. Granted, there is some interpretation based on how it is splitting those variables, but it might get close to understanding that relationship.

I will have to look at this output more to get a good idea of the correct way to interpret this, but I think it should add some level of extra interpretation.

It may not be as easy as just adding up leaf values, because each n-way interaction will have two leaf nodes, each having a different log-odds value. Unless you can just add them up and if the overall number is positive, the interaction has a net positive impact overall on the response variable.

It might end up being an average effect in predicted probability if you take that approach, which would still be good.

Thoughts?

commented

I'm still not sure whether we mean the same thing by the term interaction. To be sure, let's do it by example if you don't mind:
[Image: logodss (example trees A and B)]

Suppose we have the two trees A & B. In both we have the "interaction" (F1,F2,F3), right?

Let's define the function leaf_val(x,y) = z, where x is the tree, y is the leaf number and z is the log-odds, e.g. leaf_val(A,7) = a.

What should the output look like?

I am on board with your definition of interaction here, and I think your example gets at the "problem" I was trying to identify in the last part of my last post, namely that each interaction in each tree has two leaves.

When I was first coming up with this idea, I was hoping that for the interaction (F1,F2,F3), we could simply do something like leaf_val(A,7) + leaf_val(B,7) = total interaction leaf value. But this ignores leaf_val(A,8) and leaf_val(B,8). So it seems like the correct output would be the sum of all four leaves.

It might be possible to weight each of the leaf values though in some way, possibly by how many records fall into each leaf. If (A,7) has 900 records, and (A,8) has 100, we might not want each one contributing the same amount to the final output.
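One way to read "weight each leaf value by how many records fall into it" is a record-weighted average. The sketch below uses the 900/100 record counts from the post; the leaf values themselves are made up, and `weighted_leaf_value` is a hypothetical helper, not something xgbfi provides.

```python
def weighted_leaf_value(leaves):
    """Average leaf values weighted by the number of records in each leaf.

    `leaves` is a list of (leaf_value, record_count) pairs.
    """
    total_records = sum(count for _, count in leaves)
    if total_records == 0:
        raise ValueError("no records in any leaf")
    return sum(value * count for value, count in leaves) / total_records


# Record counts from the post above; the leaf values are made up.
a7 = (0.30, 900)   # leaf (A,7)
a8 = (-0.50, 100)  # leaf (A,8)

print("plain sum:       ", a7[0] + a8[0])
print("weighted average:", weighted_leaf_value([a7, a8]))
```

With these numbers the plain sum is negative while the weighted average is positive, which is exactly the case where the two approaches would tell a different story.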

As I think about this more, a better option might be to instead look at the values of each split, and then sum all the leaf values under that split. I find that there are usually repeating splits among trees, such that F3 might be splitting on the same criterion in both A and B. In this case, let's say that in both A and B, Split[2] for F3 is: F3 <= 150.

We could calculate the following as two separate values:

  1. (F3 <= 150) = (A,7)+(A,8)+(B,7)+(B,8)
  2. (F3 > 150) = (A,6) + (B,6)

From this, if 1 was a negative number, and 2 was a positive number we would know that when F3 is less than 150 it tends to mean a decrease in predicted probability, and when F3 > 150 it tends to mean an increase in predicted probability.
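The two sums can be written out as a small Python sketch. The grouping of leaves under each side of the split follows the list above; every leaf value is invented for illustration.

```python
# Hypothetical leaf values keyed by (tree, leaf number).
leaf_val = {
    ("A", 6): 0.20, ("A", 7): -0.15, ("A", 8): -0.05,
    ("B", 6): 0.10, ("B", 7): -0.30, ("B", 8): 0.05,
}

# Which leaves sit under each side of the shared split, as listed above.
leaves_under_split = {
    "F3 <= 150": [("A", 7), ("A", 8), ("B", 7), ("B", 8)],
    "F3 > 150": [("A", 6), ("B", 6)],
}

totals = {
    condition: sum(leaf_val[leaf] for leaf in leaves)
    for condition, leaves in leaves_under_split.items()
}

for condition, total in totals.items():
    direction = "decrease" if total < 0 else "increase"
    print(f"{condition}: {total:+.2f} (suggests a {direction})")
```

The sign of each total is only a rough indicator, since it ignores how many records reach each leaf.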

This is perhaps the better approach to calculating this "total leaf values" by interaction.

Thoughts?

commented

That sounds interesting, but I would suggest the following to start with and iterating from there:

Collect the following stats for each feature interaction fi:

  • sum of left leaf node values
  • sum of right leaf node values
  • total number of samples in left leaf nodes
  • total number of samples in right leaf nodes
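The four statistics in this list could be accumulated roughly as follows. This is a sketch under stated assumptions, not xgbfi's implementation: the record layout, the `LeafStats` class, and all numbers are hypothetical, and the sample counts stand in for the cover values in the model dump.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LeafStats:
    sum_left_values: float = 0.0
    sum_right_values: float = 0.0
    left_samples: float = 0.0
    right_samples: float = 0.0


def collect_leaf_stats(records):
    """Accumulate the four statistics per feature interaction.

    Each record is (interaction, left_value, left_cover, right_value,
    right_cover) and describes one final split whose children are leaves.
    """
    stats = defaultdict(LeafStats)
    for interaction, left_value, left_cover, right_value, right_cover in records:
        entry = stats[interaction]
        entry.sum_left_values += left_value
        entry.sum_right_values += right_value
        entry.left_samples += left_cover
        entry.right_samples += right_cover
    return dict(stats)


# Made-up numbers for the interaction (F1, F2, F3) in trees A and B.
records = [
    (("F1", "F2", "F3"), -0.15, 900, -0.05, 100),  # tree A, leaves 7 and 8
    (("F1", "F2", "F3"), -0.30, 400, 0.05, 600),   # tree B, leaves 7 and 8
]
stats = collect_leaf_stats(records)
print(stats[("F1", "F2", "F3")])
```

Keeping the sample counts next to the sums leaves the door open for the weighting idea raised earlier in the thread.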

Thoughts? :)

I think that sounds like a great place to start!

commented

So be it. ^^

commented

Ready. It creates a new sheet, "Leaf Statistics", if MaxInteractionDepth is set to -1.

I just downloaded the new version, but it gets 24 errors during the build in Visual Studio Express. I tried it on two computers and both gave the same result. I could give you the log, but I don't have permission to write to this repository. Here is a screenshot of some of the errors, which it did let me attach.

[Screenshot: xgbfi errors]

commented

The automatic download of the NuGet packages failed. Try to install them manually:
a) Right-Click @ project XgbFeatureInteractions in the Solution Explorer -> "Manage NuGet Packages..."
b) Install EPPlus & NGenerics

Got it working, thanks! One thing, it seems to only be showing me interactions for the variable that has the top gain, it is happening with the original feature interactions you did as well. Know why this might be happening?

It looks good though! I do think it might be good to go one step further now, and show specific node splits, and the sum of left leaves and sum of right leaves. I think this could just be done for all 0-interaction splits, since there would be a massive number of combinations, but this would give that extra actionability in understanding how specific variables are influencing the response. It would be useful to say, here is the most important variable and here are the different splits the model performs on that variable and their effect on the response variable.

What do you think?

Great job on this last part!

commented

"One thing, it seems to only be showing me interactions for the variable that has the top gain, it is happening with the original feature interactions you did as well. Know why this might be happening?"

I'm not able to reproduce this. Could you send me a model dump, the xgbfi-config you used and the xgb params?

Never mind the error I was experiencing. MaxDeepening was set to 0, and I was changing it in Visual Studio Express, but the program for some reason wasn't picking up my change. I edited the raw config file in Notepad and then it worked perfectly.

commented

I would do the following now:
a) Calculating split value histograms (SVH) for each feature (0-way-interaction)
b) Calculating SVH for subsets of n-way-interactions
c) Calculating SVH for each root-to-leaf path
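Step a) can be illustrated with a small Python sketch that counts split values per feature from an XGBoost text dump. It is only an assumption about how such a histogram might be gathered, not xgbfi's code; the sample dump is made up and trimmed to the fields the parser needs.

```python
import re
from collections import Counter, defaultdict

# Matches split lines of an XGBoost text dump, e.g. "0:[F3<150] yes=1,no=2,..."
SPLIT_RE = re.compile(r"^\s*\d+:\[(?P<feature>[^<\]]+)<(?P<value>[^\]]+)\]")


def split_value_histograms(dump_text):
    """Count how often each split value is used, per feature."""
    histograms = defaultdict(Counter)
    for line in dump_text.splitlines():
        match = SPLIT_RE.match(line)
        if match:  # leaf lines and "booster[n]:" headers do not match
            histograms[match["feature"]][float(match["value"])] += 1
    return histograms


# Made-up two-tree dump in the text format written by Booster.dump_model().
SAMPLE_DUMP = """booster[0]:
0:[F1<0.5] yes=1,no=2,missing=1
\t1:[F3<150] yes=3,no=4,missing=3
\t\t3:leaf=-0.15
\t\t4:leaf=0.2
\t2:leaf=0.1
booster[1]:
0:[F3<150] yes=1,no=2,missing=1
\t1:leaf=-0.3
\t2:[F3<200] yes=3,no=4,missing=3
\t\t3:leaf=0.05
\t\t4:leaf=0.1
"""

for feature, histogram in sorted(split_value_histograms(SAMPLE_DUMP).items()):
    print(feature, dict(histogram))
```

Restricting the loop to a user-supplied set of feature names would be one simple form of the querying mechanism mentioned in this post.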

In order to shrink the output / handle the complexity, I'm considering some sort of querying mechanism. That is, one can define which features (or interactions) are of interest before the parsing starts.

I think that is a great idea, because it would become quite unwieldy if it calculated everything, so being able to choose is great, and gives more EDA power to the model.

@Far0n Is there an easy and one-click way to use this? (without VS, just one executable, etc)

@terrytangyuan How it works is that you have to compile it once in VS, and then an executable is created that is just one click to run.

@Far0n Correct me if I'm wrong, but you should be able to compile it on your end and then publish the executable here, correct? That way people don't need to compile it in VS themselves.

commented

Yeah, I can put a pre-compiled binary in the repo. That should work fine on Windows as well as Linux.

commented

@terrytangyuan I committed the compiled binary.

@Far0n Thanks! But it looks like I cannot run it on Mac though :-(

commented

@terrytangyuan Is it not working in conjunction with Mono?

@Far0n Is there a way for the left and right leaf values portion to show the calculations for each individual variable, with no interactions?

commented

@DeluxeAnalyst I'm not sure I can follow. What do you mean by "calculations" here?

I just mean the sum of all leaves to the left and the sum of all leaves to the right. Right now it is doing that calculation for all interaction depths, and I was curious to see, for each individual variable in the model, what the sum of all leaves to the left and right is.

commented

I'm curious whether that is well defined for a single feature. Consider tree A from above (Oct. 16 post).
What are the left leaves of feature F1:

  • leaves 3 & 7 (because they are the left leaves of all subtrees)

or

  • leaves 3 & 4 (because they are the leaves of the left subtree of root)

It would be the second situation, leaves 3 and 4, which I believe is supported by the xgb output.
For Feature F2 it would be 3 and 7.
For Feature F3 it would be 7 and 8.
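The second definition (all leaves of the left subtree of each split) can be checked with a short Python sketch. Since the figure is not reproduced here, the shape of tree A below is reconstructed from the node and leaf numbers quoted in this thread and should be treated as an assumption.

```python
from collections import defaultdict

# Tree A as reconstructed from the node numbers quoted in this thread
# (an assumption): split nodes map to (feature, left child, right child),
# every other node id is a leaf.
TREE_A = {
    0: ("F1", 1, 2),
    1: ("F2", 3, 4),
    2: ("F3", 5, 6),
    5: ("F2", 7, 8),
}


def leaves_below(tree, node):
    """Return the leaf ids in the subtree rooted at `node`, left to right."""
    if node not in tree:
        return [node]
    _, left, right = tree[node]
    return leaves_below(tree, left) + leaves_below(tree, right)


def left_leaves_by_feature(tree):
    """For each split, collect all leaves of its left subtree, per feature."""
    result = defaultdict(list)
    for node, (feature, left, _right) in sorted(tree.items()):
        result[feature].extend(leaves_below(tree, left))
    return dict(result)


print(left_leaves_by_feature(TREE_A))
```

Swapping `left` for `_right` in `left_leaves_by_feature` gives the matching right-hand leaves, so both sums can be built from the same traversal.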

If the left route always represents lower values of the feature (which I believe is the case), a sort of pseudo-correlation can be discovered between each feature and the response variable.

One step further would be finding each feature and each split of that feature, like we discussed before (where you talked about being able to tell the tool which variable you are interested in). In this case, for Tree A we would get two values for F2, at split 1 and split 5.

commented

Both shouldn't be a problem.

That would be awesome :) I have been working with a lot of xgboost lately, and with those extra two features this would be the perfect xgboost companion tool.

commented

You will get it :)

@Far0n Is there any way we could run this tool on mac, as asked by @terrytangyuan ?

I have tried using wine. Unfortunately, this is the error I get -
err:process:create_process L"Z:\Users\p\Documents\Softwares\xgbfi\bin\XgbFeatureInteractions.exe" not supported on this installation (x86_64 binary)
wine: Bad EXE format for Z:\Users\p\Documents\Softwares\xgbfi\bin\XgbFeatureInteractions.exe.

commented

@binga Mac is unknown territory for me, but I can confirm that xgbfi runs with Mono.

Works like a charm with Mono! Thank you @Far0n
I'll raise a separate issue if I encounter any problems. I hope that's fine!

@Far0n Thank you - this is a great tool. The readme is clear on the definition of gain for Interaction Depth 0 (i.e. single features), but could you confirm how the gain is defined for higher interaction depths? Using the example in the readme, what is the gain of F2 | F3? Is it, e.g., the sum of all gains from F2 or F3 splits, but only from trees which use both features? Or maybe only from F2 / F3 nodes which have a prior F3 / F2 in the tree's path?

commented

@PeteLowth Similar to the latter, but only if F2 (or F3) is a direct predecessor.

@Far0n Thanks - that makes sense. In the case of an F2 followed directly by an F3, would you count the gain from both nodes or just the F3 node?

commented

@PeteLowth from both nodes
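One possible reading of the two answers above, for pairs of features only, is sketched below in Python: whenever one split is the direct parent of another, the gains of both nodes are added to that pair's total. This is an interpretation of the replies, not xgbfi's actual code, and the two trees and their gains are made up.

```python
from collections import defaultdict

# Made-up trees: node id -> (feature, gain, left child, right child).
# Ids missing from a tree are leaves.
TREES = [
    {
        0: ("F1", 10.0, 1, 2),
        1: ("F2", 4.0, 3, 4),
        2: ("F3", 6.0, 5, 6),
        5: ("F2", 2.0, 7, 8),
    },
    {
        0: ("F3", 8.0, 1, 2),
        1: ("F2", 3.0, 3, 4),
    },
]


def pairwise_interaction_gain(trees):
    """Sum parent gain + child gain over every directly connected pair of splits."""
    gains = defaultdict(float)
    for tree in trees:
        for feature, gain, left, right in tree.values():
            for child in (left, right):
                if child in tree:  # child is a split, not a leaf
                    child_feature, child_gain = tree[child][:2]
                    key = "|".join(sorted((feature, child_feature)))
                    gains[key] += gain + child_gain
    return dict(gains)


print(pairwise_interaction_gain(TREES))
```

Here F2|F3 picks up 6 + 2 from the first tree (F3 directly above F2) and 8 + 3 from the second. The F2 split at node 1 of the first tree does not count towards F2|F3 because its direct parent is F1.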

Can this package be used on Linux?

@Far0n thank you for implementing this great tool! I have been using it recently and it works pretty well. Is it possible to output more than 100 feature interactions?

commented

@Sabrinaaaaaa yes, just edit

<setting name="TopK" serializeAs="String">
        <value>100</value>
</setting>

in XgbFeatureInteractions.exe.config

Please add an example with two trees, so that I understand how the score is computed in general.

Thanks!