Documentation

DeluxeAnalyst opened this issue 9 years ago · comments

Far0n,

Would you be able to write up a short little documentation on how to run this? I have downloaded Visual Studio Express, and loaded the project, but I am unfamiliar with what to do next. I tried clicking on the project and selecting "Start", some stuff runs but the output says it exited with a certain code and nothing else.

How do you tell the program where the dump file is? That might be my problem, but I couldn't figure it out.

Thanks

Far0n · Answer 1 · Fri Oct 16 2015 02:25:15 GMT+0800 (China Standard Time)

I will update this repo soon with detailed instructions, hopefully proper documentation, and a demo.

Quick Guide:
a) In Visual Studio, select "Release" in the dropdown next to the undo/redo buttons in the menu bar
b) Click BUILD -> Build Solution
c) Right-click the project XgbFeatureInteractions in the Solution Explorer and choose "Open Folder in File Explorer"
d) Navigate to bin/Release
e) Open "XgbFeatureInteractions.exe.config" with a text editor and edit the values under "XgbFeatureInteractions.Properties.Settings"
f) Run XgbFeatureInteractions.exe

DeluxeAnalyst · Answer 2 · Fri Oct 16 2015 03:47:19 GMT+0800 (China Standard Time)

Thanks! I was able to run it using these instructions. Is there a way to have it calculate more than 0- and 1-depth interactions? I tried searching through some of the packages, and I saw where 0-depth was being calculated, but didn't see any reusable code that could easily be converted to calculate 2-depths, 3-depths, etc.

Thanks a lot for your work on this, it's pretty amazing.

Far0n · Answer 3 · Fri Oct 16 2015 03:54:41 GMT+0800 (China Standard Time)

Thank you :)

There is a setting called "MaxInteractionDepth" in the "XgbFeatureInteractions.exe.config". If you set it to "-1", xgbfi collects interactions up to the full tree depth.

DeluxeAnalyst · Answer 4 · Fri Oct 16 2015 04:26:08 GMT+0800 (China Standard Time)

Awesome, that worked great.

One last thing that would really put the icing on the cake for me. Is there a way for you to have it add up the raw Leaf Values as well? Each of the Leaf values is the log-odds estimate for that specific ending node within each tree. When doing a logistic regression, the final predicted probability is calculated by adding up the ending leaf value across all trees for a given record, which is the total log-odds estimate for the record, and then converts it into a probability.
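The calculation described above can be sketched in Python. This is only a minimal illustration, assuming a binary logistic objective: the leaf values are made up, `predicted_probability` is a hypothetical helper (not part of xgbfi or xgboost), and any global bias term the real model adds before the conversion (such as XGBoost's `base_score`) is ignored.

```python
import math


def predicted_probability(leaf_values):
    """Sum per-tree leaf values (log-odds) and map the total to a probability."""
    total_log_odds = sum(leaf_values)
    # Logistic (sigmoid) function turns log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-total_log_odds))


# Hypothetical leaf values that one record ends up in, one per tree.
leaves_for_record = [0.2, -0.1, 0.3]
print(predicted_probability(leaves_for_record))
```

A total log-odds of zero maps to a probability of 0.5, so positive sums push the prediction above 0.5 and negative sums push it below.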

I would love to be able to see the sum of the leaf values for each n-way interaction. I think it would only be applicable for the max InteractionDepth of each tree.

The end result of this would almost be like having regression coefficients. With this X-Y-Z variable interaction, the total log-odds estimate over the whole model would be some value, and the probability could be calculated for it. This would show that when those variables interact with each other, the net result is some positive or negative impact on the predicted probability over the whole model. This would help me in explaining to my customers how certain variables and variable interactions are affecting the response variable.

Let me know what you think.

Far0n · Answer 5 · Fri Oct 16 2015 05:12:42 GMT+0800 (China Standard Time)

That sounds like a really nice idea! I'll implement this. I'm not sure yet, but given the cover it should be possible to prune the trees right away to get the log-odds for interactions of all depths.

I don't know if I got your last paragraph right: you want to compare p(X-Y-Z) with p(model) to judge the influence?

DeluxeAnalyst · Answer 6 · Fri Oct 16 2015 05:24:20 GMT+0800 (China Standard Time)

Great! I will look forward to it, I think it will be a great addition, even better if you think you can get it for interactions of all depths.

What I mean by the last paragraph: let's say there are three variables (X,Y,Z) and there are two trees in the model that have an interaction between those three variables. By adding together the leaf values, there is a total model predicted probability of that interaction. This would give some indication of how the interaction between these three variables impacts the response variable, whether positive or negative. Granted, there is some interpretation based on how it is splitting those variables, but it might get close to understanding that relationship.

I will have to look at this output more to get a good idea of the correct way to interpret this, but I think it should add some level of extra interpretation.

It may not be as easy as just adding up leaf values, because each n-way interaction will have two leaf nodes, each having a different log-odds value. Unless you can just add them up and if the overall number is positive, the interaction has a net positive impact overall on the response variable.

It might end up being an average effect in predicted probability if you take that approach, which would still be good.

Thoughts?

Far0n · Answer 7 · Fri Oct 16 2015 18:24:41 GMT+0800 (China Standard Time)

I'm still not sure whether we mean the same thing by the term interaction. To be sure, let's do it by example if you don't mind:

Suppose we have two trees, A and B. Both contain the "interaction" (F1,F2,F3), right?

Let's define the function leaf_val(x,y) = z, where x is the tree, y is the leaf number, and z is the log-odds value, e.g. leaf_val(A,7) = a.

What should the output look like?

DeluxeAnalyst · Answer 8 · Mon Oct 19 2015 20:43:23 GMT+0800 (China Standard Time)

I am on board with your definition of interaction here, and I think your example gets at the "problem" I was trying to identify in the last part of my last post, namely that each interaction in each tree has two leaves.

When I was first coming up with this idea, I was hoping that for the interaction (F1,F2,F3), we could simply do something like leaf_val(A,7) + leaf_val(B,7) = total interaction leaf value. But this ignores leaf_val(A,8) and leaf_val(B,8). So it seems like the correct output would be the sum of all four leaves.

It might be possible to weight each of the leaf values though in some way, possibly by how many records fall into each leaf. If (A,7) has 900 records, and (A,8) has 100, we might not want each one contributing the same amount to the final output.
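One way to picture the weighting idea is a record-weighted average of the leaf values. The sketch below is only an illustration: `weighted_leaf_value` is a hypothetical helper, the record counts 900 and 100 come from the example above, and the two leaf values are invented.

```python
def weighted_leaf_value(leaf_values, record_counts):
    """Average leaf values, weighting each leaf by the number of records in it."""
    total_records = sum(record_counts)
    if total_records == 0:
        raise ValueError("at least one leaf must contain records")
    weighted_sum = sum(v * n for v, n in zip(leaf_values, record_counts))
    return weighted_sum / total_records


# Hypothetical log-odds for leaves (A,7) and (A,8); counts from the example above.
values = [-0.4, 0.6]
counts = [900, 100]
print(sum(values))                          # plain sum of the two leaves
print(weighted_leaf_value(values, counts))  # record-weighted average
```

With these made-up numbers the plain sum is positive while the weighted average is negative, because the leaf holding 900 records dominates.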

As I think about this more, a better option might be to instead look at the values of each split, and then sum all the leaf values under that split. I find that there are usually repeating splits among trees, such that F3 might be splitting on the same criteria in both A and B. In this case, let's say that in both A and B, Split[2] for F3 is: F3 <= 150.

We could calculate the following as two separate values:

1. (F3 <= 150) = (A,7) + (A,8) + (B,7) + (B,8)
2. (F3 > 150) = (A,6) + (B,6)

From this, if value 1 is negative and value 2 is positive, we would know that F3 <= 150 tends to mean a decrease in predicted probability, and F3 > 150 tends to mean an increase in predicted probability.

This is perhaps the better approach to calculating this "total leaf values" by interaction.
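The per-split sums above could be sketched roughly as follows. The leaf values are invented for illustration, and `sum_leaves_by_split_side` is a hypothetical helper; how each leaf is assigned to a side of the split is assumed to be known already.

```python
from collections import defaultdict


def sum_leaves_by_split_side(leaf_records):
    """Sum leaf values per split condition.

    leaf_records: iterable of (condition, leaf_value) pairs, where the
    condition names the side of the split that the leaf sits under.
    """
    totals = defaultdict(float)
    for condition, leaf_value in leaf_records:
        totals[condition] += leaf_value
    return dict(totals)


# Hypothetical leaf values for trees A and B.
records = [
    ("F3 <= 150", -0.30),  # (A,7)
    ("F3 <= 150", -0.10),  # (A,8)
    ("F3 <= 150", -0.20),  # (B,7)
    ("F3 <= 150", -0.05),  # (B,8)
    ("F3 > 150", 0.25),    # (A,6)
    ("F3 > 150", 0.15),    # (B,6)
]
print(sum_leaves_by_split_side(records))
```

With these numbers the "F3 <= 150" total is negative and the "F3 > 150" total is positive, which is the pattern described above.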

Thoughts?

Far0n · Answer 9 · Tue Oct 20 2015 13:38:40 GMT+0800 (China Standard Time)

That sounds interesting, but I would suggest the following to start with and iterating from there:

Collect the following stats for each feature interaction fi:

sum of left leaf node values
sum of right leaf node values
total number of samples in left leaf nodes
total number of samples in right leaf nodes
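The four statistics listed above could be accumulated along these lines. This is a sketch, not xgbfi's implementation: `LeafStats` and `collect_leaf_stats` are hypothetical names, and using the dump's cover value as a stand-in for the sample count is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LeafStats:
    left_value_sum: float = 0.0
    right_value_sum: float = 0.0
    left_samples: float = 0.0
    right_samples: float = 0.0


def collect_leaf_stats(observations):
    """Accumulate leaf statistics per feature interaction.

    observations: iterable of (interaction, side, leaf_value, cover) tuples,
    with side being "left" or "right". Cover is assumed to approximate the
    number of samples in the leaf.
    """
    stats = defaultdict(LeafStats)
    for interaction, side, leaf_value, cover in observations:
        entry = stats[interaction]
        if side == "left":
            entry.left_value_sum += leaf_value
            entry.left_samples += cover
        elif side == "right":
            entry.right_value_sum += leaf_value
            entry.right_samples += cover
        else:
            raise ValueError(f"unknown side: {side!r}")
    return dict(stats)


# Hypothetical leaves for the interaction F1|F2|F3 in two trees.
observations = [
    ("F1|F2|F3", "left", -0.5, 900),
    ("F1|F2|F3", "right", 0.25, 100),
    ("F1|F2|F3", "left", -0.25, 700),
    ("F1|F2|F3", "right", 0.5, 300),
]
print(collect_leaf_stats(observations)["F1|F2|F3"])
```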

Thoughts? :)

DeluxeAnalyst · Answer 10 · Tue Oct 20 2015 20:12:25 GMT+0800 (China Standard Time)

I think that sounds like a great place to start!

Far0n · Answer 11 · Tue Oct 20 2015 20:52:01 GMT+0800 (China Standard Time)

So be it. ^^

Far0n · Answer 12 · Fri Oct 23 2015 21:44:48 GMT+0800 (China Standard Time)

Ready. It creates a new sheet "Leaf Statistics" if MaxInteractionDepth is set to -1.

DeluxeAnalyst · Answer 13 · Sat Oct 24 2015 03:18:44 GMT+0800 (China Standard Time)

I just downloaded the new version, but it gets 24 errors during the build in Visual Studio Express. I tried it on two computers and both gave the same result. I could give you the log, but I don't have permission to write to this repository. Here is a screenshot of some of the errors, though; it let me attach that.

Far0n · Answer 14 · Sat Oct 24 2015 03:39:40 GMT+0800 (China Standard Time)

The automatic download of the NuGet packages failed. Try installing them manually:
a) Right-click the project XgbFeatureInteractions in the Solution Explorer -> "Manage NuGet Packages..."
b) Install EPPlus & NGenerics

DeluxeAnalyst · Answer 15 · Mon Oct 26 2015 20:35:39 GMT+0800 (China Standard Time)

Got it working, thanks! One thing, it seems to only be showing me interactions for the variable that has the top gain, it is happening with the original feature interactions you did as well. Know why this might be happening?

It looks good though! I do think it might be good to go one step further now, and show specific node splits, and the sum of left leaves and sum of right leaves. I think this could just be done for all 0-interaction splits, since there would be a massive number of combinations, but this would give that extra actionability in understanding how specific variables are influencing the response. It would be useful to say, here is the most important variable and here are the different splits the model performs on that variable and their effect on the response variable.

What do you think?

Great job on this last part!

Far0n · Answer 16 · Mon Oct 26 2015 21:06:15 GMT+0800 (China Standard Time)

"One thing, it seems to only be showing me interactions for the variable that has the top gain, it is happening with the original feature interactions you did as well. Know why this might be happening?"

I'm not able to reproduce this. Could you send me a model dump, the xgbfi-config you used and the xgb params?

DeluxeAnalyst · Answer 17 · Mon Oct 26 2015 21:16:25 GMT+0800 (China Standard Time)

Never mind the error I was experiencing. MaxDeepening was set to 0, and I was changing it in Visual Studio Express, but for some reason the program wasn't picking up my change. I edited the raw config file in Notepad and then it worked perfectly.

Far0n · Answer 18 · Wed Oct 28 2015 18:22:34 GMT+0800 (China Standard Time)

I would do the following now:
a) Calculating split value histograms (SVH) for each feature (0-way-interaction)
b) Calculating SVH for subsets of n-way-interactions
c) Calculating SVH for each root-to-leaf path

In order to shrink the output / handle the complexity, I'm considering some sort of querying mechanism. That is, one can define which features (or interactions) are of interest before the parsing starts.
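To make step a) and the querying idea concrete, here is a rough Python sketch that counts split thresholds per feature from a text model dump and optionally restricts the result to features of interest. It is not how xgbfi does it: `split_value_histograms` is a hypothetical helper, the sample dump is invented, and the line layout (`id:[feature<threshold] yes=...`) is assumed to follow XGBoost's plain-text dump format.

```python
import re
from collections import Counter, defaultdict

# Matches the "[feature<threshold]" part of a split line in a text dump.
SPLIT_PATTERN = re.compile(r"\[([^<\]]+)<([^\]]+)\]")


def split_value_histograms(dump_text, features_of_interest=None):
    """Count how often each split threshold is used, per feature."""
    histograms = defaultdict(Counter)
    for line in dump_text.splitlines():
        match = SPLIT_PATTERN.search(line)
        if match is None:
            continue  # leaf lines and "booster[n]:" headers carry no split
        feature, threshold = match.group(1), float(match.group(2))
        if features_of_interest is not None and feature not in features_of_interest:
            continue  # the "query": skip features nobody asked about
        histograms[feature][threshold] += 1
    return {feature: dict(counts) for feature, counts in histograms.items()}


# Invented two-tree dump for illustration.
SAMPLE_DUMP = """booster[0]:
0:[F1<0.5] yes=1,no=2,missing=1,gain=120.0,cover=1000
    1:[F3<150] yes=3,no=4,missing=3,gain=40.0,cover=600
        3:leaf=-0.2,cover=400
        4:leaf=0.1,cover=200
    2:leaf=0.3,cover=400
booster[1]:
0:[F3<150] yes=1,no=2,missing=1,gain=60.0,cover=1000
    1:leaf=-0.1,cover=700
    2:leaf=0.2,cover=300
"""

print(split_value_histograms(SAMPLE_DUMP))
print(split_value_histograms(SAMPLE_DUMP, features_of_interest={"F3"}))
```

Filtering before counting keeps the output small when only a few features matter, which is the motivation given above for a querying mechanism.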

DeluxeAnalyst · Answer 19 · Wed Oct 28 2015 20:52:35 GMT+0800 (China Standard Time)

I think that is a great idea, because it would become quite unwieldy if it calculated everything, so being able to choose is great, and gives more EDA power to the model.

Yuan Tang · Answer 20 · Fri Nov 20 2015 05:03:20 GMT+0800 (China Standard Time)

@Far0n Is there an easy and one-click way to use this? (without VS, just one executable, etc)

DeluxeAnalyst · Answer 21 · Fri Nov 20 2015 05:50:20 GMT+0800 (China Standard Time)

@terrytangyuan How it works is that you have to compile it once in VS, and then an executable is created that you can run with one click.

@Far0n Correct me if I'm wrong, but you should be able to compile it on your end and then publish the executable here, correct? That way people don't need to compile it in VS themselves.

Far0n · Answer 22 · Fri Nov 20 2015 05:52:38 GMT+0800 (China Standard Time)

Yeah, I can put a pre-compiled binary in the repo. That should work fine on Windows as well as Linux.

Far0n · Answer 23 · Fri Nov 20 2015 05:57:45 GMT+0800 (China Standard Time)

@terrytangyuan I committed the compiled binary.

Yuan Tang · Answer 24 · Sat Nov 21 2015 04:28:58 GMT+0800 (China Standard Time)

@Far0n Thanks! But it looks like I cannot run it on mac though :-(

Far0n · Answer 25 · Sat Nov 21 2015 17:40:39 GMT+0800 (China Standard Time)

@terrytangyuan not working in conjunction with mono?

DeluxeAnalyst · Answer 26 · Tue Nov 24 2015 23:44:25 GMT+0800 (China Standard Time)

@Far0n Is there a way for the left and right leaf values portion for it to show the calculations for each individual variable with no interactions?

Far0n · Answer 27 · Tue Nov 24 2015 23:52:51 GMT+0800 (China Standard Time)

@DeluxeAnalyst I'm not sure I can follow. What do you mean by "calculations" here?

DeluxeAnalyst · Answer 28 · Tue Nov 24 2015 23:58:03 GMT+0800 (China Standard Time)

I just mean the sum of all leaves to the left and the sum of all leaves to the right. Right now it is doing that calculation for all interaction depths, and I was curious to see, for each individual variable in the model, what the sum of all leaves to the left and right is.

Far0n · Answer 29 · Wed Nov 25 2015 00:24:03 GMT+0800 (China Standard Time)

I'm curious whether that is well defined for a single feature. Consider tree A from above (Oct. 16 post).
What are the left leaves of feature F1:

leaves 3 & 7 (because they are the left leaves of all subtrees)

or

leaves 3 & 4 (because they are the leaves of the left subtree of root)

DeluxeAnalyst · Answer 30 · Wed Nov 25 2015 00:31:02 GMT+0800 (China Standard Time)

It would be the second situation, leaves 3 and 4 which I believe is supported with the xgb output.
For Feature F2 it would be 3 and 7.
For Feature F4 it would be 7 and 8.

If the left route always represents lower values of the feature, (which I believe is the case), there can be a sort of pseudo correlation discovered between each feature and the response variable.

What would be one step further would be finding each Feature and split of that feature, like we discussed before (where you talked about being able to tell the tool which variable you are interested in). In this case, for Tree A we would get two values for F2, at split 1 and split 5.

Far0n · Answer 31 · Wed Nov 25 2015 00:34:58 GMT+0800 (China Standard Time)

Both shouldn't be a problem.

DeluxeAnalyst · Answer 32 · Wed Nov 25 2015 00:36:49 GMT+0800 (China Standard Time)

That would be awesome :) I have been working with a lot of xgboost lately, and with those extra two features this would be the perfect xgboost companion tool.

Far0n · Answer 33 · Wed Nov 25 2015 00:37:56 GMT+0800 (China Standard Time)

You will get it :)

Phani Srikanth · Answer 34 · Thu Feb 18 2016 21:20:05 GMT+0800 (China Standard Time)

@Far0n Is there any way we could run this tool on mac, as asked by @terrytangyuan ?

I have tried using wine. Unfortunately, this is the error I get -
err:process:create_process L"Z:\Users\p\Documents\Softwares\xgbfi\bin\XgbFeatureInteractions.exe" not supported on this installation (x86_64 binary)
wine: Bad EXE format for Z:\Users\p\Documents\Softwares\xgbfi\bin\XgbFeatureInteractions.exe.

Far0n · Answer 35 · Thu Feb 18 2016 23:13:46 GMT+0800 (China Standard Time)

@binga mac is an unknown domain for me, but I can confirm, that xgbfi runs with mono.

Phani Srikanth · Answer 36 · Thu Feb 18 2016 23:31:13 GMT+0800 (China Standard Time)

Works like a charm with Mono! Thank you @Far0n
I'll raise a separate issue if I encounter any problems. I hope that is fine!

Peatle · Answer 37 · Tue Sep 13 2016 18:03:16 GMT+0800 (China Standard Time)

@Far0n Thank you - this is a great tool. The readme is clear on the definition of gain for Interaction Depth 0 (i.e. single features), but could you confirm how the gain is defined for higher interaction depths? Using the example in the readme, what is the gain of F2 | F3? Is it, e.g., the sum of all gains from F2 or F3 splits but only from trees which use both features? Or maybe only from F2 / F3 nodes which have a prior F3 / F2 in the tree's path?

Far0n · Answer 38 · Tue Sep 13 2016 19:33:36 GMT+0800 (China Standard Time)

@PeteLowth Similar to the latter, but only if F2 (or F3) is a direct predecessor.

Peatle · Answer 39 · Tue Sep 13 2016 20:28:48 GMT+0800 (China Standard Time)

@Far0n Thanks - that makes sense. In the case of an F2 followed directly by an F3, would you count the gain from both nodes or just the F3 node?

Far0n · Answer 40 · Wed Sep 14 2016 21:53:42 GMT+0800 (China Standard Time)

@PeteLowth from both nodes
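Read literally, the two answers above suggest something like the following for depth-1 interactions: whenever a split on one feature is the direct predecessor of a split on another, the gains of both nodes are added to that pair's total. The sketch below is only one interpretation of those answers, not xgbfi's actual code. The tree, the node layout, and `pairwise_interaction_gain` are hypothetical, and skipping parent/child pairs that split on the same feature is an assumption.

```python
from collections import defaultdict


def pairwise_interaction_gain(nodes):
    """Sum parent and child gains for each directly connected feature pair.

    nodes: dict mapping node id -> {"feature", "gain", "children"} for split
    nodes only; child ids missing from the dict are treated as leaves.
    """
    totals = defaultdict(float)
    for node in nodes.values():
        for child_id in node["children"]:
            child = nodes.get(child_id)
            if child is None:
                continue  # the child is a leaf, so there is no second split
            if child["feature"] == node["feature"]:
                continue  # assumption: same-feature pairs are not interactions
            key = "|".join(sorted((node["feature"], child["feature"])))
            totals[key] += node["gain"] + child["gain"]
    return dict(totals)


# Hypothetical tree: F1 at the root, F2 and F3 below it, F3 again below F2.
tree = {
    0: {"feature": "F1", "gain": 10.0, "children": [1, 2]},
    1: {"feature": "F2", "gain": 5.0, "children": [3, 4]},
    2: {"feature": "F3", "gain": 2.0, "children": [5, 6]},
    3: {"feature": "F3", "gain": 3.0, "children": [7, 8]},
}
print(pairwise_interaction_gain(tree))
```

In this toy tree the F2|F3 total comes only from node 1 followed directly by node 3; the F3 split at node 2 does not contribute to it because its direct predecessor is F1.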

Yihong Chen · Answer 41 · Mon Jul 17 2017 10:21:15 GMT+0800 (China Standard Time)

Can this package be used in Linux ?

Sabrinaaaaaa · Answer 42 · Thu Nov 23 2017 03:28:38 GMT+0800 (China Standard Time)

@Far0n thank you for implementing this great tool! I am using it recently and it works pretty well. Is it possible to output more than 100 feature interactions?

Far0n · Answer 43 · Thu Nov 23 2017 23:12:33 GMT+0800 (China Standard Time)

@Sabrinaaaaaa yes, just edit

<setting name="TopK" serializeAs="String">
        <value>100</value>
</setting>

in XgbFeatureInteractions.exe.config
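For anyone who prefers to script this change instead of editing the file by hand, a small Python sketch could look like the one below. `set_setting` is a hypothetical helper, and the wrapper elements in the sample (`configuration`, `applicationSettings`) are assumed from the usual .NET config layout; only the `setting`/`value` shape is taken from the snippet above.

```python
import xml.etree.ElementTree as ET


def set_setting(config_xml, name, value):
    """Return the config XML with the named setting's <value> replaced."""
    root = ET.fromstring(config_xml)
    for setting in root.iter("setting"):
        if setting.get("name") == name:
            value_node = setting.find("value")
            if value_node is None:
                value_node = ET.SubElement(setting, "value")
            value_node.text = str(value)
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"setting not found: {name}")


# Assumed minimal config layout for illustration.
SAMPLE_CONFIG = (
    "<configuration><applicationSettings>"
    "<XgbFeatureInteractions.Properties.Settings>"
    '<setting name="TopK" serializeAs="String"><value>100</value></setting>'
    "</XgbFeatureInteractions.Properties.Settings>"
    "</applicationSettings></configuration>"
)

print(set_setting(SAMPLE_CONFIG, "TopK", 500))
```

To apply it to the real file, read XgbFeatureInteractions.exe.config into a string, pass it through the function, and write the result back; keeping a backup copy first is advisable because the serializer may not preserve comments or formatting.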

Sabrinaaaaaa · Answer 44 · Fri Nov 24 2017 05:03:13 GMT+0800 (China Standard Time)

Thank you so much! it works!

dksahuji · Answer 45 · Tue Jan 30 2018 12:08:05 GMT+0800 (China Standard Time)

Please add an example with two trees, so that I understand how the score is computed in general.

Thanks!