Score (prediction) calculated by xgboost does not match what is expected from tree model generated by xgboost
srangwal opened this issue · comments
TL;DR
Due to rounding errors, the score calculated by xgboost for a tree ensemble does not match the score expected from manually evaluating the dumped tree model.
Background
We have our own implementation of tree scoring. While comparing the score our library generates for a given tree model against the score (prediction) that xgboost produces for the same model, we found that, due to rounding errors somewhere in the tree traversal, the scores do not match.
We trained a tree ensemble with xgboost on some training data, and xgboost output the following model:
TREE MODEL
booster[0]:
0:[feature_7<0.0397409] yes=1,no=2,missing=1
1:[feature_3<0.0667176] yes=3,no=4,missing=3
3:[feature_4<0.312636] yes=7,no=8,missing=7
7:leaf=-0.329882
8:leaf=-0.273303
4:[feature_1<0.350854] yes=9,no=10,missing=9
9:leaf=-0.259339
10:leaf=-0.181342
2:[feature_3<0.154725] yes=5,no=6,missing=5
5:[feature_5<0.0974576] yes=11,no=12,missing=11
11:leaf=-0.219728
12:leaf=-0.131518
6:[feature_10<0.0839368] yes=13,no=14,missing=13
13:leaf=-0.126043
14:leaf=-0.0395797
booster[1]:
0:[feature_9<0.033038] yes=1,no=2,missing=1
1:[feature_3<0.0765929] yes=3,no=4,missing=3
3:[feature_5<0.0953409] yes=7,no=8,missing=7
7:leaf=-0.264299
8:leaf=-0.198648
4:[feature_7<0.0226379] yes=9,no=10,missing=9
9:leaf=-0.203953
10:leaf=-0.13269
2:[feature_2<0.190862] yes=5,no=6,missing=5
5:[feature_5<0.130523] yes=11,no=12,missing=11
11:leaf=-0.147195
12:leaf=-0.0626878
6:[feature_2<0.528541] yes=13,no=14,missing=14
13:leaf=-0.0245767
14:leaf=0.0536115
booster[2]:
0:[feature_9<0.0270387] yes=1,no=2,missing=1
1:[feature_1<0.445829] yes=3,no=4,missing=3
3:[feature_6<0.103294] yes=7,no=8,missing=7
7:leaf=-0.242124
8:leaf=-0.190886
4:[feature_2<0.50734] yes=9,no=10,missing=9
9:leaf=-0.157586
10:leaf=-0.0441379
2:[feature_2<0.130774] yes=5,no=6,missing=5 ←----------------- Different branching
5:[feature_11<0.0535646] yes=11,no=12,missing=11
11:leaf=-0.154575
12:leaf=-0.0750511
6:[feature_12<0.562536] yes=13,no=14,missing=13
13:leaf=-0.0398532
14:leaf=0.051261
For this test data
TEST DATA
feature_1=1.0
feature_2=0.13077405095100403
feature_3=0.11696787178516388
feature_4=1.0
feature_5=0.17436540126800537
feature_6=1.0
feature_7=0.02141261100769043
feature_9=0.05511551350355148
feature_10=0.08659037202596664
Xgboost score = 0.401647
Score with our own library = 0.4295020650820223
(NOTE: the scores of the individual trees are summed, and score = sigmoid(sum of scores from each tree).)
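The combination step can be sketched as follows. The leaf values below come from hand-tracing the dumped trees against this test data (tree 0 reaches leaf 10, tree 1 reaches leaf 12); tree 2 is where the two scorers diverge, as discussed next:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Leaf values reached when tracing the dumped trees with the test data above.
# Tree 0 -> leaf 10, tree 1 -> leaf 12; tree 2 differs between the two scorers.
common = [-0.181342, -0.0626878]
xgb_total = sum(common) + (-0.154575)    # xgboost's leaf in tree 2
our_total = sum(common) + (-0.0398532)   # our library's leaf in tree 2

print(sigmoid(xgb_total))  # close to xgboost's 0.401647
print(sigmoid(our_total))  # close to our library's 0.4295020650820223
```

The tiny residual difference from xgboost's reported 0.401647 is itself an artifact of the leaf values being printed with only 6 significant digits in the dump.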
For the line marked "different branching", one can deduce that our library evaluates the condition as false and hence arrives at -0.0398532 as the score of the third tree.
Based on the score generated by xgboost, one can deduce that xgboost evaluates this same condition as true and arrives at -0.154575 as the score of the third tree.
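As a sanity check, a minimal scorer that uses the 6-digit splits exactly as printed in the dump does indeed take the "no" branch in booster[2]. The node table below is hand-transcribed from the dump above; features absent from the test data (such as feature_11 and feature_12) follow the "missing" branch:

```python
# booster[2] from the dump above: node -> (feature, split, yes, no, missing),
# or a float for a leaf.
tree2 = {
    0: ('feature_9', 0.0270387, 1, 2, 1),
    1: ('feature_1', 0.445829, 3, 4, 3),
    3: ('feature_6', 0.103294, 7, 8, 7),
    7: -0.242124, 8: -0.190886,
    4: ('feature_2', 0.50734, 9, 10, 9),
    9: -0.157586, 10: -0.0441379,
    2: ('feature_2', 0.130774, 5, 6, 5),
    5: ('feature_11', 0.0535646, 11, 12, 11),
    11: -0.154575, 12: -0.0750511,
    6: ('feature_12', 0.562536, 13, 14, 13),
    13: -0.0398532, 14: 0.051261,
}

def score(tree, features):
    node = tree[0]
    while isinstance(node, tuple):
        feat, split, yes, no, missing = node
        if feat not in features:
            nxt = missing           # absent feature: take the "missing" branch
        elif features[feat] < split:
            nxt = yes
        else:
            nxt = no
        node = tree[nxt]
    return node

test_data = {
    'feature_1': 1.0, 'feature_2': 0.13077405095100403,
    'feature_3': 0.11696787178516388, 'feature_4': 1.0,
    'feature_5': 0.17436540126800537, 'feature_6': 1.0,
    'feature_7': 0.02141261100769043, 'feature_9': 0.05511551350355148,
    'feature_10': 0.08659037202596664,
}
print(score(tree2, test_data))  # -0.0398532, our library's result
```

With the printed split 0.130774, feature_2 = 0.13077405… compares as not-less-than, so the traversal lands on leaf 13; xgboost's internal (unrounded) split value evidently sits slightly above the feature value, sending it to node 5 instead.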
It's most likely due to differences in rounding.
Trees are printed using the default std::stringstream precision (xgboost/src/tree/tree_model.cc, Line 78 in 51154f4), which usually means that the float split values are represented with 6 significant digits, as you can see in your example. And default rounding, if I remember correctly, is towards zero.
You might try the following hack in order to see more digits in the split values: add fo.precision(18); after that line and rebuild.
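To illustrate why 6 digits can flip a comparison: two distinct single-precision values can print identically at 6 significant digits while sitting on opposite sides of a feature value. The two split values below are hypothetical, chosen only to straddle feature_2 from the test case:

```python
import struct

def f32(x):
    # round-trip a Python float through IEEE-754 single precision
    return struct.unpack('f', struct.pack('f', x))[0]

feature_2 = 0.13077405095100403

# Two hypothetical float32 splits that both print as "0.130774" at 6 digits
lo, hi = f32(0.1307737), f32(0.1307744)
print('%g %g' % (lo, hi))              # both display as 0.130774
print(feature_2 < lo, feature_2 < hi)  # opposite branch decisions
```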
Thanks @khotilov, that helped. Would it be a good idea to set the precision to the full precision of float in xgboost, so as to avoid discrepancies between what xgboost uses as the split value/score and what other libraries consuming xgboost's output use during scoring? If so, I can create a pull request.
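For reference, 9 significant digits are enough to round-trip any finite float32 exactly (17 for double), so a dump precision of at least 9 would remove the ambiguity. A quick check of this round-trip property, with the hypothetical helper f32 standing in for xgboost's internal single-precision storage:

```python
import struct

def f32(x):
    # value actually stored in IEEE-754 single precision
    return struct.unpack('f', struct.pack('f', x))[0]

split = f32(0.13077405095100403)  # a float32 value near the disputed split

# 6 significant digits (the default dump precision) loses the exact value...
assert f32(float('%.6g' % split)) != split
# ...while 9 significant digits always recovers it
assert f32(float('%.9g' % split)) == split
```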
The same would be required for the prediction values as well (Line 312 in 51154f4).