dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Home Page: https://xgboost.readthedocs.io/en/stable/

Score (prediction) calculated by xgboost does not match what is expected from tree model generated by xgboost

srangwal opened this issue

TL;DR

I am seeing that, due to rounding errors, the score calculated by xgboost for a tree ensemble does not match the score expected from manually traversing the dumped tree model.

Background

We have our own implementation of tree scoring. While comparing the score that our library generates for a given tree model against the score (prediction) that xgboost produces for the same model, we find that, due to rounding errors somewhere in the tree traversal, the scores do not match.

We trained a tree ensemble with xgboost on some training data, and xgboost dumped the following model:

TREE MODEL

 booster[0]:
 0:[feature_7<0.0397409] yes=1,no=2,missing=1
         1:[feature_3<0.0667176] yes=3,no=4,missing=3
                 3:[feature_4<0.312636] yes=7,no=8,missing=7
                         7:leaf=-0.329882
                         8:leaf=-0.273303
                 4:[feature_1<0.350854] yes=9,no=10,missing=9
                         9:leaf=-0.259339
                         10:leaf=-0.181342
         2:[feature_3<0.154725] yes=5,no=6,missing=5
                 5:[feature_5<0.0974576] yes=11,no=12,missing=11
                         11:leaf=-0.219728
                         12:leaf=-0.131518
                 6:[feature_10<0.0839368] yes=13,no=14,missing=13
                         13:leaf=-0.126043
                         14:leaf=-0.0395797
 booster[1]:
 0:[feature_9<0.033038] yes=1,no=2,missing=1
         1:[feature_3<0.0765929] yes=3,no=4,missing=3
                 3:[feature_5<0.0953409] yes=7,no=8,missing=7
                         7:leaf=-0.264299
                         8:leaf=-0.198648
                 4:[feature_7<0.0226379] yes=9,no=10,missing=9
                         9:leaf=-0.203953
                         10:leaf=-0.13269
         2:[feature_2<0.190862] yes=5,no=6,missing=5
                 5:[feature_5<0.130523] yes=11,no=12,missing=11
                         11:leaf=-0.147195
                         12:leaf=-0.0626878
                 6:[feature_2<0.528541] yes=13,no=14,missing=14
                         13:leaf=-0.0245767
                         14:leaf=0.0536115
 booster[2]:
 0:[feature_9<0.0270387] yes=1,no=2,missing=1
         1:[feature_1<0.445829] yes=3,no=4,missing=3
                 3:[feature_6<0.103294] yes=7,no=8,missing=7
                         7:leaf=-0.242124
                         8:leaf=-0.190886
                 4:[feature_2<0.50734] yes=9,no=10,missing=9
                         9:leaf=-0.157586
                         10:leaf=-0.0441379
         2:[feature_2<0.130774] yes=5,no=6,missing=5         ←----------------- Different branching
                 5:[feature_11<0.0535646] yes=11,no=12,missing=11
                         11:leaf=-0.154575
                         12:leaf=-0.0750511
                 6:[feature_12<0.562536] yes=13,no=14,missing=13
                         13:leaf=-0.0398532
                         14:leaf=0.051261

For this test data

TEST DATA

feature_1=1.0
feature_2=0.13077405095100403
feature_3=0.11696787178516388
feature_4=1.0
feature_5=0.17436540126800537
feature_6=1.0
feature_7=0.02141261100769043
feature_9=0.05511551350355148
feature_10=0.08659037202596664

Xgboost score = 0.401647
Score with our own library = 0.4295020650820223

(NOTE: the scores of the individual trees are summed and the final score = sigmoid(sum of scores from each tree))

For the line marked "Different branching", one can deduce that our library evaluates the condition as false and hence uses -0.0398532 as the score of the third tree.

Based on the score generated by xgboost, one can deduce that xgboost evaluates this same condition as true and uses -0.154575 as the score of the third tree.
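
To make this concrete, here is a small sketch that reproduces both reported scores from the leaf values read off the dump above, using the summation-then-sigmoid rule from the note; the per-tree paths in the comments were obtained by traversing the dump with the test data.

 #include <cmath>
 #include <cstdio>

 int main() {
   // Leaf values on the path taken for the test row, read from the dump above.
   float tree0 = -0.181342f;      // booster[0]: node 0 -> 1 -> 4 -> 10
   float tree1 = -0.0626878f;     // booster[1]: node 0 -> 2 -> 5 -> 12
   float tree2_yes = -0.154575f;  // booster[2] if feature_2 < split is true
   float tree2_no  = -0.0398532f; // booster[2] if feature_2 < split is false

   auto sigmoid = [](double x) { return 1.0 / (1.0 + std::exp(-x)); };

   std::printf("xgboost score:     %f\n", sigmoid(tree0 + tree1 + tree2_yes)); // ~0.401647
   std::printf("our library score: %f\n", sigmoid(tree0 + tree1 + tree2_no));  // ~0.429502
   return 0;
 }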

It's most likely due to differences in rounding.

Trees are printed using the default std::stringstream precision:

std::stringstream fo("");

which usually means that the float split values are represented with 6 significant digits, as you can see from your example. And the default rounding, if I remember correctly, is towards zero.

You might try the following hack in order to see more digits in the split value: add fo.precision(18); after that line and rebuild.
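
For illustration only, here is a minimal sketch of how a stored split value slightly larger than feature_2 can be dumped as 0.130774 and flip the comparison; the value 0.1307741f is hypothetical, since the split actually stored by xgboost is unknown until it is dumped with higher precision.

 #include <iostream>
 #include <sstream>

 int main() {
   // Hypothetical split value: prints as 0.130774 at the default precision,
   // but is actually larger than the test row's feature_2.
   float split = 0.1307741f;
   float feature_2 = 0.13077405095100403f;

   std::stringstream fo("");  // default precision: 6 significant digits
   fo << split;
   std::cout << "dumped split:         " << fo.str() << "\n";  // 0.130774

   std::stringstream hi("");
   hi.precision(18);          // the suggested hack: show more digits
   hi << split;
   std::cout << "high-precision split: " << hi.str() << "\n";

   std::cout << std::boolalpha
             << "feature_2 < split (what xgboost evaluates):    "
             << (feature_2 < split) << "\n"                     // true
             << "feature_2 < 0.130774f (what the dump implies): "
             << (feature_2 < 0.130774f) << "\n";                // false
   return 0;
 }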

Thanks @khotilov, that helped. Would it be a good idea to set the precision to the maximum precision of float in xgboost, so as to avoid discrepancies between what xgboost uses as the split value/score and what other libraries consuming the xgboost output use during scoring? If so, I can create a pull request.

The same would be required for the prediction values as well (the stream created with dmlc::ostream os(fo.get());).
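
As a sketch of what such a change could look like (an assumption about the fix, not the actual patch): printing with std::numeric_limits<float>::max_digits10, which is 9 for IEEE-754 single-precision floats, guarantees that the dumped text parses back to exactly the same float, so external scorers see the same split values and leaf scores.

 #include <iostream>
 #include <limits>
 #include <sstream>
 #include <string>

 // Hypothetical helper illustrating a round-trip-safe float dump.
 std::string DumpFloat(float value) {
   std::stringstream fo("");
   fo.precision(std::numeric_limits<float>::max_digits10);  // 9 for float
   fo << value;
   return fo.str();
 }

 int main() {
   float split = 0.1307741f;        // hypothetical split value from above
   std::string text = DumpFloat(split);
   float parsed = std::stof(text);  // parsing the dump recovers the float
   std::cout << text << " round-trips exactly: " << std::boolalpha
             << (parsed == split) << "\n";  // true
   return 0;
 }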