kingfengji / mGBDT

This is the official clone for the implementation of the NIPS18 paper Multi-Layered Gradient Boosting Decision Trees (mGBDT).

Performance of your model on regression tasks

KiwiAthlete opened this issue

Description

@kingfengji Thanks for making the code available. I believe that multi-layered decision trees are a very elegant and powerful approach! I was applying your model to the Boston housing dataset but wasn't able to outperform a baseline xgboost model.

Details

To compare your approach to several alternatives, I ran a small benchmark study with the following models, all sharing the same hyper-parameters (a rough sketch of the setup follows the list):

  • baseline xgboost model (xgboost)
  • mGBDT with xgboost for hidden and output layer (mGBDT_XGBoost)
  • mGBDT with xgboost for hidden but with linear model for output layer (mGBDT_Linear)
  • linear model as implemented here (Linear)
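Roughly, the setup looks like the sketch below (a simplified excerpt of what the attached notebook does; the MGBDT / MultiXGBModel calls follow my reading of the README, and the layer sizes, the loss argument, and the reshape of y are just what I chose, so they may well need adjusting). For mGBDT_Linear I simply swap the output-layer F for the linear model from this repo.

```python
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

from mgbdt import MGBDT, MultiXGBModel  # from this repository

X, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline xgboost regressor (same depth / learning rate as the mGBDT layers)
baseline = xgb.XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
baseline.fit(x_train, y_train)
print("xgboost MAE:", mean_absolute_error(y_test, baseline.predict(x_test)))

# mGBDT_XGBoost: one hidden tp_layer + xgboost output layer, trained with L1Loss.
# NOTE: I pass the loss as a string name here; whether the constructor expects
# the string or a torch loss instance is part of what I am unsure about.
net = MGBDT(loss="L1Loss", target_lr=1.0, epsilon=0.1)
net.add_layer("tp_layer",
    F=MultiXGBModel(input_size=13, output_size=16, learning_rate=0.1, max_depth=5, num_boost_round=5),
    G=None)
net.add_layer("tp_layer",
    F=MultiXGBModel(input_size=16, output_size=1, learning_rate=0.1, max_depth=5, num_boost_round=5),
    G=MultiXGBModel(input_size=1, output_size=16, learning_rate=0.1, max_depth=5, num_boost_round=5))
net.init(x_train, n_rounds=5)
net.fit(x_train, y_train.reshape(-1, 1), n_epochs=50)  # y reshaped to 2D for a 1-dim output layer
print("mGBDT_XGBoost MAE:", mean_absolute_error(y_test, net.forward(x_test).ravel()))
```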

I am using PyTorch's L1Loss for model training and the MAE for evaluation, and all models are trained in serial mode. The results are as follows:

[image: benchmark results for the four models]

In particular, I observe the following

  • irrespective of the hyper-parameters and the number of epochs, a baseline xgboost model tends to outperform your approach
  • with an increasing number of epochs, the runtime per epoch increases considerably. Any idea why this happens?
  • using mGBDT_Linear,
    • I wasn't able to use PyTorch's MSELoss since the loss exploded after some iterations, even after normalizing X. Should we, similar to neural networks, also scale y to avoid exploding gradients (see the sketch after this list)?
    • the training loss starts at exceptionally high values, then decreases before it starts to increase again
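Regarding the exploding MSELoss, what I have in mind is standardizing the target as well, something along these lines (plain sklearn, nothing mGBDT-specific):

```python
from sklearn.preprocessing import StandardScaler

# Standardize X and y on the training split; map predictions back afterwards.
x_scaler = StandardScaler().fit(x_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

x_train_s = x_scaler.transform(x_train)
x_test_s = x_scaler.transform(x_test)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1))

# ... fit the mGBDT model on (x_train_s, y_train_s) exactly as above ...
# y_pred = y_scaler.inverse_transform(net.forward(x_test_s))
```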

Additional Questions

  • Given that you have mostly been using your approach for classification tasks, is there anything we need to change before using it for regression tasks, other than the PyTorch loss?
  • Besides the loss of F, can we also track how well the target propagation is working by evaluating the reconstruction loss of G? (A rough sketch of what I mean follows this list.)
  • When using mGBDT with a linear output layer, would we expect to generally see better results compared to using xgboost for the output layer?
  • What is the benefit of using a linear output layer compared to an xgboost layer?
  • For training F and G, you are currently using the MSELoss for the xgboost models. Do you have some experience with modifying this loss?
  • What is the effect of the number of iterations for initializing the model before training?
  • What is the relationship between the number of boosting iterations (for xgboost training) and the number of epochs (for MGBDT training)?
  • In Section 4 of your paper you state "The experiments for this section is mainly designed to empirically examine if it is feasible to jointly train the multi-layered structure proposed by this work. That is, we make no claims that the current structure can outperform CNNs in computer vision tasks." So, as a question, does that mean your intention is not to outperform existing deep-learning models, say CNNs, or existing GBM models like XGBoost, but rather to show that a decision-tree-based model can also be used to learn meaningful representations that can then be used for downstream tasks?
  • Connected to the previous question: gradient boosting models are already very strong learners that obtain very good results in many applications. So what would be your motivation for using multiple layers of such a model? Could it even happen that, given the implicit error-correction mechanism of GBMs, stacking several of them leads to a drop in accuracy?
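For the reconstruction-loss question above, this is roughly what I had in mind. It assumes that net.layers is iterable and that each layer's F and G expose a predict method; those attribute names are my guess at the internals rather than a documented API, which is part of why I'm asking.

```python
import numpy as np

def reconstruction_losses(net, x):
    """Mean squared reconstruction error ||G(F(h)) - h||^2 per layer with an inverse mapping.

    Assumes `net.layers` and per-layer `F.predict` / `G.predict` exist -- these
    names are my guess at the internals, not a documented API.
    """
    losses, h = [], x
    for layer in net.layers:
        h_next = layer.F.predict(h)
        if getattr(layer, "G", None) is not None:
            h_rec = layer.G.predict(h_next)
            losses.append(float(np.mean((h_rec - h) ** 2)))
        h = h_next
    return losses
```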

Code

To reproduce the results, you can use the attached notebook.

ModelComparison.zip

@kingfengji I would highly appreciate your feedback. Many thanks.