SparkXGBoost is a Spark implementation of gradient boosted trees that uses a second-order approximation of an arbitrary user-defined loss function. SparkXGBoost is inspired by the XGBoost project.
SparkXGBoost is distributed under the Apache License 2.0.
The XGBoost team have a fantastic introduction to gradient boosting trees.
The current version of SparkXGBoost supports supervised learning with gradient boosted trees, using a second-order approximation of an arbitrary user-defined loss function. SparkXGBoost ships with the following Loss classes:
- SquareLoss for linear (normal) regression
- LogisticLoss for binary classification
- PoissonLoss for Poisson regression of count data
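For reference, the second-order approximation is usually written as below in the XGBoost literature. This is a sketch of the standard formulation, assuming SparkXGBoost follows it; the symbols $g_i$, $h_i$, $f_t$ and $\Omega$ are introduced here for illustration and do not appear in the SparkXGBoost code. At boosting round $t$, the loss $l$ is expanded around the previous prediction $\hat{y}_i^{(t-1)}$, so only the first and second derivatives of the loss are needed to fit the new tree $f_t$:

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}
```

This is why any loss with well-defined first and second derivatives can, in principle, be plugged in.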
To avoid overfitting, SparkXGBoost employs the following regularization methods:
- Shrinkage by learning rate (aka step size)
- L2 regularization term on node weights
- L1 regularization term on node weights
- Stochastic gradient boosting (similar to Bagging)
- Feature subsampling for learning nodes
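To make the first three items concrete, here is a minimal sketch of how shrinkage and the L1/L2 terms typically enter the weight of a single leaf. It assumes the standard XGBoost-style formula and is not taken from the SparkXGBoost source; `LeafWeightSketch`, `softThreshold` and `leafWeight` are hypothetical names.

```scala
// Hypothetical sketch (not SparkXGBoost code): regularized leaf weight,
// assuming the XGBoost-style closed form.
object LeafWeightSketch {
  // L1 term: shrink the gradient sum towards zero by alpha (soft-thresholding).
  def softThreshold(g: Double, alpha: Double): Double =
    if (g > alpha) g - alpha
    else if (g < -alpha) g + alpha
    else 0.0

  // gradSum / hessSum: sums of first / second derivatives over the leaf's instances.
  // lambda: L2 term, alpha: L1 term, eta: learning rate (shrinkage).
  def leafWeight(gradSum: Double, hessSum: Double,
                 lambda: Double, alpha: Double, eta: Double): Double = {
    require(hessSum + lambda > 0.0, "hessSum + lambda must be positive")
    -eta * softThreshold(gradSum, alpha) / (hessSum + lambda)
  }
}
```

Larger `lambda` or `alpha` and smaller `eta` all pull the leaf weight towards zero, which is what makes each individual tree more conservative.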
SparkXGBoost can process multiple learning nodes in a single pass over the training data to improve efficiency.
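The idea of handling several nodes in one pass can be sketched with plain Scala collections. This is an illustration only, assuming each instance is tagged with the node it currently falls into; it is not the actual distributed implementation, and `OnePassSketch`, `Stats` and `aggregate` are hypothetical names.

```scala
// Hypothetical sketch (not SparkXGBoost code): accumulate gradient statistics
// for several tree nodes while scanning the data only once.
object OnePassSketch {
  // Running totals for one node.
  case class Stats(gradSum: Double, hessSum: Double, count: Long) {
    def add(g: Double, h: Double): Stats = Stats(gradSum + g, hessSum + h, count + 1)
  }

  // Each record is (nodeId the instance currently belongs to, gradient, hessian).
  // Only nodes listed in activeNodes are being learned in this pass.
  def aggregate(records: Seq[(Int, Double, Double)],
                activeNodes: Set[Int]): Map[Int, Stats] =
    records.foldLeft(Map.empty[Int, Stats]) { case (acc, (nodeId, g, h)) =>
      if (activeNodes.contains(nodeId)) {
        val current = acc.getOrElse(nodeId, Stats(0.0, 0.0, 0L))
        acc.updated(nodeId, current.add(g, h))
      } else acc
    }
}
```

A single scan fills the statistics of every active node, so the number of passes over the data grows with the number of node batches rather than the number of nodes.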
SparkXGBoost implements the Spark ML Pipeline API, allowing you to easily run a sequence of algorithms to process and learn from data.
- SparkXGBoostRegressor and SparkXGBoostRegressionModel are the predictor and model for continuous labels.
- SparkXGBoostClassifier and SparkXGBoostClassificationModel are the predictor and model for categorical labels.
In the constructors of SparkXGBoostRegressor and SparkXGBoostClassifier, users need to supply an instance of the Loss class to define the loss function and its derivatives. SparkXGBoost currently comes with SquareLoss for linear (normal) regression, LogisticLoss for binary classification, and PoissonLoss for Poisson regression of count data. Additional loss functions can be specified by sub-classing Loss.
abstract class Loss {
  // The 1st derivative
  def diff1(label: Double, f: Double): Double

  // The 2nd derivative
  def diff2(label: Double, f: Double): Double

  // Generate prediction from the score suggested by the tree ensemble
  // For regression, prediction is the label
  // For classification, prediction is the probability in each class
  def toPrediction(score: Double): Double

  // Calculate bias
  def getInitialBias(input: RDD[LabeledPoint]): Double
}
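As an illustration of what a user-defined loss might look like, the sketch below implements the derivatives of the log-cosh loss, `log(cosh(f - label))`, which behaves like squared loss for small residuals and like absolute loss for large ones. To stay runnable without Spark, it extends a stand-in trait, `PointwiseLoss` (a hypothetical name), that mirrors only the per-instance methods of `Loss`; a real subclass would extend `Loss` and also implement `getInitialBias`. Taking the derivatives with respect to `f` is an assumption about the intended convention.

```scala
// Hypothetical stand-in for the per-instance methods of Loss (no Spark needed).
trait PointwiseLoss {
  def diff1(label: Double, f: Double): Double
  def diff2(label: Double, f: Double): Double
  def toPrediction(score: Double): Double
}

// Log-cosh loss: loss(label, f) = log(cosh(f - label)).
object LogCoshLoss extends PointwiseLoss {
  // d/df log(cosh(f - label)) = tanh(f - label)
  def diff1(label: Double, f: Double): Double = java.lang.Math.tanh(f - label)

  // d^2/df^2 log(cosh(f - label)) = 1 - tanh(f - label)^2, always in [0, 1]
  def diff2(label: Double, f: Double): Double = {
    val t = java.lang.Math.tanh(f - label)
    1.0 - t * t
  }

  // Regression: the ensemble score is used directly as the prediction.
  def toPrediction(score: Double): Double = score
}
```

Because the second derivative approaches zero for large residuals, a loss like this probably benefits from a positive `lambda` to keep leaf weights bounded.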
Please see the example below for typical usage. trainingData is a DataFrame with the labels stored in a column named "label" and the feature vectors stored in a column named "features". Similarly, testData is a DataFrame with the feature vectors stored in a column named "features".
Please note that the feature vectors have to be indexed before being fed to the pipeline to ensure the categorical variables are correctly encoded with metadata.
Currently, all categorical variables are assumed to be ordered. Unordered categorical variables can be used for training after being encoded with OneHotEncoder.
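import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorIndexer

// Index the feature vectors so that categorical features carry metadata
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(2)
  .fit(trainingData)

// Gradient boosted regressor with squared loss
val sparkXGBoostRegressor = new SparkXGBoostRegressor(new SquareLoss)
  .setFeaturesCol("indexedFeatures")
  .setMaxDepth(2)
  .setNumTrees(5)

val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, sparkXGBoostRegressor))

// Fit on the training data, then score the test data
val model = pipeline.fit(trainingData)
val prediction = model.transform(testData)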
The following parameters can be specified by the setters.
- labelCol [default="label"]
  - the name of the label column of the DataFrame
  - String
- featuresCol [default="features"]
  - the name of the feature column of the DataFrame
  - String
- numTrees [default=1]
  - number of trees to be grown in the boosting algorithm.
  - Int, range: [1, ∞)
- maxDepth [default=5]
  - maximum depth of a tree. A tree with one root and two leaves is considered to have depth = 1.
  - Int, range: [1, ∞)
- lambda [default=0]
  - L2 regularization term on weights.
  - Double, range: [0, ∞)
- alpha [default=0]
  - L1 regularization term on weights.
  - Double, range: [0, ∞)
- gamma [default=0]
  - minimum loss reduction required to make a further partition on a leaf node of the tree.
  - Double, range: [0, ∞)
- eta [default=1.0]
  - learning rate (aka step size) for gradient boosting.
  - Double, range: (0, 1]
- minInstanceWeight [default=1]
  - minimum weight (aka number of data instances) required to make a further partition on a leaf node of the tree.
  - Double, range: [0, ∞)
- sampleRatio [default=1.0]
  - sample ratio of rows in bagging.
  - Double, range: (0, 1]
- featureSampleRatio [default=1.0]
  - sample ratio of columns when constructing each tree.
  - Double, range: (0, 1]
- maxConcurrentNodes [default=50]
  - maximal number of nodes to be processed in one pass of the training data.
  - Int, range: [1, ∞)
- maxBins [default=32]
  - maximal number of bins for continuous variables.
  - Int, range: [2, ∞)
- seed [default = some random value]
  - seed of sampling.
  - Long
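To show how lambda, gamma and minInstanceWeight interact when a split is considered, here is a minimal sketch based on the XGBoost-style gain formula. It assumes SparkXGBoost scores splits in the same way, which this document does not confirm; treating the hessian sum as the instance weight is likewise an assumption borrowed from XGBoost, and `SplitGainSketch`, `gain` and `shouldSplit` are hypothetical names.

```scala
// Hypothetical sketch (not SparkXGBoost code): XGBoost-style split decision.
object SplitGainSketch {
  // Quality score of a node with gradient sum g and hessian sum h.
  // Assumes h + lambda > 0.
  private def score(g: Double, h: Double, lambda: Double): Double =
    g * g / (h + lambda)

  // Loss reduction from splitting a node into left (gL, hL) and right (gR, hR),
  // minus the minimum required reduction gamma.
  def gain(gL: Double, hL: Double, gR: Double, hR: Double,
           lambda: Double, gamma: Double): Double =
    0.5 * (score(gL, hL, lambda) + score(gR, hR, lambda) -
      score(gL + gR, hL + hR, lambda)) - gamma

  // Split only if both children are heavy enough and the net gain is positive.
  def shouldSplit(gL: Double, hL: Double, gR: Double, hR: Double,
                  lambda: Double, gamma: Double, minInstanceWeight: Double): Boolean =
    hL >= minInstanceWeight && hR >= minInstanceWeight &&
      gain(gL, hL, gR, hR, lambda, gamma) > 0.0
}
```

In this sketch, raising gamma or minInstanceWeight rejects more candidate splits, and raising lambda shrinks the gain of every split.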
The following parameters can be specified by the setters in SXGBoostModel.
- predictionCol [default="prediction"]
  - the name of the prediction column of the DataFrame
  - String
- featuresCol [default="features"]
  - the name of the feature column of the DataFrame
  - String
SparkXGBoost has been tested with Spark 1.5.1 and Scala 2.10.
Releases of SparkXGBoost are available on spark-packages.org. You can follow the "How to" for spark-shell, sbt or maven.
As SparkXGBoost is currently under active development, the spark-packages.org release might not always include the latest updates.
You can access the latest code by compiling from source.
Step 1: clone the project from GitHub
git clone https://github.com/rotationsymmetry/sparkxgboost.git
Step 2: compile and package the jar using sbt
cd sparkxgboost
sbt clean package
You should be able to find the jar file at target/scala-2.10/sparkxgboost_2.10-x.y.z.jar
Step 3: load it in your Spark project
- If you are using spark-shell, you can type in
./spark-shell --jars path/to/sparkxgboost_2.10-x.y.z.jar
- If you are building a Spark application with sbt, you can put the jar file into the lib folder next to src. Then sbt should be able to put SparkXGBoost on your class path.
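For the sbt route, a build.sbt along the following lines is probably enough. This is a sketch under a few assumptions: the project name is a placeholder, the Scala patch version 2.10.5 is chosen for illustration, and Spark is marked "provided" on the assumption that the cluster supplies it at runtime. The jar in lib needs no entry, since sbt picks up unmanaged jars from that folder automatically.

```scala
// build.sbt (illustrative; name and Scala patch version are placeholders)
name := "my-spark-app"

scalaVersion := "2.10.5"

// Spark itself is expected to be supplied by spark-submit or the cluster.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.5.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.5.1" % "provided"
)
```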
I have the following tentative roadmap for the upcoming releases:
0.3
- Post-pruning
0.4
- Automatically determine the maximal number of concurrent nodes by memory management
0.5
- Multi-class classification
0.6
- Unordered categorical variables
Many thanks for testing SparkXGBoost!
You can file bug reports or provide suggestions using GitHub Issues.
If you would like to improve the codebase, please don't hesitate to submit a pull request.