Adaptation of Databricks notebook "ML training".
- Replace (almost) everything Pandas/Sklearn related with its Spark/Xgboost counterpart
- Ensure compatibility with Community Edition (15 GB, 2 workers)
Inside:
- Part 1: Training a classification model using the "new", distributed implementation of xgboost: xgboost.spark. We also use grid search & Spark Pipeline()
- Part 2: Better hyperparameter tuning with "classic" xgboost, Hyperopt + SparkTrials, as well as performance tracking using MLflow.
Changes applied:
Compatibility with community edition:
- In Community edition the /dbfs/ mount point isn't available. We "fix" that by copying the datasets to a local folder
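
The copy step could look like the sketch below. It is a minimal illustration, not the notebook's actual code: it assumes the Databricks `dbutils` object is passed in from the notebook, and the helper names, the `dbfs:/path/to/dataset.csv` source and the `/tmp/datasets` target folder are hypothetical placeholders.

```python
import posixpath


def plan_local_copies(dbfs_paths, local_dir="/tmp/datasets"):
    """Map each dbfs:/ source to a file:/ destination on the driver's local disk."""
    pairs = []
    for src in dbfs_paths:
        filename = posixpath.basename(src.rstrip("/"))
        pairs.append((src, "file:" + posixpath.join(local_dir, filename)))
    return pairs


def copy_to_local(dbutils, dbfs_paths, local_dir="/tmp/datasets"):
    """Copy each file with dbutils.fs.cp and return the plain local paths."""
    local_paths = []
    for src, dst in plan_local_copies(dbfs_paths, local_dir):
        dbutils.fs.cp(src, dst)  # dbfs:/... -> file:/... on the driver
        local_paths.append(dst[len("file:"):])
    return local_paths
```

In a notebook cell this would be called as `copy_to_local(dbutils, ["dbfs:/path/to/dataset.csv"])`. The returned paths can then be used wherever the original notebook relied on the /dbfs/ mount point.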
Scale up:
- TL;DR: replace everything pandas/sklearn-related with its Spark/xgboost counterpart
- Not really needed here, as we have a small dataset ;)
- Load the file and apply transformations with Spark (-> Spark DataFrame) instead of Pandas
- Preprocessing using Spark functions (instead of Pandas)
- Remove everything related to Scikit-learn, replace with pyspark.ml or the xgboost lib
- e.g. xgboost or xgboost.spark instead of sklearn GradientBoostingClassifier
- Streamline our modeling workflow, using Spark Pipeline()
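
As a rough sketch of what such a workflow can look like, the function below chains a VectorAssembler and a SparkXGBClassifier in a Pipeline() and wraps it in a CrossValidator for grid search. The column names, grid values, fold count and helper names are assumptions made for illustration, not the notebook's actual settings. The pyspark/xgboost imports sit inside the function so the module can be loaded without a Spark session.

```python
import math

# Hypothetical grid: two values per parameter -> 4 candidate models.
PARAM_GRID = {"max_depth": [3, 5], "learning_rate": [0.1, 0.3]}


def grid_size(param_grid):
    """Number of parameter combinations the grid search will try."""
    return math.prod(len(values) for values in param_grid.values())


def build_cross_validator(feature_cols, label_col="label", num_workers=2, num_folds=3):
    """Assemble features -> xgboost.spark classifier, tuned by grid search + CV."""
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from xgboost.spark import SparkXGBClassifier

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    xgb = SparkXGBClassifier(
        features_col="features", label_col=label_col, num_workers=num_workers
    )
    pipeline = Pipeline(stages=[assembler, xgb])

    builder = ParamGridBuilder()
    for name, values in PARAM_GRID.items():
        builder = builder.addGrid(xgb.getParam(name), values)

    evaluator = BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC"
    )
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=builder.build(),
        evaluator=evaluator,
        numFolds=num_folds,
    )
```

Calling `build_cross_validator(feature_cols).fit(train_df)` on a Spark DataFrame would train `grid_size(PARAM_GRID) * num_folds` models, which is worth keeping in mind on a small cluster.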
What we keep from the original notebook:
- General flow / goals / examples
- Particularly: MLflow tracking & distributed hyperparameter optimization with Hyperopt + SparkTrials
Remarks on xgboost + MLflow tracking as of July 2023:
- This might help if you are lost among the xgboost implementations and their compatibility:
- Your choices for XGB: classic xgboost, xgboost.spark (new, distributed), sklearn GradientBoosting, or sparkdl.xgboost
- sparkdl will soon be deprecated, xgboost.spark is not (yet?) compatible with MLflow autolog, and overall it seems a real mess to perform grid search & CV in conjunction with MLflow
- In part 1, the original notebook uses sklearn + MLflow on successive runs of models with different params. We replace it with xgboost.spark, without MLflow, but with a Pipeline() and hyperparameter optimization.
- If you still wanted to monitor with MLflow, you could use the classic xgboost lib with mlflow.xgboost, as shown here
- In part 2, the original notebook uses sklearn + grid search with Hyperopt.SparkTrials. We replace sklearn with the xgboost lib and keep the optimization with Hyperopt + SparkTrials.
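
The part 2 setup can be sketched roughly as follows: classic xgboost trained inside a Hyperopt objective, trials distributed with SparkTrials, and each trial logged to MLflow. The search space, the number of boosting rounds, the metric name and the function names are illustrative assumptions rather than the notebook's actual values. The third-party imports are kept inside the function so the helper can be used on its own.

```python
import math


def to_xgb_params(sample):
    """Turn a raw Hyperopt sample into xgboost params (quniform yields floats)."""
    return {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "max_depth": int(sample["max_depth"]),
        "eta": sample["eta"],
    }


def tune_xgb(X_train, y_train, X_val, y_val, max_evals=16, parallelism=2):
    """Search hyperparameters in parallel; returns the best raw Hyperopt sample."""
    import mlflow
    import xgboost as xgb
    from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

    space = {
        "max_depth": hp.quniform("max_depth", 3, 8, 1),
        "eta": hp.loguniform("eta", math.log(0.01), math.log(0.3)),
    }

    def objective(sample):
        params = to_xgb_params(sample)
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dval = xgb.DMatrix(X_val, label=y_val)
        evals_result = {}
        with mlflow.start_run(nested=True):
            xgb.train(
                params,
                dtrain,
                num_boost_round=100,
                evals=[(dval, "val")],
                evals_result=evals_result,
                verbose_eval=False,
            )
            auc = evals_result["val"]["auc"][-1]
            mlflow.log_params(params)
            mlflow.log_metric("val_auc", auc)
        # Hyperopt minimizes the loss, so negate the AUC.
        return {"loss": -auc, "status": STATUS_OK}

    with mlflow.start_run(run_name="xgb_hyperopt"):
        return fmin(
            fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=max_evals,
            trials=SparkTrials(parallelism=parallelism),
        )
```

With 2 workers, a `parallelism` of 2 is a natural starting point. Higher values run more trials at once but give the TPE algorithm fewer completed results to learn from between suggestions.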