matthieuvion / databricks-xgb

Distributed XGB on Spark with MLflow model tracking. Code & notes on the compatibility of (3!) XGB variants with PySpark/Databricks/MLflow. The notebook works with Community Edition.

Databricks ML quickstart notebook, scaled up

An adaptation of the Databricks "ML training" quickstart notebook.

  • Replace (almost) everything Pandas/Sklearn-related with its Spark/XGBoost counterpart
  • Ensure compatibility with Community Edition (15 GB, 2 workers)

Inside:

  • Part 1: Train a classification model using the "new" distributed implementation of XGBoost, xgboost.spark. We also use grid search & a Spark Pipeline()
  • Part 2: Better hyperparameter tuning with the "classic" xgboost lib and Hyperopt + SparkTrials, plus performance tracking with MLflow.

Changes applied:

Compatibility with Community Edition:

  • In Community Edition the /dbfs/ mount point isn't available, so local-file access to DBFS paths fails. We "fix" that by copying the datasets to a local folder (see the sketch below)
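
A minimal sketch of that workaround using dbutils.fs.cp (the wine-quality path is an assumption, not necessarily the notebook's exact dataset):

```python
# Community Edition workaround: the /dbfs/ FUSE mount is unavailable, so copy the
# dataset from DBFS to a driver-local folder before reading it through a plain file path.
# `dbutils` and `display` are Databricks notebook builtins; the paths are placeholders.
dbutils.fs.cp(
    "dbfs:/databricks-datasets/wine-quality/winequality-white.csv",
    "file:/tmp/winequality-white.csv",
)
display(dbutils.fs.ls("file:/tmp/"))  # sanity check: the local copy now exists
```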

Scale up:

  • TL;DR: replace everything Pandas/Sklearn-related with its Spark counterpart
  • Not really needed, as the dataset is small ;)
  • Load the file and run transformations with Spark (-> Spark DataFrame) instead of Pandas
  • Preprocess using Spark functions (instead of Pandas)
  • Remove everything related to Scikit-learn, replacing it with pyspark.ml or the xgboost lib
  • e.g. xgboost or xgboost.spark instead of sklearn's GradientBoostingClassifier
  • Streamline the modeling workflow using a Spark Pipeline() (see the sketch after this list)
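
A rough sketch of that Spark-native flow with xgboost.spark (the path, column names, binarization threshold and parameter grid are illustrative assumptions, not the notebook's exact code):

```python
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import functions as F
from xgboost.spark import SparkXGBClassifier

# Load with Spark instead of pandas; `spark` is the notebook's SparkSession.
# Uses the local copy made above (placeholder path).
df = spark.read.csv("file:/tmp/winequality-white.csv", header=True, sep=";", inferSchema=True)

# Example Spark-side preprocessing: binarize the target (threshold is an assumption).
df = df.withColumn("label", (F.col("quality") >= 7).cast("int")).drop("quality")

# Assemble the numeric columns into a single features vector.
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"], outputCol="features"
)

# Distributed XGBoost estimator, plugged into a Spark Pipeline like any pyspark.ml stage.
xgb = SparkXGBClassifier(features_col="features", label_col="label", num_workers=2)
pipeline = Pipeline(stages=[assembler, xgb])

# Grid search + cross-validation with pyspark.ml.tuning instead of sklearn's GridSearchCV.
grid = (
    ParamGridBuilder()
    .addGrid(xgb.max_depth, [4, 6])
    .addGrid(xgb.n_estimators, [50, 100])
    .build()
)
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,
)
cv_model = cv.fit(df)
```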

What we keep from the original notebook:

  • General flow / goals / examples
  • In particular: MLflow tracking & distributed parameter optimization with Hyperopt + SparkTrials

Remarks on xgboost + MLflow tracking as of July 2023:

  • This might help if you're lost among all the xgboost implementations and their compatibility:
  • Your choices for XGB: classic xgboost, xgboost.spark (new, distributed), sklearn's GradientBoosting, or sparkdl.xgboost
  • sparkdl.xgboost will soon be deprecated, xgboost.spark is not (yet?) compatible with MLflow autologging, and overall it's a real mess to run grid search & CV in conjunction with MLflow
  • In Part 1, the original notebook uses sklearn + MLflow on successive runs of models with different params. We replace that with xgboost.spark, without MLflow, but with a Pipeline() and hyperparameter optimization.
  • If you still want to monitor with MLflow, you can use the classic xgboost lib with mlflow.xgboost, as shown here
  • In Part 2, the original notebook uses sklearn + grid search with Hyperopt SparkTrials; we replace sklearn with the xgboost lib and keep the optimization with Hyperopt + SparkTrials (see the sketch after this list).
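
A condensed sketch of that Part 2 pattern: classic xgboost trained inside a Hyperopt objective, distributed with SparkTrials and tracked via mlflow.xgboost. X_train / X_val / y_train / y_val are assumed pre-built NumPy/pandas splits, and the search space is illustrative:

```python
import mlflow
import mlflow.xgboost
import numpy as np
import xgboost as xgb
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
from sklearn.metrics import roc_auc_score

search_space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
}

def train_model(params):
    # Autolog params, metrics and the booster for each trial as a nested MLflow run.
    mlflow.xgboost.autolog()
    with mlflow.start_run(nested=True):
        params = {**params, "max_depth": int(params["max_depth"]), "objective": "binary:logistic"}
        # DMatrix objects are built inside the objective so only plain arrays get shipped to workers.
        train = xgb.DMatrix(X_train, label=y_train)
        valid = xgb.DMatrix(X_val, label=y_val)
        booster = xgb.train(
            params, train, num_boost_round=200,
            evals=[(valid, "validation")], early_stopping_rounds=20, verbose_eval=False,
        )
        auc = roc_auc_score(y_val, booster.predict(valid))
        mlflow.log_metric("val_auc", auc)
        return {"loss": -auc, "status": STATUS_OK}

# SparkTrials distributes the trials across the cluster; the parent run groups them in MLflow.
with mlflow.start_run(run_name="xgboost_hyperopt"):
    best_params = fmin(
        fn=train_model, space=search_space, algo=tpe.suggest,
        max_evals=16, trials=SparkTrials(parallelism=2),
    )
```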
