HowardRiddiough / deploy-sklearn-in-pyspark

Deploying Python ML models in PySpark using Pandas UDFs

Python ML Deployment in PySpark Using Pandas UDFs

This repo includes a notebook that defines a versatile Python function for deploying Python ML models in PySpark. Several examples demonstrate how this works:

  • Deploying a RandomForestRegressor in PySpark
  • Deploying an ML Pipeline that scales numerical features
  • Deploying an ML Pipeline that preprocesses mixed feature types

Introducing the spark_predict function: a vessel for Python ML deployment in PySpark

The spark_predict function defined below makes it possible to generate predictions in PySpark with sophisticated Python ML models.

spark_predict is a wrapper around a pandas_udf. The wrapper is needed so that a Python ML model can be passed to the pandas_udf, which itself only receives DataFrame columns.

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql import Column
from pyspark.sql.types import DoubleType


def spark_predict(model, cols) -> Column:
    """Deploy a Python ML model in PySpark using the `predict` method of `model`.

    Args:
        model: Python ML model with the sklearn API.
        cols (list-like): Features used for predictions, required to be present as columns in the Spark DataFrame used to make predictions.
    """
    @sf.pandas_udf(returnType=DoubleType())
    def predict_pandas_udf(*series):
        # series will be a tuple of pandas.Series here, one per feature column.
        x = pd.concat(series, axis=1)
        # model is captured from the enclosing spark_predict call.
        return pd.Series(model.predict(x))

    return predict_pandas_udf(*cols)
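
To make the wrapper pattern concrete, the following is a minimal sketch of what the UDF body does for a single batch, run with plain pandas and scikit-learn and no Spark session. The helper name make_batch_predictor, the column names x1 and x2, and the synthetic data are assumptions for illustration only and do not come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def make_batch_predictor(model):
    """Return a function with the same body as the pandas UDF, minus the Spark decorator."""

    def predict_batch(*series):
        # Each positional argument is one feature column as a pandas.Series.
        x = pd.concat(series, axis=1)
        # model comes from the enclosing scope; it is never passed as an argument.
        return pd.Series(model.predict(x))

    return predict_batch


# Hypothetical training data with two numeric features.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
target = 3.0 * train["x1"] - 2.0 * train["x2"]

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(train, target)

# Simulate one batch: one pandas.Series per feature column.
predict_batch = make_batch_predictor(model)
batch_predictions = predict_batch(train["x1"].head(5), train["x2"].head(5))
print(batch_predictions)
```

The closure is the key design choice: the function that sees the data takes only columns, so the model has to reach it through the enclosing scope rather than through its argument list.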

Python ML Deployment in practice

The deploying-python-ml-in-pyspark notebook demonstrates how spark_predict can be used to deploy Python ML models in PySpark. It shows that spark_predict can deploy simple ML models as well as more sophisticated pipelines.

I often use both categorical and numerical features in predictive models, so I have included an example built around an sklearn Pipeline that scales numerical data and encodes categorical data. This pipeline chains two preprocessing pipelines with a random forest to create a full prediction pipeline that transforms categorical and numerical data and fits a model. This pipeline is also deployed in PySpark using the spark_predict function.
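
A pipeline of the kind described above might look like the following sketch. It is not the notebook's code: the column names (age, income, city), the toy data, and the choice of StandardScaler and OneHotEncoder are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature names, not taken from the notebook.
numeric_features = ["age", "income"]
categorical_features = ["city"]

train = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0, 38.0, 29.0],
    "income": [30.0, 45.0, 80.0, 95.0, 60.0, 40.0],
    "city": ["leeds", "york", "leeds", "bath", "york", "bath"],
})
target = pd.Series([1.0, 1.5, 3.0, 3.5, 2.2, 1.3])

# One preprocessing pipeline per feature type.
numeric_pipeline = Pipeline([("scale", StandardScaler())])
categorical_pipeline = Pipeline([
    # Categories unseen during fit are encoded as all zeros instead of raising.
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing pipeline.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Preprocessing followed by the random forest forms the full prediction pipeline.
full_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])
full_pipeline.fit(train, target)

predictions = full_pipeline.predict(train)

# A row with a city that was not present at fit time still yields a prediction.
new_row = pd.DataFrame({"age": [40.0], "income": [70.0], "city": ["hull"]})
new_prediction = full_pipeline.predict(new_row)
```

Because the fitted full_pipeline exposes a predict method, it satisfies the model contract that spark_predict documents. This sketch selects columns by name, so the column names the pipeline sees at prediction time should be checked against those used at fit time.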

Requirements

See requirements.txt.

PySpark Installation

The code in the deploying-python-ml-in-pyspark notebook requires PySpark to be installed. Installation of PySpark is left to the user.
