JPMML-SkLearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML.

Features

Supported Estimator and Transformer types:
- Clustering:
  - cluster.KMeans
  - cluster.MiniBatchKMeans
- Composite Estimators:
  - compose.ColumnTransformer
  - compose.TransformedTargetRegressor
- Matrix Decomposition:
  - decomposition.PCA
  - decomposition.IncrementalPCA
- Discriminant Analysis:
  - discriminant_analysis.LinearDiscriminantAnalysis
- Dummies:
  - dummy.DummyClassifier
  - dummy.DummyRegressor
- Ensemble Methods:
- Feature Extraction:
- Feature Selection:
  - feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
  - feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
  - feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
  - feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
  - feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
- Impute:
  - impute.SimpleImputer
- Generalized Linear Models:
- Multiclass classification:
  - multiclass.OneVsRestClassifier
- Naive Bayes:
  - naive_bayes.GaussianNB
- Nearest Neighbors:
  - neighbors.KNeighborsClassifier
  - neighbors.KNeighborsRegressor
- Pipelines:
  - pipeline.FeatureUnion
  - pipeline.Pipeline
- Neural network models:
  - neural_network.MLPClassifier
  - neural_network.MLPRegressor
- Preprocessing and Normalization:
- Support Vector Machines:
- Decision Trees:

Supported third-party Estimator and Transformer types:
- H2O.ai:
- LightGBM:
  - lightgbm.LGBMClassifier
  - lightgbm.LGBMRegressor
- SkLearn2PMML:
  - sklearn2pmml.EstimatorProxy
  - sklearn2pmml.SelectorProxy
  - sklearn2pmml.decoration.Alias
  - sklearn2pmml.decoration.CategoricalDomain
  - sklearn2pmml.decoration.ContinuousDomain
  - sklearn2pmml.decoration.DateDomain
  - sklearn2pmml.decoration.DateTimeDomain
  - sklearn2pmml.decoration.MultiDomain
  - sklearn2pmml.feature_selection.SelectUnique
  - sklearn2pmml.pipeline.PMMLPipeline
  - sklearn2pmml.preprocessing.Aggregator
  - sklearn2pmml.preprocessing.CutTransformer
  - sklearn2pmml.preprocessing.DaysSinceYearTransformer
  - sklearn2pmml.preprocessing.ExpressionTransformer
  - sklearn2pmml.preprocessing.LookupTransformer
  - sklearn2pmml.preprocessing.MultiLookupTransformer
  - sklearn2pmml.preprocessing.PMMLLabelBinarizer
  - sklearn2pmml.preprocessing.PMMLLabelEncoder
  - sklearn2pmml.preprocessing.PowerFunctionTransformer
  - sklearn2pmml.preprocessing.SecondsSinceYearTransformer
  - sklearn2pmml.preprocessing.StringNormalizer
  - sklearn2pmml.preprocessing.h2o.H2OFrameCreator
  - sklearn2pmml.ruleset.RuleSetClassifier
- Sklearn-Pandas:
  - sklearn_pandas.CategoricalImputer
  - sklearn_pandas.DataFrameMapper
- TPOT:
  - tpot.builtins.stacking_estimator.StackingEstimator
- XGBoost:
  - xgboost.XGBClassifier
  - xgboost.XGBRegressor

Production quality:
- Complete test coverage.
- Fully compliant with the JPMML-Evaluator library.

Prerequisites

The Python side of operations

- Python 2.7, 3.4 or newer.
- scikit-learn 0.16.0 or newer.
- sklearn-pandas 0.0.10 or newer.
- sklearn2pmml 0.14.0 or newer.

Validating Python installation:

import sklearn, sklearn.externals.joblib, sklearn_pandas, sklearn2pmml

print(sklearn.__version__)
print(sklearn.externals.joblib.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
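
The printed versions can be checked against the minimums listed above. The following is a minimal sketch, assuming plain dotted version strings; the names `MINIMUMS`, `parse_version` and `meets_minimum` are illustrative helpers and not part of any of these libraries. Suffixes such as `rc1` are simply ignored. Comparing numeric tuples rather than raw strings matters here, because as a string `"0.0.9"` sorts after `"0.0.10"`.

```python
import re

# Minimum versions copied from the prerequisites list above.
MINIMUMS = {
    "sklearn": "0.16.0",
    "sklearn_pandas": "0.0.10",
    "sklearn2pmml": "0.14.0",
}

def parse_version(version):
    """Turn '0.14.0' into (0, 14, 0), keeping only the leading digits of each dotted part."""
    parts = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break  # stop at the first non-numeric part, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '0.16' equals '0.16.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

For example, `meets_minimum(sklearn2pmml.__version__, MINIMUMS["sklearn2pmml"])` would report whether the installed sklearn2pmml package is recent enough.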

The JPMML-SkLearn side of operations

Java 1.8 or newer.

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces an executable uber-JAR file target/jpmml-sklearn-executable-1.5-SNAPSHOT.jar.
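
Because the JAR file name embeds the project version, scripts that call the converter may prefer to look the file up rather than hard-code it. A minimal sketch, assuming the `jpmml-sklearn-executable-*.jar` naming pattern seen above and a hypothetical helper name `find_executable_jar`; if several matching files exist, it picks the last one in lexicographic order, which is not necessarily the newest build.

```python
from pathlib import Path

def find_executable_jar(target_dir="target"):
    """Return the executable uber-JAR in target_dir, or None if none is found."""
    candidates = sorted(Path(target_dir).glob("jpmml-sklearn-executable-*.jar"))
    return candidates[-1] if candidates else None
```

A `None` result most likely means that `mvn clean install` has not been run yet, or that it was run in a different directory.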

Usage

A typical workflow can be summarized as follows:

- Use Python to train a model.
- Serialize the model in pickle data format to a file on the local filesystem.
- Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.

The Python side of operations

Loading data to a pandas.DataFrame object:

import pandas

df = pandas.read_csv("Iris.csv")

iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]

First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain

column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])

Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy

table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])

Please note that stateless Scikit-Learn selector objects need to be wrapped in an sklearn2pmml.SelectorProxy object.

Third, creating an Estimator object:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(min_samples_leaf = 5)

Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)

Embedding model verification data:

pipeline.verify(iris_X.sample(n = 15))

Storing the fitted PMMLPipeline object in pickle data format:

from sklearn.externals import joblib

joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)

Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.

The JPMML-SkLearn side of operations

Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:

java -jar target/jpmml-sklearn-executable-1.5-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
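
A quick sanity check of the generated file can be done from Python with the standard library alone. The sketch below assumes only the general PMML layout (a `PMML` root element carrying a `version` attribute, and `DataField` elements carrying a `name` attribute); the helper names `local_name` and `summarize_pmml` are illustrative. It ignores XML namespaces, so it does not depend on which PMML schema version the converter emits.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]

def summarize_pmml(source):
    """Return the PMML schema version and the names of the declared data fields."""
    root = ET.parse(source).getroot()
    if local_name(root.tag) != "PMML":
        raise ValueError("Not a PMML document: root element is " + local_name(root.tag))
    fields = [
        element.get("name")
        for element in root.iter()
        if local_name(element.tag) == "DataField"
    ]
    return root.get("version"), fields
```

Calling `summarize_pmml("pipeline.pmml")` returns a `(version, field_names)` tuple, which gives a quick way to confirm that the expected input and target columns made it into the file.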

Getting help:

java -jar target/jpmml-sklearn-executable-1.5-SNAPSHOT.jar --help
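
Both invocations above can also be driven from a Python script, which keeps training and conversion in one place. A minimal sketch, assuming Python 3.5 or newer (for `subprocess.run`) and a `java` executable on the `PATH`; `build_command` and `convert` are hypothetical helper names, and only the `--pkl-input`, `--pmml-output` and `--help` options shown above are used.

```python
import subprocess

# JAR path as produced by the build step; adjust if the version differs.
JAR = "target/jpmml-sklearn-executable-1.5-SNAPSHOT.jar"

def build_command(pkl_input=None, pmml_output=None, jar=JAR):
    """Build the java command line; with no input and output given, ask for help instead."""
    command = ["java", "-jar", jar]
    if pkl_input is None and pmml_output is None:
        return command + ["--help"]
    if pkl_input is None or pmml_output is None:
        raise ValueError("Both pkl_input and pmml_output are required for a conversion")
    return command + ["--pkl-input", pkl_input, "--pmml-output", pmml_output]

def convert(pkl_input, pmml_output, jar=JAR):
    """Run the converter; check=True raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(pkl_input, pmml_output, jar), check=True)
```

With this in place, `convert("pipeline.pkl.z", "pipeline.pmml")` corresponds to the conversion command shown earlier. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.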

License

JPMML-SkLearn is dual-licensed under the GNU Affero General Public License (AGPL) version 3.0, and a commercial license.

Additional information

JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.

Interested in using JPMML software in your application? Please contact info@openscoring.io
