cascading.pattern

Pattern sub-project for http://Cascading.org/ which uses flows as containers for machine learning models, importing PMML model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc.

Current support for PMML includes:

Random Forest in PMML 4.0+ exported from R/Rattle
Linear Regression in PMML 1.1+
Hierarchical Clustering and K-Means Clustering in PMML 2.0+
Logistic Regression in PMML 4.0.1+

Build Instructions

To build Pattern and then run its unit tests:

gradle --info --stacktrace clean test

The following scripts generate a baseline (model+data) for the Random Forest algorithm. This baseline consists of a reference data set (1000 independent variables, 500 rows of simulated ecommerce orders) plus a predictive model in PMML:

./src/py/gen_orders.py 500 1000 > orders.tsv
R --vanilla < ./src/r/rf_pmml.R > model.log

This will generate huge.rf.xml as the PMML export for a Random Forest classifier plus huge.tsv as a baseline data set for regression testing.

To build Pattern and run a regression test:

gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
 --pmml data/sample.rf.xml --measure out/measure --assert

For each tuple in the data, a stream assertion tests whether the predicted field matches the score field generated by the model. Tuples which fail that assertion get trapped into out/trap/part* for inspection.
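
The check described above can be sketched in plain Java, without Cascading, to show what ends up in the trap. This is only an illustration: the ScoredRow and TrapSketch classes and the "predict"/"score" field names are assumptions for the sketch, not part of Pattern's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scored tuple: the expected value from the
// baseline data ("predict") and the value produced by the model ("score").
final class ScoredRow {
    final String predict;
    final String score;

    ScoredRow(String predict, String score) {
        this.predict = predict;
        this.score = score;
    }
}

final class TrapSketch {
    // Returns the rows whose model score disagrees with the expected value;
    // these correspond to the tuples that would be diverted to the trap.
    static List<ScoredRow> trapped(List<ScoredRow> rows) {
        List<ScoredRow> failed = new ArrayList<>();
        for (ScoredRow row : rows) {
            if (!row.predict.equals(row.score)) {
                failed.add(row);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<ScoredRow> rows = new ArrayList<>();
        rows.add(new ScoredRow("1", "1"));
        rows.add(new ScoredRow("0", "1"));
        rows.add(new ScoredRow("0", "0"));
        System.out.println(trapped(rows).size() + " row(s) trapped");
    }
}
```

An empty result means every row passed the assertion, which is the expected outcome for a regression test against an unchanged baseline.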

Also, the confusion matrix shown in out/measure/part* should match the one logged in model.log from the baseline generated in R.
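
For readers unfamiliar with the term, a confusion matrix counts how often each actual label was paired with each predicted label. The following is a minimal sketch in plain Java; the class name and the sample labels are hypothetical, and the layout of the real out/measure output may differ.

```java
import java.util.Map;
import java.util.TreeMap;

final class ConfusionMatrixSketch {
    // Tallies (actual, predicted) pairs. The outer key is the actual label,
    // the inner key is the predicted label, and the value is the pair count.
    static Map<String, Map<String, Integer>> tally(String[] actual, String[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        for (int i = 0; i < actual.length; i++) {
            matrix.computeIfAbsent(actual[i], k -> new TreeMap<>())
                  .merge(predicted[i], 1, Integer::sum);
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Made-up labels for illustration only.
        String[] actual = {"0", "0", "1", "1", "1"};
        String[] predicted = {"0", "1", "1", "1", "0"};
        System.out.println(tally(actual, predicted));
    }
}
```

Counts on the diagonal (actual equals predicted) are correct classifications; everything else is an error of a specific kind, which is why comparing the whole matrix is a stricter check than comparing a single accuracy figure.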

To run on Amazon AWS, take a look at the emr.sh script.

Classifier vs. Predictive Model

Here's how to run an example classifier using Random Forest:

gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/iris.rf.tsv out/classify out/trap \
 --pmml data/iris.rf.xml --measure out/measure --label species

Here's how to run an example predictive model using Linear Regression:

gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
 --pmml data/iris.lm_p.xml --rmse out/measure
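
The --rmse option refers to root mean square error, the usual accuracy measure for a model that predicts a number rather than a class. As a rough sketch of that calculation in plain Java (the class and method names are hypothetical, and this is not Pattern's own implementation):

```java
final class RmseSketch {
    // Root mean square error: the square root of the mean squared
    // difference between expected and predicted values.
    static double rmse(double[] expected, double[] predicted) {
        if (expected.length != predicted.length || expected.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double diff = expected[i] - predicted[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / expected.length);
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        double[] expected = {1.4, 4.7, 6.0};
        double[] predicted = {1.5, 4.5, 6.1};
        System.out.println("RMSE = " + rmse(expected, predicted));
    }
}
```

A value of 0 means every prediction matched exactly; larger values mean larger typical errors, in the same units as the predicted field.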

Use in Cascading Apps

To re-use this assembly in your own Cascading app, remove the parts related to verifyPipe and measurePipe from the sample code.

The following snippet in R shows how to train a Random Forest model, then generate PMML as a file called sample.rf.xml:

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
saveXML(pmml(fit), file="sample.rf.xml")

To use the PMML file in your Cascading app, reference it by path. In this example the path is passed as a command line argument called pmmlPath:

// define a "Classifier" model from PMML to evaluate the orders
ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getFields(), classFunc, Fields.ALL );

Now when you run that Cascading app, provide a reference to sample.rf.xml for the pmmlPath argument.
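
Before handing a PMML file to the app, it can help to confirm which PMML version and model element it contains. The sketch below uses only the JDK's XML parser. The PmmlPeek class is hypothetical and not part of Pattern; it assumes an unprefixed PMML document and treats every top-level element other than Header, MiningBuildTask, DataDictionary, TransformationDictionary and Extension as a model.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class PmmlPeek {
    // Top-level PMML children that do not hold a model definition.
    private static final List<String> NON_MODEL = Arrays.asList(
        "Header", "MiningBuildTask", "DataDictionary",
        "TransformationDictionary", "Extension");

    // Summarizes a PMML document as "version=<v> models=[<element names>]".
    static String describe(InputStream in) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in).getDocumentElement();
            List<String> models = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && !NON_MODEL.contains(child.getNodeName())) {
                    models.add(child.getNodeName());
                }
            }
            return "version=" + root.getAttribute("version") + " models=" + models;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable PMML document", e);
        }
    }

    public static void main(String[] args) {
        // Stub document for illustration only; a real export is far larger.
        String stub = "<PMML version=\"4.0\"><Header/><DataDictionary/>"
            + "<MiningModel/></PMML>";
        InputStream in = new ByteArrayInputStream(stub.getBytes(StandardCharsets.UTF_8));
        System.out.println(describe(in));
    }
}
```

To inspect a real export, pass a java.io.FileInputStream opened on the same path that is given as pmmlPath.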

An architectural diagram of common use case patterns is provided in docs/pattern.graffle, an OmniGraffle document.

Example Models

Check the src/r/rattle_pmml.R script for examples of predictive models which are created in R, then exported using Rattle. These examples use the popular Iris data set.

random forest (rf)
linear regression (lm)
hierarchical clustering (hclust)
k-means clustering (kmeans)
logistic regression (glm)
multinomial model (multinom)
single hidden-layer neural network (nnet)
support vector machine (ksvm)
recursive partition classification tree (rpart)
association rules

To execute the R script:

R --vanilla < src/r/rattle_pmml.R

It is possible to extend PMML support to other kinds of modeling in R and other analytics platforms. To discuss this, contact the developers on the cascading-user email forum.

PMML Resources

Data Mining Group XML standards and supported vendors
PMML In Action book
PMML validator