Pattern is a sub-project of Cascading (http://Cascading.org/) that uses flows as containers for machine learning models, importing PMML model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc.
Current support for PMML includes:
- Random Forest in PMML 4.0+ exported from R/Rattle
- Linear Regression in PMML 1.1+
- Hierarchical Clustering and K-Means Clustering in PMML 2.0+
- Logistic Regression in PMML 4.0.1+
To build Pattern and then run its unit tests:
gradle --info --stacktrace clean test
The following scripts generate a baseline (model+data) for the Random Forest algorithm. This baseline includes a reference data set -- 1000 independent variables, 500 rows of simulated ecommerce orders -- plus a predictive model in PMML:
./src/py/gen_orders.py 500 1000 > orders.tsv
R --vanilla < ./src/r/rf_pmml.R > model.log
This will generate huge.rf.xml as the PMML export for a Random Forest classifier, plus huge.tsv as a baseline data set for regression testing.
To build Pattern and run a regression test:
gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
--pmml data/sample.rf.xml --measure out/measure --assert
For each tuple in the data, a stream assertion tests whether the predicted field matches the score field generated by the model. Tuples which fail that assertion get trapped into out/trap/part* for inspection.
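The assert-and-trap behavior described above can be mimicked outside Hadoop for a quick sanity check. The following Python sketch is not part of Pattern. It assumes tab-separated text with a header row containing the predicted and score fields; that file layout, the order_id column, and the function name split_trapped are assumptions made for illustration.

```python
import csv
import io


def split_trapped(tsv_text, expected="predicted", actual="score"):
    """Split rows into (passed, trapped) by comparing two columns.

    Assumes tab-separated text with a header row. This layout is an
    assumption for illustration, not Pattern's documented output format.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    passed, trapped = [], []
    for row in reader:
        # a row "fails the assertion" when the two fields differ
        if row[expected] == row[actual]:
            passed.append(row)
        else:
            trapped.append(row)
    return passed, trapped


# hypothetical sample data: order_id is an invented column name
sample = "order_id\tpredicted\tscore\n1\t0\t0\n2\t1\t0\n3\t1\t1\n"
passed, trapped = split_trapped(sample)
print(len(passed), "passed;", len(trapped), "trapped")
```

Rows that land in the trapped list correspond to what would be written under out/trap/part* in the Hadoop run.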
Also, the confusion matrix shown in out/measure/part* should match the one logged in model.log from the baseline generated in R.
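For readers unfamiliar with the term, a confusion matrix is a table of counts for each (expected label, computed label) pair. The Python sketch below shows how such counts can be derived from pairs of predicted and score values. It is an illustration only: the function names are hypothetical, and the printed layout is not claimed to match what Pattern writes to out/measure/part* or what R logs in model.log.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Count (expected, actual) label pairs.

    `pairs` is any iterable of (predicted, score) tuples, following the
    field names used in this README.
    """
    return Counter(pairs)


def format_matrix(counts):
    """Render the counts as a tab-separated table, one row per expected label."""
    labels = sorted({label for pair in counts for label in pair})
    lines = ["\t".join([""] + labels)]
    for expected in labels:
        # Counter returns 0 for pairs that never occurred
        cells = [str(counts[(expected, actual)]) for actual in labels]
        lines.append("\t".join([expected] + cells))
    return "\n".join(lines)


# hypothetical label pairs, invented for this example
pairs = [("0", "0"), ("1", "0"), ("1", "1"), ("0", "0")]
print(format_matrix(confusion_matrix(pairs)))
```

Off-diagonal cells are the misclassified tuples, so a model that reproduces the baseline exactly has zeros everywhere except the diagonal.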
To run on Amazon AWS, take a look at the emr.sh script.
Here's how to run an example classifier using Random Forest:
gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/iris.rf.tsv out/classify out/trap \
--pmml data/iris.rf.xml --measure out/measure --label species
Here's how to run an example predictive model using Linear Regression:
gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
--pmml data/iris.lm_p.xml --rmse out/measure
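The --rmse option suggests that regression output is measured by root mean square error rather than by a confusion matrix. This README does not define the calculation, so the Python sketch below assumes the standard definition: the square root of the mean squared difference between expected and computed values. The function name and sample numbers are hypothetical.

```python
import math


def rmse(pairs):
    """Root mean square error over (expected, actual) numeric pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    total = sum((expected - actual) ** 2 for expected, actual in pairs)
    return math.sqrt(total / len(pairs))


# hypothetical (expected, actual) values, invented for this example
print(rmse([(1.0, 1.5), (2.0, 1.5), (3.0, 3.0)]))
```

A value of zero means the computed values reproduce the expected ones exactly; larger values indicate larger average deviation.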
Alternatively, if you want to re-use this assembly for your own Cascading app, remove the parts related to verifyPipe and measurePipe from the sample code.
The following snippet in R shows how to train a Random Forest model, then generate PMML as a file called sample.rf.xml:
library(randomForest)  # randomForest()
library(pmml)          # pmml()
library(XML)           # saveXML()
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
saveXML(pmml(fit), file="sample.rf.xml")
To use the PMML file in your Cascading app, this example references it through a command line argument called pmmlPath:
// define a "Classifier" model from PMML to evaluate the orders
ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getFields(), classFunc, Fields.ALL );
Now when you run that Cascading app, provide a reference to sample.rf.xml for the pmmlPath argument.
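Before passing a PMML file to the app, it can help to check which fields the model declares, since the input tuples need to supply them. The Python sketch below lists the DataField entries from a PMML document's DataDictionary using only the standard library. The embedded sample is a tiny hand-written fragment for illustration (its field names are invented), not output from R, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written PMML fragment used only for this illustration.
SAMPLE_PMML = """<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="illustrative sample"/>
  <DataDictionary numberOfFields="2">
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="species" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def data_fields(pmml_text):
    """Return (name, optype, dataType) for each DataField in the document."""
    root = ET.fromstring(pmml_text)
    return [
        (el.get("name"), el.get("optype"), el.get("dataType"))
        for el in root.iter()
        if local_name(el.tag) == "DataField"
    ]


for field in data_fields(SAMPLE_PMML):
    print(field)
```

To inspect a real export, read the file as bytes and pass the contents to the same function, for example data_fields(open("sample.rf.xml", "rb").read()). Matching on the local tag name keeps the sketch independent of which PMML version namespace the file declares.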
An architectural diagram for common use case patterns is shown in docs/pattern.graffle, which is an OmniGraffle document.
Check the src/r/rattle_pmml.R script for examples of predictive models which are created in R, then exported using Rattle. These examples use the popular Iris data set.
- random forest (rf)
- linear regression (lm)
- hierarchical clustering (hclust)
- k-means clustering (kmeans)
- logistic regression (glm)
- multinomial model (multinom)
- single hidden-layer neural network (nnet)
- support vector machine (ksvm)
- recursive partition classification tree (rpart)
- association rules
To execute the R script:
R --vanilla < src/r/rattle_pmml.R
PMML support can be extended to other kinds of modeling in R and other analytics platforms. To discuss, contact the developers on the cascading-user email forum.
- Data Mining Group XML standards and supported vendors
- PMML In Action book
- PMML validator