ceteri / pattern

"Pattern" sub-project for Cascading, which uses Cascading flows as containers for machine learning models, importing PMML model descriptions from R, SAS, KNIME, Weka, RapidMiner, etc.

Home Page: http://cascading.org/

cascading.pattern

Pattern is a sub-project of Cascading (http://Cascading.org/) that uses flows as containers for machine learning models, importing PMML model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc.

Current support for PMML includes:

Build Instructions

To build Pattern and then run its unit tests:

gradle --info --stacktrace clean test

The following scripts generate a baseline (model+data) for the Random Forest algorithm. This baseline includes a reference data set -- 1000 independent variables, 500 rows of simulated ecommerce orders -- plus a predictive model in PMML:

./src/py/gen_orders.py 500 1000 > orders.tsv
R --vanilla < ./src/r/rf_pmml.R > model.log

This will generate huge.rf.xml as the PMML export for a Random Forest classifier plus huge.tsv as a baseline data set for regression testing.

To build Pattern and run a regression test:

gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
 --pmml data/sample.rf.xml --measure out/measure --assert

For each tuple in the data, a stream assertion tests whether the predicted field matches the score field generated by the model. Tuples which fail that assertion get trapped into out/trap/part* for inspection.
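The assertion-and-trap behavior described above can be sketched outside of Cascading. The following is a minimal illustration in plain Python, not Pattern's implementation: the field names `predict` and `order_id` and the sample rows are hypothetical, and only `score` is a name taken from this README.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_by_assertion(
    rows: List[Row], expected: str = "predict", actual: str = "score"
) -> Tuple[List[Row], List[Row]]:
    """Return (passed, trapped): rows where the two fields agree, and rows where they differ."""
    passed: List[Row] = []
    trapped: List[Row] = []
    for row in rows:
        if row[expected] == row[actual]:
            passed.append(row)
        else:
            trapped.append(row)
    return passed, trapped


# Hypothetical tuples: "predict" is the reference value, "score" is the model output.
rows = [
    {"order_id": "1", "predict": "0", "score": "0"},
    {"order_id": "2", "predict": "1", "score": "0"},
    {"order_id": "3", "predict": "1", "score": "1"},
]

passed, trapped = split_by_assertion(rows)
print(len(passed), "passed;", len(trapped), "trapped")
for row in trapped:
    print("trapped:", row)
```

An empty trap output then corresponds to a run in which every tuple satisfied the assertion.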

Also, the confusion matrix shown in out/measure/part* should match the one logged in model.log from the baseline generated in R.
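For readers unfamiliar with the term, a confusion matrix counts how often each expected label was scored as each possible label. The sketch below shows one way to compute and print such counts in plain Python. The label pairs are made-up toy data, and the layout of the printed table is an assumption, not the format written to out/measure.

```python
from collections import Counter
from typing import Iterable, Tuple


def confusion_matrix(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count occurrences of each (expected label, scored label) pair."""
    return Counter(pairs)


def format_matrix(matrix: Counter) -> str:
    """Render the counts as a tab-separated table: rows are expected, columns are scored."""
    labels = sorted({label for pair in matrix for label in pair})
    lines = ["\t".join(["expected/scored"] + labels)]
    for expected in labels:
        counts = [str(matrix[(expected, scored)]) for scored in labels]
        lines.append("\t".join([expected] + counts))
    return "\n".join(lines)


# Toy (expected, scored) pairs, invented for illustration only.
pairs = [
    ("setosa", "setosa"),
    ("setosa", "setosa"),
    ("versicolor", "versicolor"),
    ("versicolor", "virginica"),
    ("virginica", "virginica"),
    ("virginica", "virginica"),
]

matrix = confusion_matrix(pairs)
print(format_matrix(matrix))
```

Counts on the diagonal are correct classifications; any off-diagonal count is a misclassification worth comparing against the R baseline.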

To run on Amazon AWS, take a look at the emr.sh script.

Classifier vs. Predictive Model

Here's how to run an example classifier using Random Forest:

gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/iris.rf.tsv out/classify out/trap \
 --pmml data/iris.rf.xml --measure out/measure --label species

Here's how to run an example predictive model using Linear Regression:

gradle clean jar
rm -rf out
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
 --pmml data/iris.lm_p.xml --rmse out/measure
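The `--rmse` option presumably refers to root mean squared error, the usual error measure for a regression model. As a reference for what that number means, here is a minimal sketch of the calculation in plain Python. The input values are invented, and this is not the code Pattern uses to write out/measure.

```python
import math
from typing import Sequence


def rmse(expected: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(expected) != len(predicted) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented values: three exact predictions and one that is off by 2.
expected = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

# Squared errors are 0, 0, 0, 4; their mean is 1; the square root is 1.0.
print(rmse(expected, predicted))
```

A lower value means the predicted field tracks the expected field more closely, in the same units as the predicted variable.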

Use in Cascading Apps

Alternatively, if you want to re-use this assembly for your own Cascading app, remove the parts related to verifyPipe and measurePipe from the sample code.

The following snippet in R shows how to train a Random Forest model, then generate PMML as a file called sample.rf.xml:

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
saveXML(pmml(fit), file="sample.rf.xml")

To use the PMML file in your Cascading app, pass its location to the app. In this example it is referenced as a command line argument called pmmlPath:

// define a "Classifier" model from PMML to evaluate the orders
ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getFields(), classFunc, Fields.ALL );

Now when you run that Cascading app, provide a reference to sample.rf.xml for the pmmlPath argument.
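Before handing a PMML file to the app, it can be useful to check which fields the model declares, since the input tuples need to supply them. The sketch below reads the `DataField` entries from a PMML `DataDictionary` using only the Python standard library. The embedded XML is a hand-written, trimmed fragment for illustration (it contains no model element and is not output from R), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hand-written fragment for illustration; a real export also contains a model element.
PMML_TEXT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="hand-written example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="label" optype="categorical" dataType="string"/>
    <DataField name="var0" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>
"""


def local_name(tag: str) -> str:
    """Strip an '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def data_fields(pmml_text: str) -> List[Tuple[Optional[str], Optional[str], Optional[str]]]:
    """Return (name, optype, dataType) for every DataField in a PMML document."""
    root = ET.fromstring(pmml_text)
    if local_name(root.tag) != "PMML":
        raise ValueError("root element is not PMML")
    return [
        (elem.get("name"), elem.get("optype"), elem.get("dataType"))
        for elem in root.iter()
        if local_name(elem.tag) == "DataField"
    ]


fields = data_fields(PMML_TEXT)
for name, optype, data_type in fields:
    print(name, optype, data_type)
```

Matching on the local element name rather than a fixed namespace keeps the sketch usable across PMML versions, which use different namespace URIs.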

An architectural diagram for common use case patterns is provided in docs/pattern.graffle, an OmniGraffle document.

Example Models

Check the src/r/rattle_pmml.R script for examples of predictive models which are created in R, then exported using Rattle. These examples use the popular Iris data set.

  • random forest (rf)
  • linear regression (lm)
  • hierarchical clustering (hclust)
  • k-means clustering (kmeans)
  • logistic regression (glm)
  • multinomial model (multinom)
  • single hidden-layer neural network (nnet)
  • support vector machine (ksvm)
  • recursive partition classification tree (rpart)
  • association rules

To execute the R script:

R --vanilla < src/r/rattle_pmml.R

It is possible to extend PMML support to other kinds of modeling in R and other analytics platforms. To discuss this, contact the developers on the cascading-user email forum.

PMML Resources

License: Other