NOTE:
- This is highly experimental proof-of-concept code, so there are many areas for improvement, and bugs. Use for fun only. A more complete open source version will be released later.
- The backend for evaluating a given model in Elasticsearch is independent of Spark, Weka, or R. They are used here only to show that it is possible to train a model in any desired ML framework and integrate it.
## GENERATOR
The generator portion of the code trains a model and indexes a CSV into Elasticsearch. It generates:
- A trained model with the given algorithm.
- An Elasticsearch index with the corresponding types and values.
- A configuration file that will be used by the plugin.

The properties file accepts the following options:
- data.filename = CSV file that will be loaded to generate everything.
- data.columns = List of the names of the attributes from the dataset that will be used to create the index and model. They must be separated by commas and must include the variable to be predicted. If the value is "all", all attributes of the dataset are used.
- classifier.lib = Library that will be used to generate the model, e.g. weka or a Spark model type such as spark.logistic-regression.
- train.percentage = Percentage of the data that will be used for training. The rest of the data will be used to test the model.
- validate.options = Options for the type of validation performed on the model. The options are the following:
- Sd: Tests the model using the test partition as the evaluation set of the model.
- Cv: Tests the model using cross-validation.
- validate.numFolds = Number of folds for cross-validation.
- model.filename = Complete path where the model will be created.
- cluster.name = Name of elasticsearch cluster.
- node.name = Name of elasticsearch node.
- index.name = Name of the index that will be created. It will be deleted before creating it again, so be careful.
- mapping.filename = File with the mapping that will be needed for the plugin.
- host = Elasticsearch host.
- port = Elasticsearch port.
- spark.model.type = Currently only linear models are supported.
- spark.model.params = param1:val1,param2:val2,... parameters for model training.
- spark.model.isregression = Whether the model is a regression model ("true") or a classifier ("false").
- spark.model.binThreshold = Threshold used to binarize the target variable.
- spark.model.numClasses = Number of target classes for classification.
- spark.conf = sparkConfKey1:sparkConfVal1,sparkConfKey2:sparkConfVal2 sets Spark configuration values such as master, parallelism, etc.
- weka.classifier.class = Fully qualified name of the Weka classifier that will be used for prediction: package + class name.
- weka.classifier.options = List of options required by the classifier in use, separated by spaces. If the classifier does not require options, remove this property.
- weka.data.saveArff = Indicates whether an ARFF file should be generated and saved. The required value is "true"; with any other value the file is not generated.
- weka.data.fileArff = Complete path of the file that will be created. It must have the .arff extension.
Example (Spark):
#Generic options
data.filename=adult_num.csv
data.columns=age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours-per-week,native-country,probability
data.type=csv
#Classifier options
classifier.lib=spark.logistic-regression
train.percentage=80
validate.options=SdCv
validate.numFolds=5
model.filename=adult.model
#Elasticsearch options
cluster.name=elasticsearch_sdhu
node.name=Coral
index.name=adult-logistic-regression
mapping.filename=adult-logistic-regression.test
host=localhost
port=9300
#Spark options
spark.model.type=logistic-regression
spark.model.isregression=false
spark.model.binThreshold=0.5
spark.model.params=numIterations:100,regParam:0.01,minBatchFraction:1.0,stepSize:1.0
spark.model.numClasses=2
spark.conf=spark.master:local[4],spark.driver.memory:512m
Example (Weka):
#Generic options
data.filename=/path/to/csv/file
data.columns=age,workclass,sex,capital_gain,capital_loss,native-country,probability
#Classifier options
classifier.lib=weka
train.percentage=80
validate.options=SdCv
validate.numFolds=5
model.filename=/path/where/model/file/will/be/saved
#Elasticsearch options
cluster.name=
node.name=
index.name=
mapping.filename=/path/where/mapping/file/will/be/saved
host=localhost
port=9300
#Weka options
weka.classifier.class=weka.classifiers.trees.RandomTree
weka.classifier.options=-S 3 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.001 -P 0.1 -seed 1 -h 0
weka.data.saveArff=true
weka.data.fileArff=/path/where/arff/file/will/be/saved
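Before running the generator, a quick sanity check can catch a missing or misspelled key in the properties file. The sketch below is only a helper idea, not part of the project; it checks a handful of the generic keys described above against a stand-in file, and can be pointed at your real properties file instead.

```shell
# Write a minimal stand-in properties file and verify that a few
# required generator keys are present (extend the key list as needed).
cat > /tmp/generator.properties <<'EOF'
data.filename=adult_num.csv
data.columns=age,workclass,sex,probability
classifier.lib=weka
train.percentage=80
model.filename=/tmp/adult.model
EOF

for key in data.filename data.columns classifier.lib train.percentage model.filename; do
  if grep -q "^${key}=" /tmp/generator.properties; then
    echo "ok: ${key}"
  else
    echo "missing: ${key}"
  fi
done
```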
export MAVEN_OPTS=-Xss2m
cd search-prediction-api
mvn clean
mvn package
cd ../elasticsearch-prediction-spark
sbt assembly
cd ../elasticsearch-prediction-pmml
sbt assembly
cd ../search-prediction-weka-impl
mkdir lib
cp ../search-prediction-api/target/search-prediction-api-1.0.jar ./lib
mvn clean
mvn package
cd ../search-prediction-example
mkdir lib
cp ../search-prediction-api/target/search-prediction-api-1.0.jar ./lib
cp ../search-prediction-weka-impl/target/search-prediction-weka-impl-1.0.jar ./lib
cp ../elasticsearch-prediction-spark/target/scala-2.10/elasticsearchprediction-spark-assembly-0.1.jar ./lib/
cp ../elasticsearch-prediction-pmml/target/scala-2.10/elasticsearchprediction-pmml-assembly-0.1.jar ./lib/
mvn install:install-file -Dfile=lib/search-prediction-weka-impl-1.0.jar -DgroupId=com.mahisoft.elasticsearchprediction -DartifactId=search-prediction-weka-impl -Dversion=1.0 -Dpackaging=jar
mvn install:install-file -Dfile=lib/search-prediction-api-1.0.jar -DgroupId=com.mahisoft.elasticsearchprediction -DartifactId=search-prediction-api -Dversion=1.0 -Dpackaging=jar
mvn install:install-file -Dfile=lib/elasticsearchprediction-spark-assembly-0.1.jar -DgroupId=com.sdhu -DartifactId=elasticsearchprediction-spark -Dversion=0.1 -Dpackaging=jar
mvn install:install-file -Dfile=lib/elasticsearchprediction-pmml-assembly-0.1.jar -DgroupId=com.sdhu -DartifactId=elasticsearchprediction-pmml -Dversion=0.1 -Dpackaging=jar
mvn clean
mvn -Pgenerator package
java -jar target/releases/search-prediction-1.0.jar /path/to/the/created/properties/file
If there are no errors, the model and index were generated successfully.
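As an extra check, the files named by model.filename and mapping.filename should exist and be non-empty after the run. The helper below is a sketch; the paths are the ones from the example properties above, so substitute your own.

```shell
# Report whether each expected generator artifact exists and is non-empty.
check_artifact() {
  [ -s "$1" ] && echo "ok: $1" || echo "missing: $1"
}
check_artifact adult.model
check_artifact adult-logistic-regression.test
```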
## PLUGIN
This step uses the generated model and mapping to build a plugin that can be installed into Elasticsearch to score documents in the created index in real time.
- modelPath = Path to the generated model.
- mapping = Contents of the generated mapping file.
- classifier.lib = Library that will be used to make the classification. Right now it only supports weka.
- targetName = Column name of the predicted score. It should match data.column.label if the model was trained with the Spark or Weka module.
Example:
modelPath=/path/to/model/file
mapping=age:double,workclass:string,fnlwgt:double,education:string,education_num:double,marital_status:string,occupation:string,relationship:string,race:string,sex:string,capital_gain:double,capital_loss:double,hours-per-week:double,native-country:string
classifier.lib=weka
targetName=probability
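The mapping value is a comma-separated list of field:type pairs. To eyeball it against the index mapping, the pairs can be split one per line; the field list below is a shortened stand-in for the full example above.

```shell
# Print each field:type pair of the plugin mapping on its own line.
mapping='age:double,workclass:string,sex:string,capital_gain:double,probability:double'
echo "$mapping" | tr ',' '\n'
```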
mvn clean
mvn -Pplugin package
sudo /usr/share/elasticsearch/bin/plugin -remove search-predictor
sudo /usr/share/elasticsearch/bin/plugin -install search-predictor -url file:///path/to/target/releases/search-prediction-1.0.zip
sudo service elasticsearch restart
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": "search-predictor",
            "lang": "native",
            "params": {}
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}
Make sure to use the following URL: http://host:port/indexname/_default_
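Putting it together, the query above can be saved to a file and sent with curl. This is a sketch under a few assumptions: the index name is taken from the Spark example above, and the HTTP port is the default 9200 (the 9300 in the properties file is the transport port); `_search` appended to the URL form above is the standard Elasticsearch search endpoint.

```shell
# Save the function_score query, validate that it is well-formed JSON, and
# show the curl call (commented out so the sketch runs without a live cluster).
cat > /tmp/predict-query.json <<'EOF'
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        { "script_score": { "script": "search-predictor", "lang": "native", "params": {} } }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}
EOF
python3 -m json.tool < /tmp/predict-query.json > /dev/null && echo "query JSON ok"
# curl -s -XPOST 'http://localhost:9200/adult-logistic-regression/_default_/_search' \
#      -d @/tmp/predict-query.json
```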