This library contains algorithms for iterative feature selection (IFS) on Spark. It works with all three types of distributed dataset that Spark supports (RDD, DataFrame, Dataset) and allows custom scoring functions for selection; the default one provided is mRMR (Minimum Redundancy - Maximum Relevance). It also includes a command-line tool that generates random integer datasets and performs mRMR selection on them.
This project is based on this paper and it's a rework of this implementation.
The following instructions will allow you to build the library into a .jar file.
The following software must be installed on your computer:
- Scala 2.11.12 (any 2.11 version should work; 2.12 is not supported by Spark 2.3.0)
- Apache Spark 2.3.0
- sbt 1.1.4
Other versions may work but have not been tested.
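Before building, it can help to confirm that the three tools are on your PATH. The following is a minimal sketch and not part of this repository; the check_tool helper is a hypothetical name introduced here for illustration.

```shell
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: NOT FOUND"
    return 1
  fi
}

# Count how many of the required tools are missing.
missing=0
for tool in scala sbt spark-submit; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

To compare the installed versions against the ones listed above, run scala -version, sbt sbtVersion and spark-submit --version.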
To build the jar, execute the following command in your terminal from the root of this repository (where the build.sbt file is located):
sbt assembly
The jar will be generated in the target/scala-2.11 directory and named spark-ifs-assembly-X.Y.jar (where X.Y is the version number, e.g. 1.0).
The generated jar file can be run with spark-submit using the following syntax:
spark-submit [spark arguments] spark-ifs-assembly-X.Y.jar [tool arguments]
This link points to a list of the arguments supported by spark-submit.
The following listing shows all possible arguments for the tool.
This program can be used to do IFS on datasets loaded from csv files (and to generate random datasets to csv).
-h, --help Show help message
Subcommand: gen
Generates a dataset with the given size.
-a, --alt-file <arg> Path to the csv in alternate encoding (without the label row)
-c, --cols <arg> Number of columns
-f, --file <arg> Path to the csv in conventional encoding
-l, --labels <arg> Path to the csv containing the label row (required for alternate encoding)
-r, --rows <arg> Number of rows
-h, --help Show help message
Subcommand: select
Selects the given number of features from the provided csv datasets.
NOTE: for this task spark-submit must be used.
-a, --alt-file <arg> Path to the csv in alternate encoding (without the label row)
-f, --file <arg> Path to the csv in conventional encoding
-l, --labels <arg> Path to the csv containing the label row (required for alternate encoding)
-n, --num-features <arg> Number of features (columns) to be selected
-v, --verbose Prints more information during execution
--noverbose Only prints the results
-h, --help Show help message
Note: the gen subcommand does not need spark-submit to run; scala or even java -jar can be used instead.
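As an illustration of how the two subcommands fit together, the sketch below generates a random dataset and then selects features from it, using only the flags from the listing above. The jar version (1.0), the file names (random.csv, features.csv, labels.csv), the sizes and the local[*] master are assumptions chosen for the example, and the run helper with its DRY_RUN variable is a hypothetical wrapper, not part of the tool. By default it only prints each command; set DRY_RUN=0 to execute them.

```shell
# Hypothetical jar version and location; adjust to your build.
JAR="${JAR:-target/scala-2.11/spark-ifs-assembly-1.0.jar}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Generate a random dataset with 1000 rows and 50 columns in conventional
# encoding. No Spark is needed for this step, so java -jar is enough.
run java -jar "$JAR" gen --rows 1000 --cols 50 --file random.csv

# Select 10 features from it with mRMR on a local Spark master.
run spark-submit --master 'local[*]' "$JAR" select \
  --file random.csv --num-features 10 --verbose

# Same selection on a dataset in alternate encoding, which needs the
# label row in a separate csv passed with --labels.
run spark-submit --master 'local[*]' "$JAR" select \
  --alt-file features.csv --labels labels.csv --num-features 10
```

On a cluster, replace the --master value with the one for your deployment.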
To include library files in your program, you can either copy them into your source directory, or you can add the generated jar to your classpath.
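If your own program is built with sbt, one way to put the jar on the classpath is to copy it into the project's lib directory, which sbt treats as unmanaged dependencies. This is a sketch under that assumption, and it also assumes your project uses the same Scala and Spark versions as the jar; add_ifs_jar is a hypothetical helper name.

```shell
# Copy the assembled jar into an sbt project's lib/ directory, which sbt
# adds to the classpath as unmanaged dependencies.
# Usage: add_ifs_jar <path-to-jar> <path-to-your-sbt-project>
add_ifs_jar() {
  jar="$1"
  project="$2"
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar (run 'sbt assembly' first)" >&2
    return 1
  fi
  mkdir -p "$project/lib" && cp "$jar" "$project/lib/"
}
```

For example, add_ifs_jar target/scala-2.11/spark-ifs-assembly-1.0.jar ../my-app would copy the jar to ../my-app/lib, where both paths are placeholders.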