cassowary4health / FunctionalRandomForest

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Functional Random Forest (FRF) Manual

The Functional Random Forest presented in Subtyping cognitive profiles in Autism Spectrum Disorder using a random forest algorithm, E. Feczko Ph.D. et al.

This manual documents how to use the RFAnalysis package written by Eric Feczko. The manual is split into two main sections. The first section covers how to analyze cross-sectional data with the Random forest subgroup detection (RFSD) tool. The second covers how to analyze longitudinal trajectories with the Function Random Forest (FRF) tool. A brief introduction will walk the user through installing the software.

Installation

Getting the package

The FRF code can be found on the github repository. This repository is intended for public release as a stable build. For the developmental builds please contact Eric Feczko (feczko@ohsu.edu) or the Fair Lab. The repository itself can be installed at any location through use of git:

https://github.com/DCAN-Labs/functional-random-forest.git /destination/path/for/FRF

Dependencies

There are two versions of the FRF, the source version and the compiled version. The source FRF has two external dependencies:

  1. MATLAB version 2016 or higher
  2. MATLAB Machine learning and statistical toolbox
  3. MATLAB Parallell Computing Toolbox [OPTIONAL]

The compiled FRF has one external dependency:

  1. The MATLAB compiler engine which should be provided by the Fair Lab.

Both the source and compiled versions also require that Infomap (http://mapequation.org) is installed on your system.

There is also a python2.7 dependency in this project, to support a simpler interface to Infomap. In order to install this dependency you must install

  • python2.7
  • numpy
  • scipy

These can be installed to your system, or by use of virtualenv. The instructions for this will are located here

Using the package in the matlab environment

FRF was designed such that an individual does not need to access the matlab IDE to run the data. If one is planning to use any of the matlab functions themselves, one will need to add the package to the MATLAB path. After starting matlab type:

addpath(genpath('/destination/path/for/Repository/'))

and the matlab functions will be available within a matlab session.

Random forest subgroup detection (RFSD)

Preparing your data

Generating your data file

The FRF package requires input data represented as a 2D numeric or 2D cell matrix stored in a .mat format. Each row contains an independent case and each column contains a feature (e.g. a predictor variable or the outcome measure). Usually, the source of the input data is external to matlab and will need to be represented as a variable and stored in the .mat format. Depending on the type of source, different steps can be taken to prepare the data properly.

NOTE that empty cells in your data set are expected to be represented by NaNs before being passed into the RFSD algorithm. If you have empty cells, or cells with only whitespace, in an excel spreadsheet, the provided function PrepExcelSheetForTreeBagging will create a matlab struct with correctly formatted data.

Excel

For excel spreadsheets, a function, PrepExcelSheetForTreeBagging can be used. In a MATLAB environment, add the package to your MATLAB path (see: “Using the package in the matlab environment" under “Installation") type:

help PrepExcelSheetForTreeBagging

for documentation on usage.

On a command line, the PrepExcelSheetForTreeBagging_wrapper.sh can be used to prepare your data from an excel sheet. The wrapper requires you to create a parameter file, which can be modified from the existing PrepExcelSheetForTreeBagging_example.bash file. The contents of the file are reprinted below. Each parameter precedes its definition; all definitions are noted by hashmarks, excluding the first line:

#!/bin/bash

# path and filename where the excel spreadsheet is located
excelfile=/path/to/excelfile/.xls

# the name of the output (.mat) file.
output_matfile=/path/to/dataset.mat

# if set to anything but 0 or blank, the first row of the excel file is
#   a header and will be ignored
exists_header=0

# a numeric vector encapsulated by square brackets, where each number
#   denotes a column that represents a categorical variable, set to 0 if
#   no such variable exists
string_cols=[2 3 4]

# sets whether the output contains rows with missing data ('surrogate')
#   or excludes the rows ('no_surrogate')
type='surrogate'

# the name of the variable saved to the .mat file
varname=group_data

# the full path to the repository containing the RFAnalysis code.
repopath=/destination/path/for/FRF

# the name of the matlab command line executable, can include arguments
#   additional options, etc. SingleCompThread is enabled by default.
matlab_command=matlab

After generating a parameter file, one can prepare their data using the wrapper. From a bash terminal, one can execute the wrapper on his or her parameter file:

./RFSD/PrepExcelSheetForTreeBagging_wrapper.sh parameterfile.bash

NOTE that parameter files in the RFSD directory end with the file extension .bash. This is to help discern if a file is intended to be an executable (ending in .sh) or a parameters file.

The output from PrepExcelSheetForTreebagging will be a .mat file containing your named variable. Both the variable name and path must be modified in the ConstructModelTreeBag wrapper below (see: “Running the analysis").

R

If your data exists within an R data frame, R packages can be used to export the data frame as a .mat file explicitly (e.g. R.matlab: https://cran.r-project.org/web/packages/R.matlab/index.html). Since FRF accepts cell or numeric matrices, one can export a data frame as a cell matrix:

writeMat(con="…filepath",x=data)

or as a numeric matrix:

writeMat(con="…filepath",x=as.matrix(data))

R.matlab also enables one to load outputs into R for inspection (see: “loading outputs in R" under “Interpreting the Outputs").

python

If your data is extracted using python, scipy can be used to export the data as a .mat file explicitly. The data must be stored as a NxM numpy array or list where rows represent the cases (N) and columns represents the features.

data_mat=[]
data_mat.append(data)
scipy.io.savemat('group_data.mat', {'group_data':data_mat})

Scipy also enables one to load outputs into python for inspection (see: "loading outputs in python" under "Interpreting the Outputs")

CSV

CSVs (comma separated value files) can be implicitly handled by the FRF package. To use a CSV as an input, the following steps have to be taken:

  1. The CSV must be organized as a table (i.e. 2D matrix) with rows representing cases and columns representing features.
  2. The delimiter used by the CSV
  3. The CSV cannot contain any header information (i.e. a top row depicting the column headers).
  4. Missing or blank cells (see: “How to deal with missing data") must be represented by NaNs.

How to deal with missing data

Our RF approach can handle missing data via use of “surrogate splits". Specifically, when the RF model encounters missing data in the training data set, it will generate a small RF from the bootstrapped training data only that attempts to fill that missing value. For sets of cases/variables where data is missing, the actions taken depends on whether the variable is categorical or continuous. For continuous variables, one should either the leave the cell blank (unless using CSVs) or replace all missing values with the value NaN (not a number). In the case where the variable is categorical, the variable must be left blank.

Surrogate splits are not a panacea. For example, consider the extreme situation where 95 percent of cases are missing the given feature. Even if one could successfully predict the feature value for the 95 percent of cases using other data, such a relationship would imply that the feature is derived from other features. Therefore, such a feature would not contribute anything unique to the model. In our experience, it is critical for the user to perform exploratory data analysis (EDA) on their own datasets. If more than 15 percent of a given case or a given variable is missing data, one should remove the given case or variable. One can toggle whether surrogate splits are used in the approach by modifying the parameter file (see: “Generating your parameter file").

Generating your parameter file

The RFSD and FRF packages contain bash wrappers for executing the algorithm from the command line. Parameter files are used to store the configuration for the analysis. Below are the parameters available for running the RFSD using the ConstructModelTreeBag_wrapper.sh. These parameters can be found in the TreeBagParamFile_example.bash. We use this configuration file in order to support the many user tuned parameters.

In order to understand the layout of this Parameters File, we provide some intuition. Comments precede parameter definitions for clarity. Variable assigment (in bash), requires no space around the = character.

File sections that follow the pattern below, describe a conditional relationshi This relationship dictates that under a condition of variable_name, subvariable_name_one and subvariable_name_two will be used and must be defined. Otherwise, then these values will be ignored.

variable_name=false
    subvariable_name_one=monkey
    subvariable_name_two=7

Additionally, the use of 0 as an index denotes a false state. These tools are written in matlab, and therefore assumes a 1 index of colunmns and arrays

Comments of the form below, are used to designate categories of defined variable

# =====================
# == Sectional Title
# =====================

Our example config file is found here.

Running the analysis

RFSD allows for multiple different workflows to analyze your data. The manual will cover two standard and two optional workflows:

  1. RFSD analysis
  2. RFSD power analysis
  3. [OPTIONAL] Rerun subgroup analysis
  4. [OPTIONAL] Perform community detection on a proximity matrix

RFSD analysis

Once you have prepared the parameter file, you can run the RFSD analysis using the ConstructModelTreeBag_wrapper.sh command.

 ./RFSD/ConstructModelTreeBag_wrapper.sh paramfile.bash

The runtime for ConstructModelTreeBag varies by multiple factors. Individual forests are affected by the complexity of the data, the number of trees, and model construction parameters like the number of variables examined per branch. Total runtime is affected by factors like the number of participants, and the type of validation performed. With OOB turned on each RF model increases in runtime as the number of subjects increases. With variable importance turned on, the runtime will increase by the number of variables.

Currently, we have not performed a rigorous analysis of how each factor affects runtime. For a 10-fold cross-validation with 3 repetitions on 1000 trees, the RF validation should take no more than an hour to complete, if out of bag (OOB) error and variable importance are turned off. It is recommended to perform a single large model (10 times as many trees) on one repetition when evaluating OOB error or variable importance; a single RF model may take 1-4 hours if these options are enabled.

Subgroup detection runtime depends on the type of classifer, and the number of edge densities explored, as set by the parameter file. With the defaults enabled, subgroup detection will take 4 hours on unsupervised or regression models, and up to 12 hours on supervised classifiation models.

RFSD power analysis

The RFSD power analysis can be used to estimate statistical power for unsupervised and supervised analyses. A parameter for the expected effect size (as measured by accuracy, mean absolute error, or subgroup similarity) must be selected by the user to calculate statistical power. A range of parameters can be selected if the expected effect size is unknown. A sample size must also be selected by the user. To run multiple sample sizes, one will need to generate parameter files for each sample size.

RFSD power analysis requires preparation of the RFSD parameter file. Because the power analysis is computationally intensive, we recommended using a computing cluster to enable parallelization of the simulations. The parameter file uses the same formatting as for the RFSD analysis, and can be found here.

Once you have prepared the parameter file, you can run the RFSD analysis using the RunRFSDPowerAnalysis_wrapper.sh command.

 ./RFSD/RunRFSDPowerAnalysis_wrapper.sh RFSD_paramfile.bash

Interpreting the outputs

Outputs are stored as .mat files, which encapsulate several matlab variables. Although there are multiple intermediate outputs, only two contain the final outputs from the analysis:

  1. ExcelOutput.mat -- as generated by PrepExcelSheetForTreeBagging.m and named by parameter output_matfile,
  2. filename.mat -- as generated by ConstructModelTreeBag.m and named by parameter filename, contains relevant outputs from the RF models
  3. filename_output/subgroup_community_assignments.mat -- as generated by RunAndVisualizeCommunityDetection.m and named by parameter filename, contains relevant outputs from the subgroup detection
  4. RFSD_filename_output.mat -- as generated by RunRFSDPowerAnalysis.m and named by parameter output_directory, contains outputs from the RFSD power analysis, such as statistical power and false positive rates.
  5. If a supervised classifier is selected, additional intermediate outputs will be saved in: filename_output_groupX/ -- as generated by VisualizeTreeBaggingResults.m and named by parameter filename, contains relevant outputs from group (class) number X

Stored outputs

Data dictionaries for output files are specified below.

ExcelOutput.mat data dictionary
variable name output from used in matlab datatype R datatype python datatype dimensions null value description
group_data PrepExcelSheetForTreeBagging ConstructModelTreeBag, VisualizeTreeBaggingResults , RunRFSDPowerAnalysis numeric matrix data matrix numpy array 2 cannot be null Matlab formated matrix representing input data
filename.mat data dictionary
variable name output from used in matlab datatype R datatype python datatype dimensions minimum value maximum value null value description
accuracy ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 2-3 -1 1 cannot be null model performance: first dimension reflects statistic type (e.g. group accuracy, total performance, error, or correlation), second dimension reflects repetition/fold index, third dimension reflects repetition index (if CV is enabled).
features ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 0 +inf cannot be null #times each predictor is used
final_data ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 2 depends on dataset depends on dataset cannot be null data excluding the outcome variable, can be used with "subgroup_community_assignments" to evaluate features used
final_outcomes ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 depends on dataset depends on dataset cannot be null outcome_variable, can be used with "subgroup_community_assignments" to evaluate outcomes by subgroup
features ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 0 +inf cannot be null #times each predictor is used
group1class ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 -1 1 cannot be null individual performance for first dataset
group1predict ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 outcome dependent outcome dependent cannot be null model prediction for first dataset
group1scores ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 outcome dependent outcome dependent cannot be null model score for first dataset
group2class ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 -1 1 NaN individual performance for second dataset
group2predict ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 outcome dependent outcome dependent NaN model prediction for second dataset
group2scores ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 1 outcome dependent outcome dependent NaN model score for second dataset
npredictors ConstructModelTreebag VisualizeTreeBaggingResults numeric int32 int32 scalar 1 # of features NaN number of predictors used to grow each branch
outofbag_error ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 2 0 1 NaN Cumulative OOB error: first dimension reflects repetition index, second dimension reflects # of trees
outofbag_varimp ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 2 -1 1 NaN OOB measured variable importance: first dimension reflects repetition index, second dimension reflects feature index
permute_accuracy ConstructModelTreebag VisualizeTreeBaggingResults numeric matrix data matrix numpy array 2-3 -1 1 NaN Null model performance. See "accuracy" for format.
proxmat ConstructModelTreebag RunAndVisualizeCommunityDetection cell matrix list of data matrices list of numpy arrays 1 0 1 cannot be null NxN proximity matrices generated from the RF, one per forest is saved to limit space. N refers to the number of subjects. If group2_validate_only is set to true, then the proximity matrix will only reflect the independent testing dataset.
treebag ConstructModelTreebag ConstructModelTreeBag cell matrix containing RF objects N/A N/A 1 N/A N/A NaN this variable contains each model generated from the RF validation. The size of the matrix depends on the number of forests generated. Each model is stored as a TreeBagger class object, and cannot be loaded yet in R or python.
trimmed_features ConstructModelTreebag R/MATLAB/python data matrix numpy array 2-3 1 # of features NaN NaN If features were trimmed using the legacy "estimate_features" option, this variable contains an index of which features were used. We do not advise using this feature.
filename_output/subgroup_community_assignments.mat data dictionary
variable name output from used in matlab datatype R datatype python datatype dimensions minimum value maximum value null value description
community_subgroup_performance VisualizeTreeBaggingResults R/MATLAB/python cell matrix list of lists list of numpy arrays 1 -1 1 cannot be null Contains the model performance for each individual, split first by class, then by subgroup within class. Each subgroup list contains a matrix of number reflecting performance. Mean absolute error is used for regression, accuracy for classification.
subgroup_community_assignments VisualizeTreeBaggingResults R/MATLAB/python cell matrix list list 1 G1_1 GX_Xs cannot be null list of subgroups identified by the RFSD, X refers to the class, Xs refers to the subgroup within the class
subgroup_community_num VisualizeTreeBaggingResults R/MATLAB/python numeric matrix data matrix numpy array 2 1 +inf cannot be null numeric representation of subgroup_community_assignments: first column represents the class index, second column represents the subgroup index for that subject
subgroup_communities VisualizeTreeBaggingResults R/MATLAB/python cell matrix list of data matrices list of numpy arrays 1 1 +inf cannot be null list of subgroup communities for each class
subgroup_sorting_orders VisualizeTreeBaggingResults R/MATLAB/python cell matrix list of data matrices list of numpy arrays 1 1 # of cases cannot be null list of order of subjects in the other variables split by subgroup, can be used to query "filename_output.mat"
proxmat_subgroup_sorted VisualizeTreeBaggingResults R/MATLAB/python numeric_matrix data matrix numpy array 2 0 1 cannot be null The mean proximity matrix (NxN where N is the number of cases) sorted first by class and then by subgroup.
RFSD_filename_output.mat data dictionary
variable name output from used in matlab datatype R datatype python datatype dimensions minimum value maximum value null value description
observed_performance RunRFSDPowerAnalysis R/MATLAB/python numeric matrix data matrix numpy array 2 -1 1 cannot be null performance when H=1, depending on the question, multiple performances may be observed, rows contain different metrics, columns contain different runs
null_performance RunRFSDPowerAnalysis R/MATLAB/python numeric matrix data matrix numpy array 2 -1 1 cannot be null performance when H=0, depending on the question, multiple performance metrics may be calculated, rows contain different metrics, columns contain different runs
sample_size RunRFSDPowerAnalysis R/MATLAB/python numeric int32 int32 1 1 +inf cannot be null sample size specified by the user in the parameter file
statistical_power RunRFSDPowerAnalysis R/MATLAB/python numeric matrix data matrix numpy array 2 0 1 cannot be null statistical power for each performance metric. Rows contain different performance metrics, columns contain different runs. First row represents the study power.
false_positive RunRFSDPowerAnalysis R/MATLAB/python numeric matrix data matrix numpy array 2 0 1 cannot be null false positive rate for each performance metric. Rows contain different performance metrics, columns contain different runs. First row represents the study power.
performance_thresholds RunRFSDPowerAnalysis R/MATLAB/python numeric_matrix data matrix numpy array 1 0 1 cannot be null performance threshold(s) specified by the user

Data dictionaries will be expanded for non-critical outputs in upcoming releases. Please contact the development team via github for questions and requests.

Loading outputs externally

All output matlab files are stored in HDF5 format. Therefore, any software package that can read HDF5. Below are HDF5 packages that should work with our formats.

The other packages below should also be able to load the outputs.

Loading outputs in R

Output matlab files can be read into R using the R.matlab package.

FRF_outputs <- readmat(con="filename.mat")

Outputs are represented as integers, data matrices and lists. Please consult the respective data dictionary for variable names and formatting. Data visualization R scripts are available and will be documented in a future release.

Loading outputs in python

Output matlab files can be read into python using scipy.

FRF_outputs = scipy.io.loadmat('filename.mat')

Outputs are represented as numpy arrays, integers, and lists. Please consult the respective data dictionary for variable names and formatting. Python data visualization code is planned for a future release.

Data visualizations

Examples of data visualizations are shown below. Examples will be added with new releases.

Performance histograms (classification): alt text

Performance histograms (regression): alt text

Proximity matrix (unsorted)

alt text

Proximity matrix (sorted)

alt text

Proximity matrix sorted by subgroup and class (if available) alt text

Subgroup community plot (unsorted)

each colored line represents a subject belonging to that color's subgroup, in provided order

alt text

Subgroup community plot (sorted)

same plot as above, except grouped by subgroup to represent population

alt text

Subgroup community plot with class labels (if available)

Two column plot. Left is classification; Right is subgroup within classification

alt text

Feature usage alt text

Feature error bar plots by subgroup alt text

Overall model performance alt text

Guide to determine community detection parameters

Under Construction

Post-hoc strategies for validating subgroups

Under Construction

known bugs and issues

Documentation issues

  • Guides for subgroup validation and community detection need to be documented
  • Usage of our R scripts needs to be documented
  • Data dictionaries for intermediate files need to be documented
  • Figure dictionary needs to be generated
  • Data visualizations need to be finished
  • holdout procedure for twin studies is undocumented

known bugs

  • Unsupervised RFSD power analysis may produce inaccurate performance estimates
  • Currently, output results are split into a "group1" and "group2" variable even when "use_group2_data" is set to "false". In this case, "group1" refers to the first half of all rows in the input data, "group2" refers to the second half. This bug remains due to backwards compatability and will be resolved in the next release.

planned features

  • enable hyperparameter tuning for RFs, currently, one needs to rerun the FRF to tune hyperparameters
  • test and refine holdout code for twin studies

Functional Random Forest (FRF)

Preparing your data

Generating your data file

How to deal with wave/non-anchored data

Generating your parameter file

How to determine optimal function parameters

Running the analysis

Correlation trajectory approach

Random forest approach

Interpreting the outputs

Stored outputs

Data visualizations

Post-hoc strategies for validating subgroups

About

License:Other


Languages

Language:MATLAB 87.6%Language:Shell 9.6%Language:Python 2.8%