joseTamezPena / FCA

Heuristic Multidimensional Correlation Analysis

Iterative Decorrelation Analysis (IDeA) and the Unit of Measurement Preserving Spatial Transformation Matrices (UPSTM)

Fig. 1. The weights ($w_j^i$) of the UPSTM matrix (W) are estimated by the IDeA algorithm.

_____________________________________________________________________________________________________________

Many multidimensional/multimodality data sets contain continuous features that are collinear, correlated, or otherwise associated. The goal of spatial transformations is to find a set of latent variables with minimum correlation, so that downstream data analysis is simplified. Common data transformations include statistically driven approaches such as principal component analysis (PCA), exploratory factor analysis (EFA), and canonical-correlation analysis (CCA). An algorithmic alternative to these statistical approaches is Iterative Decorrelation Analysis (IDeA), also called Heuristic Multidimensional Correlation Analysis (HMCA) in this repository. The main advantage of the iterative approach is that it is driven by specific output requirements:

  1. Each output variable in $Q=(q_1,\dots,q_n)$ has a parent input variable in $X=(x_1,\dots,x_n)$ (see Fig. 1).

  2. The user can specify the maximum significant correlation coefficient allowed among the returned variables: no pair of output variables should have a statistically significant correlation greater than the user-specified goal.

    • If the correlation between $(q_i,q_j)$ is below the maximum, or is not statistically significant, the algorithm does not try to remove it.

    • The correlation measure is user-specified: Pearson's $r$, Spearman's $ρ$, or Kendall's $τ$.
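The decision rule in requirement 2 can be sketched in base R. The helper name `needs_decorrelation` and its arguments `thr` and `alpha` are hypothetical and chosen only for illustration; they are not part of FRESA.CAD. A pair is flagged only when its correlation is both above the goal and statistically significant.

```r
# Hypothetical helper (not a FRESA.CAD function): does a pair of variables
# still violate the user-specified correlation goal?
needs_decorrelation <- function(x, y, thr = 0.25, alpha = 0.05,
                                method = c("pearson", "spearman", "kendall")) {
  method <- match.arg(method)
  # exact = FALSE avoids tie warnings for the rank-based methods
  test <- suppressWarnings(cor.test(x, y, method = method, exact = FALSE))
  r <- unname(test$estimate)
  # Flag the pair only if the correlation is both large and significant
  abs(r) > thr && test$p.value < alpha
}

data(iris)
# Strongly correlated pair: flagged
strong <- needs_decorrelation(iris$Petal.Length, iris$Petal.Width)
# Weakly correlated pair (Pearson |r| is about 0.12): left alone
weak <- needs_decorrelation(iris$Sepal.Length, iris$Sepal.Width)
```

Swapping `method` changes only how the association is measured; the two-part test (magnitude and significance) stays the same.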

These requirements are addressed by a heuristic algorithm that creates a goal-driven spatial transformation matrix (GDSTM). Besides a correlation method and a correlation goal, the algorithm requires a linear modeling function, so users can choose between ordinary linear fits and robust fits. For machine learning applications, the user can also specify the target outcome.
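To make the idea of a goal-driven transformation concrete, here is a toy sketch in base R. It is not the FRESA.CAD implementation: the function `toy_decorrelate`, its arguments, and the fixed left-to-right variable ordering are all assumptions made for illustration. Each variable is regressed (with `lm`) on the earlier latent variables that violate the correlation goal, and the fitted weights are accumulated in a matrix `W` with a unit diagonal, so every output keeps its parent input.

```r
# Toy sketch, NOT the FRESA.CAD algorithm: sequentially remove significant
# correlations above `thr` using ordinary least squares.
toy_decorrelate <- function(X, thr = 0.25, alpha = 0.05) {
  X <- as.matrix(X)
  p <- ncol(X)
  W <- diag(p)                                   # start from the identity
  dimnames(W) <- list(colnames(X), colnames(X))
  Q <- X
  for (j in seq_len(p)[-1]) {
    used <- integer(0)                           # predecessors regressed out so far
    repeat {
      flagged <- integer(0)
      for (i in setdiff(seq_len(j - 1), used)) {
        test <- cor.test(Q[, i], Q[, j])
        if (abs(test$estimate) > thr && test$p.value < alpha) {
          flagged <- c(flagged, i)
        }
      }
      if (length(flagged) == 0) break            # goal met for variable j
      used <- c(used, flagged)
      # Linear model of the parent variable on the selected latent variables
      fit <- lm(X[, j] ~ Q[, used, drop = FALSE])
      b <- coef(fit)[-1]                         # drop the intercept
      W[, j] <- diag(p)[, j] - W[, used, drop = FALSE] %*% b
      Q[, j] <- X %*% W[, j]
    }
  }
  list(W = W, Q = Q)
}

res <- toy_decorrelate(iris[, 1:4])
round(res$W, 3)        # unit diagonal: each output keeps its parent input
round(cor(res$Q), 3)   # remaining correlations
```

A robust fit could be substituted for `lm` in the same place, which is the role the linear modeling function plays in the description above.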

Software

The HMCA algorithm is implemented in the FRESA.CAD R package.

Installing the latest version:

library(devtools)
install_github("joseTamezPena/FRESA.CAD")

Installing from CRAN:

install.packages("FRESA.CAD")

Usage

library("FRESA.CAD")
data('iris')

## HMCA decorrelation with a 0.25 correlation threshold (Pearson, fast estimation)
irisDecor <- IDeA(iris, thr = 0.25)

### Print the latent variables
print(getLatentCoefficients(irisDecor))

Output:

$La_Sepal.Length
Sepal.Length Petal.Length 
   1.0000000   -0.4089223 

$La_Sepal.Width
Sepal.Length  Sepal.Width Petal.Length 
  -0.5611860    1.0000000    0.3352667 

$La_Petal.Width
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.1250483   -0.2228285   -0.4904624    1.0000000 
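
The printed coefficients can be applied by hand to see what the latent variables are. The sketch below uses only base R and the numbers shown above. It assumes that each printed vector is one column of the transformation matrix W (so that Q = XW) and that Petal.Length, which has no entry in the output, passes through unchanged; both are readings of the output, not documented behavior.

```r
data(iris)
X <- as.matrix(iris[, 1:4])

# Assumption: each printed vector is one column of W; Petal.Length has no
# entry in the output, so its column is left as the identity.
W <- diag(4)
dimnames(W) <- list(
  colnames(X),
  c("La_Sepal.Length", "La_Sepal.Width", "Petal.Length", "La_Petal.Width")
)
W[c("Sepal.Length", "Petal.Length"), "La_Sepal.Length"] <- c(1, -0.4089223)
W[c("Sepal.Length", "Sepal.Width", "Petal.Length"), "La_Sepal.Width"] <-
  c(-0.5611860, 1, 0.3352667)
W[, "La_Petal.Width"] <- c(0.1250483, -0.2228285, -0.4904624, 1)

# Latent variables as linear combinations of the original measurements
Q <- X %*% W

# Correlation matrix of the transformed variables
round(cor(Q), 3)
```

Because every latent variable has a weight of 1 on its parent, it stays in the parent's unit of measurement, for example La_Sepal.Length = Sepal.Length - 0.4089223 × Petal.Length.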
   

Advanced Examples

This repository shows some examples of the FRESA.CAD::GDSTMDecorrelation(), FRESA.CAD::getLatentCoefficients(decorrelatedobject), and FRESA.CAD::filteredFit() functions.

  • irisexample.R showcases the effect of the HMCA algorithm on the iris data set.

    • Here is an example of the output
  • ParkisonAnalysis_TrainTest.Rmd is a demo that shows the use of GDSTM and BSWiMS to gain insight into the features associated with a relevant outcome. It highlights the process and functions that help authors discern and statistically describe the relevant features associated with a specific outcome.

  • FDeA_Options_testing.Rmd runs a script on the Vehicle data set that showcases the use of GDSTMDecorrelation() for decorrelation, feature analysis, and ML (NB).

  • FDeA_Options_testing_mfeat.Rmd runs a simpler script on the multiple feature dataset.

  • FDeA_ML_testing_sonar.Rmd is an example of how to run filteredFit() (NB and LASSO) with decorrelation on the Sonar dataset.

  • FDeA_ML_testing_ARCENE.Rmd is an example of filteredFit() (Logistic LASSO) with decorrelation on the Arcene dataset. (Due to the large dimensions of the ARCENE dataset, the script will take several minutes to run.)

About

License:GNU General Public License v3.0


Languages

Rich Text Format 57.5%, R 42.5%