RomeroBarata / dcme

An R Package to Compute Data Complexity Measures

dcme

Overview

The dcme package provides R functions to compute data complexity measures for classification data sets.

Installation

dcme is under development and not yet available on CRAN. You can install the development version using the devtools package as follows:

# install.packages("devtools")
devtools::install_github("RomeroBarata/dcme")

Data Complexity Measures

The following complexity measures are currently implemented:

Simple Measures

  • num_examples: Number of Observations
  • num_examples_majority: Number of Observations in the Majority Class
  • num_examples_minority: Number of Observations in the Minority Class
  • num_features: Number of Features
  • num_features_numeric: Number of Numeric Features
  • num_features_binary: Number of Binary Features
  • num_features_categorical: Number of Categorical Features
  • num_classes: Number of Classes
  • proportion_examples_majority: Proportion of Majority Examples
  • proportion_examples_minority: Proportion of Minority Examples
  • proportion_features_numeric: Proportion of Numeric Features
  • proportion_features_binary: Proportion of Binary Features
  • proportion_features_categorical: Proportion of Categorical Features
  • IR: Imbalance Ratio

The measures num_examples_majority, num_examples_minority, proportion_examples_majority, proportion_examples_minority, and IR are defined only for binary (two-class) data sets.
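To illustrate what the class-count measures describe, here is a minimal base R sketch. It does not call dcme: `imbalance_summary` is a hypothetical helper, and taking the imbalance ratio as the majority count divided by the minority count is an assumption about the convention.

```r
# Hypothetical helper (not part of dcme): class counts, proportions, and
# imbalance ratio for a two-class label vector.
imbalance_summary <- function(y) {
  counts <- table(y)
  stopifnot(length(counts) == 2)  # only meaningful for binary data sets
  n_maj <- max(counts)
  n_min <- min(counts)
  n <- sum(counts)
  list(
    num_examples_majority = as.integer(n_maj),
    num_examples_minority = as.integer(n_min),
    proportion_examples_majority = as.numeric(n_maj / n),
    proportion_examples_minority = as.numeric(n_min / n),
    IR = as.numeric(n_maj / n_min)  # assumed definition: majority / minority
  )
}

# Toy labels: 90 negative and 10 positive observations.
y <- factor(c(rep("neg", 90), rep("pos", 10)))
res <- imbalance_summary(y)
```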

Statistical Measures

  • sd_ratio: Geometric Mean Ratio of Standard Deviations
  • corr_abs: Mean Absolute Correlation Coefficient
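The idea behind a mean absolute correlation can be sketched in base R as follows. This is a simplified illustration and not dcme's implementation: `mean_abs_correlation` is a hypothetical name, and averaging the per-class values without weighting by class size is an assumption.

```r
# Hypothetical helper (not part of dcme): average absolute Pearson
# correlation over all feature pairs, computed within each class and then
# averaged across classes (unweighted, by assumption).
mean_abs_correlation <- function(x, y) {
  x <- as.matrix(x)
  y <- factor(y)
  per_class <- vapply(levels(y), function(k) {
    r <- cor(x[y == k, , drop = FALSE])  # within-class correlation matrix
    mean(abs(r[upper.tri(r)]))           # each feature pair counted once
  }, numeric(1))
  mean(per_class)
}

# Toy data: the two features are perfectly correlated in class "a" and
# perfectly anti-correlated in class "b".
x <- rbind(c(1, 2), c(2, 4), c(3, 6),
           c(1, -1), c(2, -2), c(3, -3))
y <- c("a", "a", "a", "b", "b", "b")
rho <- mean_abs_correlation(x, y)
```

Values close to 1 indicate strongly redundant features; values close to 0 indicate nearly uncorrelated features.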

Measures of Overlap of Individual Feature Values

  • F1: Fisher's Discriminant Ratio
  • F2: Volume of Overlap Region

Currently, the F1 and F2 measures are implemented only for binary data sets. Multi-class versions are planned.
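For the binary case, the two overlap measures can be sketched in base R along the lines of Ho and Basu [2]. The function names below are hypothetical and are not dcme's API. Clamping a negative overlap to zero in the F2 sketch is an assumption, and constant features are not handled.

```r
# Hypothetical helper (not part of dcme): maximum Fisher's discriminant
# ratio over all features, for two classes.
fisher_ratio <- function(x, y) {
  x <- as.matrix(x)
  y <- factor(y)
  stopifnot(nlevels(y) == 2)
  a <- x[y == levels(y)[1], , drop = FALSE]
  b <- x[y == levels(y)[2], , drop = FALSE]
  num <- (colMeans(a) - colMeans(b))^2        # squared mean difference
  den <- apply(a, 2, var) + apply(b, 2, var)  # sum of class variances
  max(num / den)
}

# Hypothetical helper (not part of dcme): product over features of the
# overlap of the two class ranges, normalised by the total range.
overlap_volume <- function(x, y) {
  x <- as.matrix(x)
  y <- factor(y)
  stopifnot(nlevels(y) == 2)
  a <- x[y == levels(y)[1], , drop = FALSE]
  b <- x[y == levels(y)[2], , drop = FALSE]
  hi_a <- apply(a, 2, max); lo_a <- apply(a, 2, min)
  hi_b <- apply(b, 2, max); lo_b <- apply(b, 2, min)
  overlap <- pmax(0, pmin(hi_a, hi_b) - pmax(lo_a, lo_b))  # clamp: assumption
  total <- pmax(hi_a, hi_b) - pmin(lo_a, lo_b)
  prod(overlap / total)
}

y <- c("a", "a", "b", "b")
f1 <- fisher_ratio(c(0, 2, 10, 12), y)           # well-separated classes
f2_apart <- overlap_volume(c(0, 2, 10, 12), y)   # class ranges do not overlap
f2_mixed <- overlap_volume(c(0, 4, 2, 6), y)     # class ranges overlap on [2, 4]
```

A large F1 and a small F2 both suggest that at least one feature separates the classes well.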

Measures of Separability of Classes

  • N2: Ratio of Average Intra/Inter Class 1-NN Distance
  • N3: Error Rate of 1-NN Classifier
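Both separability measures rely on nearest-neighbour distances, which the following base R sketch makes concrete. It is an illustration under assumptions, not dcme's code: `nn_measures` is a hypothetical helper, distances are Euclidean, N3 is estimated by leave-one-out, and every class is assumed to have at least two observations.

```r
# Hypothetical helper (not part of dcme): N2 and N3 from a pairwise
# Euclidean distance matrix.
nn_measures <- function(x, y) {
  x <- as.matrix(x)
  yi <- as.integer(factor(y))
  d <- as.matrix(dist(x))
  diag(d) <- Inf                 # a point is never its own neighbour
  same <- outer(yi, yi, "==")    # TRUE where two points share a class

  # N2: summed distance to the nearest same-class neighbour divided by the
  # summed distance to the nearest other-class neighbour.
  intra <- apply(ifelse(same, d, Inf), 1, min)
  inter <- apply(ifelse(same, Inf, d), 1, min)

  # N3: leave-one-out error rate of the 1-NN classifier.
  nn <- apply(d, 1, which.min)

  list(N2 = sum(intra) / sum(inter), N3 = mean(yi[nn] != yi))
}

# Toy data: two tight, well-separated clusters on a line.
x <- c(0, 1, 10, 11)
y <- c("a", "a", "b", "b")
res <- nn_measures(x, y)
```

Low values of both measures indicate compact, well-separated classes.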

Measures of Geometry, Topology, and Density of Manifolds

  • N4: Nonlinearity of the 1-NN Classifier
  • T2: Average Number of Points per Dimension
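A rough base R sketch of these two measures follows. It is not dcme's implementation: `n4_nonlinearity` is a hypothetical helper, and the number of interpolated test points, the uniform interpolation weights, and the sampling of classes in proportion to their frequency are all assumptions.

```r
# Hypothetical helper (not part of dcme): error rate of a 1-NN classifier on
# points created by linear interpolation between random same-class pairs.
n4_nonlinearity <- function(x, y, n_new = nrow(x)) {
  x <- as.matrix(x)
  yi <- as.integer(factor(y))
  # Draw a class for each new point, proportional to class frequency.
  cls <- sample(yi, n_new, replace = TRUE)
  pts <- vapply(cls, function(k) {
    idx <- which(yi == k)
    p <- idx[sample.int(length(idx), 2, replace = TRUE)]
    w <- runif(1)
    w * x[p[1], ] + (1 - w) * x[p[2], ]
  }, numeric(ncol(x)))
  new_x <- matrix(pts, ncol = ncol(x), byrow = TRUE)  # one row per new point
  # Classify each new point by its nearest training point.
  pred <- apply(new_x, 1, function(p) {
    yi[which.min(colSums((t(x) - p)^2))]
  })
  mean(pred != cls)
}

set.seed(1)
x <- matrix(c(0, 1, 10, 11), ncol = 1)
y <- c("a", "a", "b", "b")
n4 <- n4_nonlinearity(x, y, n_new = 50)

# T2: average number of points per dimension.
t2 <- nrow(x) / ncol(x)
```

In this toy example every interpolated point stays inside its own cluster, so the 1-NN classifier makes no errors.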

References

Definitions and explanations of most functions implemented in the dcme package can be found in the following literature:

[1] Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification.

[2] Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289-300.
