miriamspsantos / pycol

The Python Class Overlap Library (pycol) assembles a comprehensive set of complexity measures associated with the characterization of the Class Overlap problem.

pycol: Python Class Overlap Library

The Python Class Overlap Library (pycol) assembles a set of data complexity measures associated with the problem of class overlap.

The combination of class imbalance and overlap is currently one of the most challenging issues in machine learning. However, the identification and characterisation of class overlap in imbalanced domains is a subject that still troubles researchers in the field as, to this point, there is no clear, standard, well-formulated definition and measurement of class overlap for real-world domains.

This library characterises the problem of class overlap according to multiple sources of complexity, where four main class overlap representations are acknowledged: Feature Overlap, Instance Overlap, Structural Overlap, and Multiresolution Overlap.

Existing open-source implementations of complexity measures include DCoL (C++), ECoL, and the more recent ImbCoL, SCoL, and mfe packages (R), as well as pymfe (Python). For class overlap, these packages implement the following measures: F1, F1v, F2, F3, F4, N1, N2, N3, N4, T1 and LSCAvg. ImbCoL further provides a decomposition by class of the original measures, and SCoL focuses on simulated complexity measures. To foster the study of a more comprehensive set of class overlap measures, we provide an extended Python library that comprises the class overlap measures included in the previous packages, as well as an additional set of measures proposed in recent years. The library also implements adaptations of complexity measures to class imbalance.

Overall, pycol characterises class overlap as a heterogeneous concept, comprising distinct sources of complexity, and the following measures are implemented:

Feature Overlap:

  • F1: Maximum Fisher's Discriminant Ratio
  • F1v: Directional Vector Maximum Fisher's Discriminant Ratio
  • F2: Volume of Overlapping Region
  • F3: Maximum Individual Feature Efficiency
  • F4: Collective Feature Efficiency
  • IN: Input Noise
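
To give a feel for what a feature overlap measure computes, the following is a minimal sketch of the textbook two-class Fisher's discriminant ratio behind F1, written in plain Python. It is not pycol's code: the function names are hypothetical, population variances are an assumption, and some implementations report a transformed value (so that higher means more overlap), so results may not match the library's output.

```python
from statistics import mean, pvariance


def fisher_ratio_per_feature(X, y):
    """Two-class Fisher's discriminant ratio for each feature (column) of X."""
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    ratios = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == labels[0]]
        b = [row[j] for row, label in zip(X, y) if label == labels[1]]
        numerator = (mean(a) - mean(b)) ** 2
        denominator = pvariance(a) + pvariance(b)
        if denominator == 0:
            # Both classes are constant on this feature.
            ratios.append(float("inf") if numerator > 0 else 0.0)
        else:
            ratios.append(numerator / denominator)
    return ratios


def max_fisher_ratio(X, y):
    """Largest per-feature ratio: the most discriminative single feature."""
    return max(fisher_ratio_per_feature(X, y))
```

A high ratio on at least one feature suggests that the classes are well separated along that feature; low ratios on all features suggest feature overlap.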

Instance Overlap:

  • R-value
  • Raug: Augmented R-value
  • degOver
  • N3: Error Rate of the Nearest Neighbour Classifier
  • SI: Separability Index
  • N4: Non-Linearity of the Nearest Neighbour Classifier
  • kDN: K-Disagreeing Neighbours
  • D3: Class Density in the Overlap Region
  • CM: Complexity Metric Based on k-nearest neighbours
  • wCM: Weighted Complexity Metric
  • dwCM: Dual Weighted Complexity Metric
  • Borderline Examples
  • IPoints: Number of Invasive Points
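
Instance overlap measures look at the neighbourhood of each example. As an illustration, the sketch below computes kDN (k-Disagreeing Neighbours) in plain Python with Euclidean distance. The function name, the brute-force neighbour search, and the choice of distance are assumptions made for this example; pycol's implementation and its distance_func option may behave differently.

```python
import math


def kdn(X, y, k=5):
    """Fraction of each instance's k nearest neighbours that carry a different label."""
    if not 1 <= k < len(X):
        raise ValueError("k must be between 1 and len(X) - 1")
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances to every other instance (the instance itself is excluded).
        others = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in others[:k]]
        disagreeing = sum(1 for j in neighbours if y[j] != yi)
        scores.append(disagreeing / k)
    return scores
```

With k=1 each score is either 0 or 1, and the mean over all instances is the leave-one-out error rate of the nearest neighbour classifier, which is the idea behind N3.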

Structural Overlap:

  • N1: Fraction of Borderline Points
  • T1: Fraction of Hyperspheres Covering Data
  • Clst: Number of Clusters
  • ONB: Overlap Number of Balls
  • LSCAvg: Local Set Average Cardinality
  • DBC: Decision Boundary Complexity
  • N2: Ratio of Intra/Extra Class Nearest Neighbour Distance
  • NSG: Number of Samples per Group
  • ICSV: Inter-Class Scale Variation
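
Structural overlap measures describe the shape of the data rather than single neighbourhoods. The sketch below illustrates the usual definition of N1: build a minimum spanning tree over all points and report the fraction of points that share a tree edge with a point of another class. It uses Prim's algorithm and Euclidean distance in plain Python; the function name is hypothetical and this is not the library's implementation.

```python
import math


def n1_borderline_fraction(X, y):
    """Fraction of points joined by a minimum-spanning-tree edge to another class."""
    n = len(X)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from the tree to each point
    best_from = [-1] * n        # tree vertex providing that link
    best_dist[0] = 0.0
    borderline = set()
    for _ in range(n):
        # Prim's algorithm: add the closest point not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best_dist.__getitem__)
        in_tree[u] = True
        parent = best_from[u]
        if parent != -1 and y[parent] != y[u]:
            # The new tree edge crosses a class boundary.
            borderline.update((parent, u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return len(borderline) / n
```

Values close to 0 indicate that few points sit next to the other class in the tree, while values close to 1 indicate heavily interleaved classes.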

Multiresolution Overlap:

  • MRCA: Multiresolution Complexity Analysis
  • C1: Case Base Complexity Profile
  • C2: Similarity-Weighted Case Base Complexity Profile
  • Purity
  • Neighbourhood Separability
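
Multiresolution measures repeat a local analysis at several neighbourhood sizes and aggregate the results. The sketch below shows one common reading of the case base complexity profile idea behind C1: for each instance, take the proportion of same-class examples among its k nearest neighbours for k = 1..K, average those proportions, and subtract the average from 1. The function name, the default K, and the Euclidean distance are assumptions for illustration; pycol's C1 may differ in detail.

```python
import math


def case_base_complexity(X, y, K=3):
    """Per-instance complexity: 1 minus the mean same-class proportion over k = 1..K."""
    if not 1 <= K < len(X):
        raise ValueError("K must be between 1 and len(X) - 1")
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Rank every other instance by distance to instance i.
        ranked = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        same_so_far = 0
        proportion_sum = 0.0
        for k, (_, j) in enumerate(ranked[:K], start=1):
            same_so_far += (y[j] == yi)
            proportion_sum += same_so_far / k  # P_k: same-class share of the k nearest
        scores.append(1.0 - proportion_sum / K)
    return scores
```

Averaging the per-instance scores gives a single dataset-level value; instances with scores near 1 are surrounded by the other class at every resolution considered.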

For more information regarding the specified complexity measures, please refer to the following paper:

Santos, M. S., Abreu, P. H., Japkowicz, N., Fernández, A., Soares, C., Wilk, S., & Santos, J. (2022). On the joint-effect of Class Imbalance and Overlap: A Critical Review. Accepted for publication in Artificial Intelligence Review.

If you would like to use the images provided in the paper to illustrate your own manuscript, report, blog, or presentation, they are available here.

Usage Example:

The dataset folder contains some datasets with binary and multi-class problems. All datasets are numerical and have no missing values. The complexity.py module implements the complexity measures. To run the measures, the Complexity class is instantiated and the results may be obtained as follows:

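# Assumes complexity.py (from this repository) is on the import path
from complexity import Complexity

complexity = Complexity("dataset/61_iris.arff", distance_func="default", file_type="arff")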

# Feature Overlap
print(complexity.F1())
print(complexity.F1v())
print(complexity.F2())
# (...)

# Instance Overlap
print(complexity.R_value())
print(complexity.deg_overlap())
print(complexity.CM())
# (...)

# Structural Overlap
print(complexity.N1())
print(complexity.T1())
print(complexity.Clust())
# (...)

# Multiresolution Overlap
print(complexity.MRCA())
print(complexity.C1())
print(complexity.purity())
# (...)
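
When several measures are needed at once, the individual calls above can be gathered into a dictionary. The helper below is a hypothetical convenience function, not part of pycol; it only assumes that each measure is exposed as a method that can be called without arguments, as in the calls above.

```python
def collect_measures(complexity, names):
    """Call each named zero-argument method on `complexity` and gather the results."""
    results = {}
    for name in names:
        method = getattr(complexity, name)  # raises AttributeError for unknown names
        results[name] = method()
    return results


# Example (requires a Complexity instance created as shown above):
# results = collect_measures(complexity, ["F1", "N1", "T1"])
# for name, value in results.items():
#     print(name, value)
```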

Developer notes:

To report bugs or request features, open an issue on the project's issue tracker.

Citation Request:

If you plan to use this library, please consider referring to the following paper:

Santos, M. S., Abreu, P. H., Japkowicz, N., Fernández, A., Soares, C., Wilk, S., & Santos, J. (2022). On the joint-effect of Class Imbalance and Overlap: A Critical Review. Accepted for publication in Artificial Intelligence Review.

Licence:

The project is licensed under the MIT License - see the License file for details.

Acknowledgements:

Some complexity measures implemented in pycol are based on the implementation in pymfe. We also thank José Daniel Pascual-Triana for providing the implementation of ONB.
