Anylee2142 / MDLP

:memo: ML Paper implementation: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning

Outline

Implementation of MDLP (http://yaroslavvb.com/papers/fayyad-discretization.pdf)

The paper proposes a discretization method (a form of binning).

  1. Its purpose is to find cut points that convert numerical columns into categorical ones.
  2. To choose cut points, it appeals to Occam's razor: a simpler, shorter hypothesis (in terms of description length) is preferable.
  3. This leads to minimizing a statistic computed from a numerical column: length(P(H)) + length(P(H|T)).
  • length is measured by entropy
  • and the criterion also uses the target (class) information
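As a concrete illustration of step 1, here is a simplified sketch of how candidate cut points can be enumerated. The function name `candidate_cutpoints` is hypothetical, and this is not the repo's own implementation: it only considers midpoints between consecutive sorted values where the class label changes, which the paper shows are the only points worth evaluating.

```python
import numpy as np

def candidate_cutpoints(x, y):
    """Simplified sketch (illustrative, not the repo's code): candidate cuts
    are midpoints between consecutive sorted values whose class labels
    differ -- the paper proves only such boundary points need evaluation."""
    order = np.argsort(x, kind="stable")
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y)[order]
    return [
        float((xs[i - 1] + xs[i]) / 2)
        for i in range(1, len(xs))
        if xs[i] != xs[i - 1] and ys[i] != ys[i - 1]
    ]
```

Each returned midpoint would then be scored with the entropy-based criterion below, and only cuts passing the MDL test are kept.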

You can refer to the formulas below for more details (reconstructed here in the paper's notation, where a cut point T splits set S into S1 and S2, and k is the number of classes):

$$Ent(S) = -\sum_{i=1}^{k} P(C_i, S)\,\log_2 P(C_i, S)$$

$$E(A, T; S) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)$$

$$Gain(A, T; S) = Ent(S) - E(A, T; S)$$

A cut point T is accepted if and only if:

$$Gain(A, T; S) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N}$$

$$\Delta(A, T; S) = \log_2(3^k - 2) - \left[\,k\,Ent(S) - k_1\,Ent(S_1) - k_2\,Ent(S_2)\,\right]$$
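The MDLP acceptance test (accept a binary cut of S into S1 and S2 iff Gain > (log2(N − 1) + Δ) / N) can be sketched in a few lines. This is an illustrative re-implementation with hypothetical names (`entropy`, `mdlp_accepts`), not the repo's own `MDLP` class:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mdlp_accepts(y_left, y_right):
    """MDLP stopping criterion for one binary cut (illustrative sketch):
    accept iff Gain > (log2(N - 1) + Delta) / N, where
    Delta = log2(3^k - 2) - [k*Ent(S) - k1*Ent(S1) - k2*Ent(S2)]."""
    y = list(y_left) + list(y_right)
    n, n_l, n_r = len(y), len(y_left), len(y_right)
    ent_s, ent_l, ent_r = entropy(y), entropy(y_left), entropy(y_right)
    gain = ent_s - (n_l / n) * ent_l - (n_r / n) * ent_r
    k, k1, k2 = len(set(y)), len(set(y_left)), len(set(y_right))
    delta = np.log2(3**k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)
    return bool(gain > (np.log2(n - 1) + delta) / n)
```

For example, a cut that cleanly separates two classes passes the test, while a cut that leaves both sides with the same class mixture is rejected, which is what terminates the recursive splitting.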

Features

  • Supports multiprocessing via the `n_jobs` parameter
  • Follows the scikit-learn interface (`fit` / `transform` / `fit_transform`)

How to run

```python
from sklearn.datasets import load_iris
from discretization.mdlp import MDLP

data = load_iris(as_frame=True)  # any dataset with continuous features works here
X = data.data
mdlp = MDLP(con_features=data.feature_names, base=2, max_cutpoints=2, n_jobs=-1)
X_discretized = mdlp.fit_transform(X)
```

Random thoughts on the paper

  • Pros: discretization takes the target (class) information into account, which tends to improve downstream model performance.
  • Cons: if features are correlated, the discretized features can be redundant, so feature selection may be needed afterwards.

Results

You can check the results in the notebook files; model performance improves in most cases.
