Detect and normalize poorly encoded categorical splits

Question

vruusmann opened this issue 5 years ago

When categorical features are binarized (e.g. using LabelBinarizer) instead of integer-encoded (e.g. using LabelEncoder), splits on categorical values are encoded as double comparisons against the reference value 1.0000000180025095E-35 (a tiny positive value that serves as a zero threshold; it appears to be the single-precision constant 1e-35 widened to a 64-bit double, not the smallest positive 64-bit value, which is about 4.9E-324):

<Node id="8" score="0.0745191789134865" recordCount="39">
    <SimplePredicate field="lookup(Employment)" operator="lessOrEqual" value="1.0000000180025095E-35"/>
</Node>

It would be much more transparent and space-efficient to encode the same splits as integer comparisons against the reference values 0 and 1:

<Node id="8" score="0.0745191789134865" recordCount="39">
    <SimplePredicate field="lookup(Employment)" operator="equal" value="0"/>
</Node>

and/or:

<Node id="8" score="0.0745191789134865" recordCount="39">
    <SimplePredicate field="lookup(Employment)" operator="notEqual" value="1"/>
</Node>
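
To make the proposal concrete, here is a minimal Python sketch of such a normalization pass, written as a post-processing step over the PMML tree using only the standard library. It is an illustration under stated assumptions, not how any converter actually implements this: the names `normalize_predicate`, `normalize_tree`, `ZERO_THRESHOLD` and the `binary_fields` argument are hypothetical, the cutoff of 1e-30 is an arbitrary choice, and the caller is assumed to know which fields are 0/1 indicator columns.

```python
import xml.etree.ElementTree as ET

# Assumption: a positive threshold below this cutoff is a stand-in for 0.
ZERO_THRESHOLD = 1e-30

# On a 0/1 indicator, "x <= tiny" means x == 0 and "x > tiny" means x == 1.
REWRITES = {
    "lessOrEqual": "0",
    "lessThan": "0",
    "greaterThan": "1",
    "greaterOrEqual": "1",
}


def normalize_predicate(predicate, binary_fields):
    """Rewrite a near-zero double comparison as an integer equality, in place.

    Returns True if the predicate was rewritten, False if it was left alone.
    """
    if predicate.get("field") not in binary_fields:
        return False
    try:
        value = float(predicate.get("value"))
    except (TypeError, ValueError):
        return False  # missing or non-numeric value
    if not (0.0 < value < ZERO_THRESHOLD):
        return False
    new_value = REWRITES.get(predicate.get("operator"))
    if new_value is None:
        return False
    predicate.set("operator", "equal")
    predicate.set("value", new_value)
    return True


def normalize_tree(root, binary_fields):
    """Normalize every SimplePredicate under root; return the rewrite count."""
    count = 0
    for element in root.iter():
        # Compare local names so that namespaced PMML documents also match.
        if element.tag.rsplit("}", 1)[-1] == "SimplePredicate":
            if normalize_predicate(element, binary_fields):
                count += 1
    return count


snippet = """<Node id="8" score="0.0745191789134865" recordCount="39">
    <SimplePredicate field="lookup(Employment)" operator="lessOrEqual" value="1.0000000180025095E-35"/>
</Node>"""

node = ET.fromstring(snippet)
changed = normalize_tree(node, {"lookup(Employment)"})
print(changed, ET.tostring(node, encoding="unicode"))
```

Run on the first snippet above, this rewrites the predicate to the `equal` / `0` form. Predicates on fields not listed in `binary_fields` are left untouched, since the rewrite is only valid when the field can take no values other than 0 and 1.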