Detect and normalize poorly encoded categorical splits
Villu Ruusmann (vruusmann) opened this issue and commented:
When categorical features are binarized (e.g. using LabelBinarizer) instead of integer-encoded (e.g. using LabelEncoder), splits on categorical values are encoded as double comparisons against the reference value 1.0000000180025095E-35. This is a tiny positive threshold, not the smallest positive 64-bit value (which is about 4.9E-324); it appears to be 1e-35 stored as a 32-bit float and then widened to double:
<Node id="8" score="0.0745191789134865" recordCount="39">
<SimplePredicate field="lookup(Employment)" operator="lessOrEqual" value="1.0000000180025095E-35"/>
</Node>
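To see where such a reference value could come from, here is a minimal sketch that round-trips 1e-35 through a 32-bit float using only the Python standard library. The helper name `widen_float32` is hypothetical, and the assumption that the threshold starts out as a 32-bit 1e-35 is a guess based on the digits, not something confirmed by the converter's source.

```python
import struct

def widen_float32(value: float) -> float:
    """Round a Python float (64-bit) to 32-bit precision, then widen it back to 64-bit."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# Assumption: the tree threshold was originally the 32-bit float nearest to 1e-35.
threshold = widen_float32(1e-35)

# The widened value is close to, but not exactly, the 64-bit 1e-35.
print(repr(threshold))
```

The printed value carries the extra digits that a 32-bit rounding error leaves behind once it is rendered at 64-bit precision.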
It would be much more transparent and space-efficient to encode the same split as an integer comparison against the reference values 0 and 1:
<Node id="8" score="0.0745191789134865" recordCount="39">
<SimplePredicate field="lookup(Employment)" operator="equal" value="0"/>
</Node>
and/or:
<Node id="8" score="0.0745191789134865" recordCount="39">
<SimplePredicate field="lookup(Employment)" operator="notEqual" value="1"/>
</Node>
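The rewrite proposed above could be sketched roughly as follows. This is only an illustration, not the converter's actual implementation: the function name `normalize_binary_split` and the `binary_fields` argument are hypothetical, and the sketch assumes the caller already knows which fields hold strictly 0/1 values, since the rewrite is only valid for those.

```python
import xml.etree.ElementTree as ET

def normalize_binary_split(predicate, binary_fields):
    """Rewrite a float threshold test on a 0/1 field as an integer equality test.

    Modifies the element in place and returns True if it was rewritten.
    """
    # Strip an optional XML namespace prefix such as "{http://...}".
    local_name = predicate.tag.split("}")[-1]
    if local_name != "SimplePredicate":
        return False
    # Assumption: only fields listed by the caller are known to be binary.
    if predicate.get("field") not in binary_fields:
        return False
    try:
        value = float(predicate.get("value", ""))
    except ValueError:
        return False
    # Any threshold in [0, 1) separates the value 0 from the value 1.
    if not 0.0 <= value < 1.0:
        return False
    operator = predicate.get("operator")
    if operator == "lessOrEqual":
        # x <= threshold holds only for x == 0.
        predicate.set("operator", "equal")
        predicate.set("value", "0")
        return True
    if operator == "greaterThan":
        # x > threshold holds only for x == 1.
        predicate.set("operator", "equal")
        predicate.set("value", "1")
        return True
    return False

node = ET.fromstring(
    '<Node id="8" score="0.0745191789134865" recordCount="39">'
    '<SimplePredicate field="lookup(Employment)" operator="lessOrEqual" '
    'value="1.0000000180025095E-35"/>'
    '</Node>'
)
changed = normalize_binary_split(node[0], {"lookup(Employment)"})
print(ET.tostring(node[0], encoding="unicode"))
```

Emitting `equal` against 0 or 1 rather than `notEqual` is an arbitrary choice in this sketch; for a strictly binary field the two forms are equivalent.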