Suggestion: allow turning off punctuation normalization
avidale opened this issue
David Dale commented
I suggest adding a constructor argument `normalize_punct` to `LaserTokenizer`, with a default value of `True`, and running punctuation normalization and nonprintable character removal only if it is `True`.
This will make the implementation more consistent with the other flags for text normalization (`lower_case` and `descape`), and will allow experimenting with turning this step on and off.
This can be implemented in the same PR as the one that introduces the Perl compatibility flag.
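To make the proposal concrete, here is a rough sketch of how such a flag could gate the normalization step. It is not the actual `LaserTokenizer` code: the class `SketchTokenizer`, the helper `_remove_non_printable`, and the `_PUNCT_MAP` table are hypothetical stand-ins built on the standard library only, and the real punctuation normalization would stay whatever the library already uses.

```python
import html
import unicodedata

# Hypothetical stand-in for punctuation normalization: map a few typographic
# characters to ASCII equivalents. The real tokenizer's rules may differ.
_PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
})


def _remove_non_printable(text: str) -> str:
    # Drop every character whose Unicode category starts with "C"
    # (control, format, unassigned, ...). This also drops tabs and newlines.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


class SketchTokenizer:
    """Illustrative only; mirrors the flag layout proposed in this issue."""

    def __init__(self, lower_case=True, descape=False, normalize_punct=True):
        self.lower_case = lower_case
        self.descape = descape
        self.normalize_punct = normalize_punct  # proposed flag, default True

    def preprocess(self, text: str) -> str:
        if self.descape:
            text = html.unescape(text)
        # Both steps named in the proposal are skipped when the flag is False.
        if self.normalize_punct:
            text = _remove_non_printable(text)
            text = text.translate(_PUNCT_MAP)
        if self.lower_case:
            text = text.lower()
        return text
```

With the default `normalize_punct=True` the behaviour stays as it is today, so existing callers are unaffected. Passing `normalize_punct=False` leaves typographic punctuation and nonprintable characters in the text, which is what makes the on/off comparison possible.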