facebookresearch / LASER

Language-Agnostic SEntence Representations

Suggestion: allow turning off punctuation normalization

avidale opened this issue

I suggest adding a constructor argument normalize_punct to LaserTokenizer, with a default value of True, and running punctuation normalization and non-printable character removal only if it is True.

This would make the implementation more consistent with the other text normalization flags (lower_case and descape), and would allow experimenting with this step turned on or off.

This can be implemented in the same PR as the one that introduces the Perl compatibility flag.
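
To make the suggestion concrete, here is a minimal sketch of how such a flag could gate the normalization step. It is not LASER's actual code: the class SketchTokenizer, its preprocess method, the small punctuation map, and the use of Unicode categories to detect non-printable characters are all assumptions made for illustration. Only the flag names (lower_case, descape, normalize_punct) and the default of True for normalize_punct come from the suggestion above; the other two defaults are also assumed.

```python
import html
import unicodedata

# Hypothetical, deliberately tiny punctuation map (an assumption, not the
# rule set LASER uses): curly quotes and dashes are mapped to ASCII.
_PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-",
})


class SketchTokenizer:
    """Illustrative stand-in for a tokenizer's text normalization step."""

    def __init__(self, lower_case=True, descape=False, normalize_punct=True):
        self.lower_case = lower_case
        self.descape = descape
        self.normalize_punct = normalize_punct

    def preprocess(self, text):
        if self.descape:
            # Turn HTML entities such as &amp; back into characters.
            text = html.unescape(text)
        if self.normalize_punct:
            # Both steps run only when the flag is on, as proposed.
            text = text.translate(_PUNCT_MAP)
            # Drop control/format characters (Unicode categories C*).
            text = "".join(
                ch for ch in text
                if not unicodedata.category(ch).startswith("C")
            )
        if self.lower_case:
            text = text.lower()
        return text


sample = "\u201cHello\u201d\u200b World"
default_out = SketchTokenizer().preprocess(sample)
raw_out = SketchTokenizer(normalize_punct=False).preprocess(sample)
```

With the default settings the curly quotes become ASCII quotes and the zero-width space is dropped; with normalize_punct=False both survive, and only lowercasing is applied.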