dmlc / gluon-nlp

NLP made easy

Home Page: https://nlp.gluon.ai/

strip_accents should be None by default in WordPiece

sxjscience opened this issue · comments

Description

@leezu @szha @xinyual I noticed that we may need to change the default of strip_accents to None in

strip_accents: bool = False, lowercase: bool = False,

so that accent stripping is turned on automatically when lowercase is True.

This may impact the performance.
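
To make the proposal concrete, here is a minimal sketch of how a None default could fall back to the value of lowercase. The helper name resolve_strip_accents is hypothetical and is not part of gluon-nlp; only the two parameter names come from the signature quoted above.

```python
from typing import Optional


def resolve_strip_accents(strip_accents: Optional[bool] = None,
                          lowercase: bool = False) -> bool:
    """Hypothetical helper: decide whether accents should be stripped.

    None means "not specified", in which case the setting follows
    `lowercase`. An explicit True/False always wins.
    """
    if strip_accents is None:
        return lowercase
    return strip_accents
```

With this behaviour, a caller who only passes lowercase=True gets accent stripping as well, while passing strip_accents=False explicitly still keeps accents.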

However, accents carry meaning in many languages, e.g., mochte vs. möchte in German. Thus, we may try to turn it off in nlp_process.
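
As a small illustration of why this matters, the sketch below strips accents with the Python standard library by decomposing the text and dropping combining marks. This is an assumed, generic implementation for demonstration, not the code path used by gluon-nlp or nlp_process.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose each character (NFD) so that "ö" becomes "o" followed by a
    # combining diaeresis, then drop all combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("möchte"))  # mochte
```

After stripping, the two German words become the same string, so the tokenizer can no longer tell them apart.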

> Thus, we may try to turn it off in nlp_process.

Do you mean exposing an option in nlp_process or changing the defaults in nlp_process? English is a special case that doesn't care much about accents, so I suggest we keep an option to preserve accents in nlp_process.