Will English text be piece-cut in this project?
c0derm4n opened this issue · comments
When I use BERT as a Classifier, will the input text be tokenized with the `wordpiece_tokenizer`, or just split on spaces?
BERT does use the wordpiece tokenizer. See: https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/base_models/bert/tokenizer.py#L177
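For intuition, here is a minimal sketch of how WordPiece splits a word using greedy longest-match-first lookup. The vocabulary below is a toy example for illustration only, not BERT's real vocabulary, and `wordpiece_tokenize` is a hypothetical helper, not a function from the finetune codebase.

```python
# Toy sketch of WordPiece tokenization (greedy longest-match-first).
# Continuation pieces carry the "##" prefix, as in BERT's vocab files.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, shrinking until a hit
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

So the input is not split on spaces alone: each word can be broken into multiple subword pieces drawn from the vocabulary.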
When I check the code, `Classifier(base_model=RoBERTa)` uses `class RoBERTaEncoder(GPT2Encoder)` to encode the input text, which is unrelated to the wordpiece tokenizer mentioned above. Am I right?
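That matches my reading: RoBERTa inherits GPT-2's byte-pair-encoding (BPE) scheme, which merges character pairs by learned rank rather than matching against a wordpiece vocabulary. Here is a toy sketch of BPE merging; the merge table and `bpe_tokenize` helper are illustrative assumptions, not GPT-2's real merges or finetune code.

```python
# Toy sketch of BPE tokenization: repeatedly merge the adjacent symbol
# pair with the lowest (best) learned merge rank until none applies.
def bpe_tokenize(word, merges):
    symbols = list(word)
    while len(symbols) > 1:
        # rank every adjacent pair; unknown pairs get infinite rank
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no learned merge applies to any remaining pair
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# toy merge table: (pair) -> rank, lower rank = applied earlier
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}
print(bpe_tokenize("lower", merges))   # ['lower']
print(bpe_tokenize("lowest", merges))  # ['low', 'e', 's', 't']
```

The end result is similar in spirit to WordPiece (subword units), but the vocabularies and split points differ, so BERT and RoBERTa inputs are not interchangeable.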