Will English text be piece-cut in this project?
c0derm4n opened this issue · comments
When I use BERT as a Classifier, will the input text be tokenized with the `wordpiece_tokenizer`, or just split on spaces?
BERT does use the wordpiece tokenizer. See: https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/base_models/bert/tokenizer.py#L177
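For intuition, here is a minimal sketch of how WordPiece splits a word using greedy longest-match-first lookup. The vocabulary below is a toy example for illustration only, not BERT's real vocabulary, and `wordpiece_tokenize` is a hypothetical helper, not a function from the finetune codebase.

```python
# Toy sketch of WordPiece tokenization (greedy longest-match-first).
# Continuation pieces carry the "##" prefix, as in BERT's vocab files.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, shrinking until a hit
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

So the input is not split on spaces alone: each word can be broken into multiple subword pieces drawn from the vocabulary.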
When I check the code, `Classifier(base_model=RoBERTa)` uses `class RoBERTaEncoder(GPT2Encoder)` to encode the input text, which is unrelated to the wordpiece tokenizer mentioned above. Am I right?
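That matches my reading: RoBERTa inherits GPT-2's byte-pair-encoding (BPE) scheme, which merges character pairs by learned rank rather than matching against a wordpiece vocabulary. Here is a toy sketch of BPE merging; the merge table and `bpe_tokenize` helper are illustrative assumptions, not GPT-2's real merges or finetune code.

```python
# Toy sketch of BPE tokenization: repeatedly merge the adjacent symbol
# pair with the lowest (best) learned merge rank until none applies.
def bpe_tokenize(word, merges):
    symbols = list(word)
    while len(symbols) > 1:
        # rank every adjacent pair; unknown pairs get infinite rank
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no learned merge applies to any remaining pair
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# toy merge table: (pair) -> rank, lower rank = applied earlier
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}
print(bpe_tokenize("lower", merges))   # ['lower']
print(bpe_tokenize("lowest", merges))  # ['low', 'e', 's', 't']
```

The end result is similar in spirit to WordPiece (subword units), but the vocabularies and split points differ, so BERT and RoBERTa inputs are not interchangeable.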