IndicoDataSolutions / finetune

Scikit-learn style model finetuning for NLP

Home Page: https://finetune.indico.io


Will English text be split into subword pieces in this project?

c0derm4n opened this issue · comments


When I use BERT as a Classifier, will the input text be tokenized with the `wordpiece_tokenizer`, or just split on spaces?

When I check the code, `Classifier(base_model=RoBERTa)` uses the `RoBERTaEncoder(GPT2Encoder)` class to encode the input text, which does not appear to use the wordpiece tokenizer mentioned above. Am I right?
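For context on what "wordpiece" tokenization means here, below is a minimal toy sketch (not finetune's actual implementation, and using a made-up vocabulary) of WordPiece-style greedy longest-match splitting. It shows that subword tokenizers do not simply cut on spaces: a word is broken into the longest vocabulary pieces available, with continuation pieces prefixed by `##`:

```python
# Toy illustration only -- NOT finetune's code. WordPiece-style greedy
# longest-match tokenization over a tiny hypothetical vocabulary.

def wordpiece_tokenize(word, vocab):
    """Greedily split a word into the longest subwords present in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a
        # vocabulary entry matches.
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces get a ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched: emit the unknown token
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for demonstration.
vocab = {"fine", "##tun", "##ing", "token", "##izer"}
print(wordpiece_tokenize("finetuning", vocab))  # ['fine', '##tun', '##ing']
```

By contrast, GPT-2/RoBERTa-style encoders use byte-level BPE merges rather than this `##`-prefixed vocabulary lookup, which is why `RoBERTaEncoder` subclasses `GPT2Encoder` instead of a WordPiece tokenizer.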