How are sentences tokenized?
MagedSaeed opened this issue
Maged Saeed commented
Thanks for the great software.
Just one question, so that I can tokenize my text accordingly: how are the sentence markers added internally, as mentioned in the docs? Are they added wherever the text is split on \n?
Kenneth Heafield commented
lmplz and query treat '\n' in the data as a sentence split. A sentence split implicitly conditions on <s> and appends </s>.
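As a rough illustration of the behavior described above, here is a minimal Python sketch, not KenLM's actual code. The function name `sentences_with_markers` is hypothetical, and both the whitespace tokenization and the skipping of blank lines are assumptions made for this example. It only shows where <s> and </s> conceptually sit when each '\n' ends a sentence; in the real tools <s> is context that is conditioned on, not a token you write into the file yourself.

```python
def sentences_with_markers(text):
    """Yield one token list per input line, wrapped in <s> ... </s>.

    Hypothetical helper for illustration only: it mimics the idea that
    every '\n' ends a sentence, that each sentence starts from <s>
    as context, and that </s> is appended at the end.
    """
    for line in text.split("\n"):
        tokens = line.split()  # assumption: tokens are whitespace-separated
        if not tokens:
            # Skipping blank lines is a choice of this sketch,
            # not a claim about how KenLM treats them.
            continue
        yield ["<s>"] + tokens + ["</s>"]


corpus = "the cat sat\nit purred\n"
marked = list(sentences_with_markers(corpus))
for sentence in marked:
    print(" ".join(sentence))
```

In practice this means the input file should contain one sentence per line, without writing the markers by hand.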
Maged Saeed commented
Thanks for your reply and clarification @kpu