How are sentences tokenized?

Question

MagedSaeed opened this issue 2 years ago

Thanks for the great software.

Just a question, so that I can tokenize my text accordingly: how are the sentence markers added internally, as mentioned in the docs? Are they added by splitting on \n?

Kenneth Heafield · Answer 1 · Sat Dec 31 2022 21:04:55 GMT+0800 (China Standard Time)

lmplz and query treat '\n' in the data as a sentence split. A sentence split implicitly conditions on <s> and appends </s>.
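
To make the line-per-sentence rule concrete, here is a minimal sketch in plain Python that mimics the behavior described above. It does not call KenLM and is not its implementation. The helper name `mark_sentences` is hypothetical, and both the whitespace tokenization and the skipping of blank lines are assumptions made for the illustration.

```python
def mark_sentences(text: str) -> list[list[str]]:
    """Illustrative only: treat each '\n'-separated line as one sentence
    and wrap it in <s> ... </s>. Hypothetical helper, not part of KenLM."""
    sentences = []
    for line in text.split("\n"):
        # Assumption: tokens within a line are separated by whitespace.
        tokens = line.split()
        if not tokens:
            # Simplification for this sketch: ignore empty lines.
            continue
        sentences.append(["<s>", *tokens, "</s>"])
    return sentences


corpus = "the cat sat\non the mat\n"
for sentence in mark_sentences(corpus):
    print(" ".join(sentence))
```

In practice this means the input file should contain one sentence per line, already tokenized, without writing <s> or </s> yourself.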

Maged Saeed · Answer 2 · Sun Jan 01 2023 23:26:24 GMT+0800 (China Standard Time)

Thanks for your reply and clarification @kpu