kpu / kenlm

KenLM: Faster and Smaller Language Model Queries

Home Page: http://kheafield.com/code/kenlm/

How are sentences tokenized?

MagedSaeed opened this issue

Thanks for the great software.

Just a question, so that I can tokenize my text accordingly: how are the sentence markers added internally, as mentioned in the docs? Are they added wherever the text is split on \n?

lmplz and query treat '\n' in the data as a sentence split. A sentence split implicitly conditions on <s> and appends </s>.
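To make that concrete, here is a minimal pure-Python sketch of the behavior described above. It is not KenLM's code and does not call KenLM; the function name `sentences_as_seen_by_lm` is hypothetical, and splitting tokens on whitespace and skipping blank lines are assumptions made for the illustration.

```python
def sentences_as_seen_by_lm(text):
    """Illustrate how '\n' delimits sentences and where <s> / </s> end up.

    Assumptions (not taken from KenLM's source):
      - tokens within a line are separated by whitespace
      - blank lines are skipped in this sketch
    """
    sentences = []
    for line in text.split("\n"):      # '\n' is the sentence split
        tokens = line.split()          # assumed whitespace tokenization
        if not tokens:
            continue                   # assumption: ignore empty lines
        # <s> is context that the first word is conditioned on;
        # </s> is appended as the final token of the sentence.
        sentences.append(["<s>"] + tokens + ["</s>"])
    return sentences


if __name__ == "__main__":
    corpus = "the cat sat\nhello world\n"
    for sentence in sentences_as_seen_by_lm(corpus):
        print(" ".join(sentence))
```

Since the markers are implicit, this suggests preparing the input with one sentence per line and without writing <s> or </s> in the text yourself.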

Thanks for your reply and clarification @kpu