zaemyung / sentsplit

A flexible sentence segmentation library using a CRF model and regex rules

Add heuristics to determine sentence split position when max_len is reached

zaemyung opened this issue

Example

A strike is "not in the offing" - William J. Fallon Admiral William J. Fallon, the commanding officer of United States Central Command which is responsible for the Middle East, East Africa and Central Asia, speaking in Monday's Financial Times, said that a strike againt Iran is "not in the offing." "None of this is helped by the continuing stories that just keep going around and around and around that any day now there will be another war which is just not where we want to go," Fallon continued. ...

A strike is "not in the offing" - William J. Fallon Admiral William J. Fallon, the commanding officer of United States Central Command which is responsible for the Middle East, East Africa and Central Asia, speaking in Monday's Financial Times, said that a strike againt Iran is "not in the offing." "None of this is helped by the continuing stories that just keep going around and around and around that any day now there will be another war which is just not where we want to go," Fallon continue [SPLIT HERE because max_len = 500 was reached]

d. ~

TODO

  • When max_len is reached, heuristically determine the position of segmentation.
  • For example, the sentence could be segmented right after not in the offing." or right after we want to go,"
  • It should avoid splitting in the middle of a word.
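
One possible shape for this heuristic, as a rough sketch rather than anything that exists in sentsplit today: look backwards from max_len for the best available boundary, preferring the end of a sentence (optionally followed by a closing quote or bracket), then the end of a clause, then any whitespace, and fall back to a hard cut only when none of these is found. The function name find_split_position, the pattern list, and the preference order are all assumptions made for illustration.

```python
import re

# Hypothetical helper for illustration; not part of sentsplit's API.
# Optional closing quotes/brackets that may trail a punctuation mark.
_CLOSERS = r"""["')\]]*"""

# Candidate boundaries, most preferred first (the ordering is an assumption).
_BOUNDARY_PATTERNS = [
    re.compile(r"[.!?]" + _CLOSERS + r"(?=\s)"),  # sentence end, e.g. offing."
    re.compile(r"[,;:]" + _CLOSERS + r"(?=\s)"),  # clause end, e.g. go,"
    re.compile(r"\s"),                            # any single whitespace char
]


def find_split_position(text: str, max_len: int) -> int:
    """Return an index <= max_len at which `text` can be cut.

    The cut falls right after the last preferred boundary that fits
    within max_len, so it never lands in the middle of a word unless
    the window contains no whitespace at all.
    """
    if len(text) <= max_len:
        return len(text)

    # Keep one extra character so the lookahead can see whitespace
    # sitting exactly at index max_len.
    window = text[: max_len + 1]

    for pattern in _BOUNDARY_PATTERNS:
        best = None
        for match in pattern.finditer(window):
            if 0 < match.end() <= max_len:
                best = match.end()
        if best is not None:
            return best

    # No boundary found at all: fall back to the current hard cut.
    return max_len


if __name__ == "__main__":
    sample = 'A strike is "not in the offing." "None of this is helped," Fallon continued.'
    cut = find_split_position(sample, max_len=72)
    print(repr(sample[:cut]))
    print(repr(sample[cut:].lstrip()))
```

A real implementation would probably also want a lower bound on the segment length, so that a sentence boundary very early in the window does not produce a tiny fragment, and might weigh candidates using the CRF scores instead of a fixed preference order.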