astariul / Sentencize.jl

Smallish library for sentence splitting in Julia

Sentencize.jl

Text-to-sentence splitter using a heuristic algorithm.

This module is a port of the Python package sentence-splitter.

The module allows splitting of text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the Europarl corpus.
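As a rough illustration of what a punctuation-and-capitalization heuristic looks like, here is a hypothetical sketch in plain Julia. This is not the package's actual implementation, and `naive_split` is an invented name:

```julia
# Naive sketch of a heuristic splitter: break after '.', '!' or '?' when
# followed by whitespace and an uppercase letter or an opening quote.
# The real algorithm additionally handles non-breaking prefixes
# ("Dr.", "Inc.", ...), which this sketch would split incorrectly.
naive_split(text::AbstractString) =
    split(text, r"(?<=[.!?])\s+(?=[\"A-Z])")

sentences = String.(naive_split("This is a paragraph. It has two sentences."))
println(sentences)
# ["This is a paragraph.", "It has two sentences."]
```

The non-breaking prefix handling is what separates the real heuristic from this sketch: without it, abbreviations like "Inc." would incorrectly end a sentence.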

Usage

The module uses punctuation and capitalization clues to split plain text into a list of sentences:

import Sentencize

sen = Sentencize.split_sentence("This is a paragraph. It contains several sentences. \"But why,\" you ask?")
println(sen)
# ["This is a paragraph.", "It contains several sentences.", "\"But why,\" you ask?"]

You can specify a language other than English:

sen = Sentencize.split_sentence("Brookfield Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles directement...", lang="fr")
println(sen)
# ["Brookfield Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles directement..."]

You can specify your own non-breaking prefixes file:

sen = Sentencize.split_sentence("This is an example.", prefix_file="my_prefixes.txt", lang=missing)
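For reference, non-breaking prefix files in the original sentence-splitter/Moses tooling use a plain-text format: one prefix per line, written without the trailing period, with `#`-comments and an optional `#NUMERIC_ONLY#` marker for prefixes that only block a split when followed by a number. A minimal illustrative file (the entries below are examples, not the package's defaults):

```text
# Titles that should not end a sentence:
Mr
Dr
Prof
# "No. 5" should not split, but "No." elsewhere may:
No #NUMERIC_ONLY#
```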

Or even pass the prefixes as a dictionary:

sen = Sentencize.split_sentence("This is another example. Another sentence.", prefixes=Dict("example" => Sentencize.default))
# ["This is another example. Another sentence."]

Languages

Currently supported languages are:

  • Catalan (ca)
  • Czech (cs)
  • Danish (da)
  • Dutch (nl)
  • English (en)
  • Finnish (fi)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hungarian (hu)
  • Icelandic (is)
  • Italian (it)
  • Latvian (lv)
  • Lithuanian (lt)
  • Norwegian (Bokmål) (no)
  • Polish (pl)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Slovak (sk)
  • Slovene (sl)
  • Spanish (es)
  • Swedish (sv)
  • Turkish (tr)

License: Other