wolfgarbe / SymSpellCompound

SymSpellCompound: compound aware automatic spelling correction

Home Page: https://seekstorm.com/blog/sub-millisecond-compound-aware-automatic.spelling-correction/

SymSpellCompound

SymSpellCompound has been integrated into SymSpell. Please visit the SymSpell repository!


Compound aware automatic spelling correction

SymSpellCompound supports compound aware automatic spelling correction of multi-word input strings.
It is built on top of SymSpell's 1 million times faster spelling correction algorithm.

1. Compound splitting & decompounding

SymSpell treated every input string as a single term. SymSpellCompound adds compound splitting / decompounding and handles three cases:

  1. mistakenly inserted space within a correct word led to two incorrect terms
  2. mistakenly omitted space between two correct words led to one incorrect combined term
  3. multiple input terms with/without spelling errors

Splitting errors, concatenation errors, substitution errors, transposition errors, deletion errors and insertion errors can be mixed within the same word.
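
The three cases above can be illustrated with a small Python sketch. It is not SymSpellCompound's implementation: it uses a brute-force Levenshtein scan over a toy dictionary instead of SymSpell's lookup, and a simple dynamic program over the input terms. The dictionary `WORD_FREQ`, the function names, and the cost model (one edit per inserted or removed space, `max_dist + 1` for a term left unchanged) are all assumptions made for illustration.

```python
# Toy frequency dictionary (hypothetical values, for illustration only).
WORD_FREQ = {"where": 500, "is": 900, "the": 1000, "love": 300,
             "inspired": 50, "him": 400}

def edit_distance(a, b):
    """Plain Levenshtein distance (no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def best_single(term, max_dist=2):
    """Closest dictionary word as (distance, word); ties go to the more frequent word."""
    dist, _, word = min((edit_distance(term, w), -f, w) for w, f in WORD_FREQ.items())
    return (dist, word) if dist <= max_dist else None

def best_split(term, max_dist=2):
    """Case 2: try every split position; the missing space counts as one edit."""
    best = None
    for i in range(1, len(term)):
        left = best_single(term[:i], max_dist)
        right = best_single(term[i:], max_dist)
        if left and right:
            cand = (left[0] + right[0] + 1, left[1] + " " + right[1])
            if best is None or cand < best:
                best = cand
    return best

def options_for(terms, i, max_dist):
    """Yield (cost, terms_consumed, replacement) candidates starting at terms[i]."""
    term = terms[i]
    yield (max_dist + 1, 1, term)              # fallback: keep the term unchanged
    single = best_single(term, max_dist)       # case 3: ordinary single-term correction
    if single:
        yield (single[0], 1, single[1])
    split = best_split(term, max_dist)         # case 2: omitted space
    if split:
        yield (split[0], 1, split[1])
    if i + 1 < len(terms):                     # case 1: wrongly inserted space
        merged = best_single(term + terms[i + 1], max_dist)
        if merged:
            yield (merged[0] + 1, 2, merged[1])

def correct(text, max_dist=2):
    """Return (corrected text, total cost) with the lowest total cost."""
    terms = text.lower().split()
    # best[i] holds (total cost, corrected words) for the suffix terms[i:].
    best = [None] * len(terms) + [(0, [])]
    for i in range(len(terms) - 1, -1, -1):
        best[i] = min((cost + best[i + used][0], [rep] + best[i + used][1])
                      for cost, used, rep in options_for(terms, i, max_dist))
    return " ".join(best[0][1]), best[0][0]

print(correct("whereis th elove"))  # ('where is the love', 3)
print(correct("ins pired him"))     # ('inspired him', 1)
```

The costs printed here follow the sketch's own cost model and are not meant to reproduce the edit counts reported in the examples further down.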

2. Automatic spelling correction

  • Large document collections make manual correction infeasible and require unsupervised, fully-automatic spelling correction.
  • In conventional spelling correction of a single token, the user is presented with spelling correction suggestions.
    For automatic spelling correction of long multi-word text, the algorithm itself has to make an educated choice.

Examples:

- whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
+ where is the love he had dated for much of the past who couldn't read in sixth grade and inspired him  (9 edits)

- in te dhird qarter oflast jear he hadlearned ofca sekretplan y iran
+ in the third quarter of last year he had learned of a secret plan by iran  (10 edits)

- the bigjest playrs in te strogsommer film slatew ith plety of funn
+ the biggest players in the strong summer film slate with plenty of fun  (9 edits)

- Can yu readthis messa ge despite thehorible sppelingmsitakes
+ can you read this message despite the horrible spelling mistakes  (9 edits)

Performance

0.2 milliseconds / word
5000 words / second (single core on 2012 MacBook Pro)

Applications

  • Query correction (10–15% of queries contain misspelled terms),
  • Chatbots,
  • OCR post-processing,
  • Automated proofreading.