whackashoe / SymSpellCompound

SymSpellCompound: compound aware automatic spelling correction

Home Page: https://medium.com/@wolfgarbe/symspellcompound-10ec8f467c9b

SymSpellCompound

Compound aware automatic spelling correction

SymSpellCompound supports compound aware automatic spelling correction of multi-word input strings.
It is built on top of SymSpell's 1 million times faster spelling correction algorithm.

1. Compound splitting & decompounding

SymSpell treats every input string as a single term. SymSpellCompound supports compound splitting / decompounding and handles three cases:

  1. a space mistakenly inserted within a correct word, producing two incorrect terms
  2. a space mistakenly omitted between two correct words, producing one incorrect combined term
  3. multiple input terms with/without spelling errors

Splitting errors, concatenation errors, substitution errors, transposition errors, deletion errors and insertion errors can be mixed within the same word.
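The first two cases can be illustrated with a minimal Python sketch. It is not the repository's C# code: the toy dictionary `WORDS` and the helper names `split_term` and `correct_compound` are hypothetical, and the sketch uses exact dictionary lookups only, so it does not cover terms that also contain spelling errors (case 3).

```python
# Toy frequency dictionary (hypothetical data): word -> count.
WORDS = {"where": 50, "is": 90, "the": 100, "love": 30,
         "in": 80, "inspired": 5, "him": 20}

def split_term(term, words):
    """Case 2: split one unknown term into two known words, if possible."""
    best_score, best_parts = 0, None
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        if left in words and right in words:
            # Prefer the split whose rarer half is most frequent.
            score = min(words[left], words[right])
            if score > best_score:
                best_score, best_parts = score, [left, right]
    return best_parts

def correct_compound(text, words=WORDS):
    terms = text.lower().split()
    out, i = [], 0
    while i < len(terms):
        term = terms[i]
        if term in words:
            out.append(term)
            i += 1
            continue
        # Case 1: a stray space split one word into two unknown terms.
        if i + 1 < len(terms) and term + terms[i + 1] in words:
            out.append(term + terms[i + 1])
            i += 2
            continue
        # Case 2: a missing space fused two words into one term.
        parts = split_term(term, words)
        out.extend(parts if parts else [term])  # keep unknown terms as-is
        i += 1
    return " ".join(out)
```

With this toy dictionary, `correct_compound("whereis the love")` splits the first term and `correct_compound("ins pired him")` merges the first two terms.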

2. Automatic spelling correction

  • Large document collections make manual correction infeasible and require unsupervised, fully-automatic spelling correction.
  • In conventional spelling correction of a single token, the user is presented with spelling correction suggestions.
    For automatic spelling correction of long multi-word text, the algorithm itself has to make an educated choice.
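One plausible way to make that choice automatically is to rank candidates by edit distance first and word frequency second. The Python sketch below assumes this rule; the `Suggestion` type and `pick_best` function are hypothetical names for illustration, not the library's API.

```python
from typing import NamedTuple, Optional, Sequence

class Suggestion(NamedTuple):
    term: str       # candidate correction
    distance: int   # edit distance from the input term
    count: int      # frequency of the candidate in the dictionary

def pick_best(suggestions: Sequence[Suggestion]) -> Optional[Suggestion]:
    """Return the closest candidate, breaking ties by higher frequency."""
    if not suggestions:
        return None
    return min(suggestions, key=lambda s: (s.distance, -s.count))
```

An interactive spell checker would show the whole ranked list to the user; here only the top-ranked candidate is kept.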

Examples:

- whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
+ where is the love he had dated for much of the past who couldn't read in sixth grade and inspired him  (9 edits)

- in te dhird qarter oflast jear he hadlearned ofca sekretplan y iran
+ in the third quarter of last year he had learned of a secret plan by iran  (10 edits)

- the bigjest playrs in te strogsommer film slatew ith plety of funn
+ the biggest players in the strong summer film slate with plenty of fun  (9 edits)

- Can yu readthis messa ge despite thehorible sppelingmsitakes
+ can you read this message despite the horrible spelling mistakes  (9 edits)

Performance

0.2 milliseconds / word
5000 words / second (single core on 2012 MacBook Pro)

Applications

Query correction, OCR post-processing, orthographic quality assessment, agent & chat bot conversation.

Frequency dictionary

The word frequency list was created by intersecting two word lists. Through this reciprocal filtering, only words that appear in both lists are used. Additional filters were applied, and the resulting list was truncated to the ≈ 80,000 most frequent words.
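The intersect-and-truncate step can be sketched in Python as follows. The sketch assumes that one list supplies frequency counts and the other a plain vocabulary, and that both are already lowercased; the function name `build_frequency_dictionary` is hypothetical, and the additional filters mentioned above are not modeled.

```python
def build_frequency_dictionary(frequencies, vocabulary, max_words=80_000):
    """Keep only words present in both inputs, then keep the most frequent.

    frequencies: dict mapping word -> count (assumed input shape)
    vocabulary:  iterable of accepted words (assumed input shape)
    """
    vocab = set(vocabulary)
    kept = {w: c for w, c in frequencies.items() if w in vocab}
    # Sort by descending count; break ties alphabetically for stable output.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:max_words])
```

For example, with counts `{"the": 100, "teh": 7, "love": 30}` and vocabulary `["the", "love", "zebra"]`, the misspelling `"teh"` is dropped because it is missing from the vocabulary, and `"zebra"` is dropped because it has no count.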

Blog Posts: Algorithm, Benchmarks, Applications

1000x Faster Spelling Correction algorithm
1000x Faster Spelling Correction: Source Code released
Fast approximate string matching with large edit distances in Big Data
Very fast Data cleaning of product names, company names & street names

Copyright (C) 2017 Wolf Garbe
Version: 1.0
Author: Wolf Garbe <wolf.garbe@faroo.com>
Maintainer: Wolf Garbe <wolf.garbe@faroo.com>
License:
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License, 
version 3.0 (LGPL-3.0) as published by the Free Software Foundation.
http://www.opensource.org/licenses/LGPL-3.0

Usage:
  multiple words + Enter: Display spelling suggestions
  Enter without input: Terminate the program
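The interaction described above amounts to a simple read loop. A minimal Python sketch, assuming a caller-supplied `correct` function (a hypothetical stand-in for the actual correction routine):

```python
def run_console(lines, correct, write=print):
    """Correct each input line; stop at the first empty line."""
    for line in lines:
        text = line.strip()
        if not text:
            break  # Enter without input terminates the program
        write(correct(text))
```

Calling `run_console(sys.stdin, correct=my_corrector)` after `import sys` would reproduce the console behavior, where `my_corrector` is whatever correction function is available.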



Languages

Language:C# 100.0%