munnellg / SpellChecker

A simple spell checker using bit vectors

Spell Checker

An incredibly simple, stupid spell checker written over the course of an hour or two.

Marijn Schraagen's PhD thesis "Aspects of Record Linkage" presents a method that uses bit vectors to reduce the search space when comparing records in a database. The method translates very easily to a spell checker, and this repository is such a translation.

Bit vectors encode strings as sequences of flags based on a vocabulary. For example, given the alphabet "abcd", the string "cab" would be encoded as { 1, 1, 1, 0 }, i.e. it contains an 'a', a 'b', and a 'c' but not a 'd'. The string "dab" would be encoded as { 1, 1, 0, 1 }.
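The encoding described above can be sketched in a few lines of C. This is an illustration only, not the repository's code: the function names `encode` and `print_flags`, the use of a 32-bit integer, and the convention that bit i stands for the i-th alphabet character are all assumptions made for the example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: set one flag per alphabet character that occurs in
 * `word`. Bit i corresponds to alphabet[i]; characters outside the alphabet
 * are ignored. Assumes the alphabet has at most 32 characters. */
static uint32_t encode(const char *word, const char *alphabet)
{
    uint32_t bits = 0;
    for (const char *c = word; *c != '\0'; c++) {
        const char *pos = strchr(alphabet, *c);
        if (pos != NULL)
            bits |= (uint32_t)1 << (pos - alphabet);
    }
    return bits;
}

/* Hypothetical helper: print the flags in alphabet order,
 * e.g. "{ 1, 1, 1, 0 }" for "cab" over "abcd". */
static void print_flags(uint32_t bits, const char *alphabet)
{
    printf("{");
    for (size_t i = 0; alphabet[i] != '\0'; i++)
        printf(" %u%s", (unsigned)((bits >> i) & 1u),
               alphabet[i + 1] != '\0' ? "," : "");
    printf(" }\n");
}
```

Calling `print_flags(encode("cab", "abcd"), "abcd")` prints `{ 1, 1, 1, 0 }`, and the same call with "dab" prints `{ 1, 1, 0, 1 }`. Note that the encoding discards letter order and repetition, so "the" and "teh" map to the same vector.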

Bit vectors can be used to construct a tree that can be searched for terms containing specific characters. Hence, given a spelling error such as "the" misspelled as "teh", the tree may be searched for correct words that contain the letters 't', 'e', and 'h'. The search is followed by a string similarity function that selects the most probable correction from the pool of candidates.

The search process can also allow for errors where characters are accidentally omitted from or added to words, e.g. "exit" being misspelled "ext". Enabling this tolerance is simply a matter of flipping bits in the bit vector, up to a maximum number of errors, while the tree is being searched.
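One way to picture the tolerant search is a binary tree with one level per vocabulary character, where each node branches on whether that character is present. The README does not describe the actual tree layout, so everything below is an assumed sketch: the `node` struct, the names `encode`, `insert` and `search`, the 26-letter lowercase vocabulary, and the small fixed-size leaf are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBITS 26      /* assumed vocabulary: one flag per letter 'a'..'z' */
#define MAX_WORDS 8   /* words kept per leaf; tiny on purpose for the sketch */

typedef struct node {
    struct node *child[2];         /* child[b]: flag at this depth equals b */
    const char *words[MAX_WORDS];  /* filled only at depth NBITS */
    int nwords;
} node;

static uint32_t encode(const char *w)
{
    uint32_t bits = 0;
    for (; *w != '\0'; w++)
        if (*w >= 'a' && *w <= 'z')
            bits |= (uint32_t)1 << (*w - 'a');
    return bits;
}

/* Follow the word's flags from the root, creating nodes as needed.
 * The pointer is stored, not copied, so `word` must outlive the tree. */
static void insert(node *root, const char *word)
{
    uint32_t bits = encode(word);
    node *n = root;
    for (int d = 0; d < NBITS; d++) {
        int b = (int)((bits >> d) & 1u);
        if (n->child[b] == NULL)
            n->child[b] = calloc(1, sizeof(node));
        if (n->child[b] == NULL)
            return;                /* out of memory: give up on this word */
        n = n->child[b];
    }
    if (n->nwords < MAX_WORDS)
        n->words[n->nwords++] = word;
}

/* Walk the tree along the query's flags. While error budget remains, also
 * descend into the opposite branch: that is the "flipped bit" case.
 * Returns the updated number of candidates written to `out`. */
static int search(const node *n, uint32_t bits, int depth, int budget,
                  const char **out, int max_out, int found)
{
    if (n == NULL)
        return found;
    if (depth == NBITS) {
        for (int i = 0; i < n->nwords && found < max_out; i++)
            out[found++] = n->words[i];
        return found;
    }
    int b = (int)((bits >> depth) & 1u);
    found = search(n->child[b], bits, depth + 1, budget, out, max_out, found);
    if (budget > 0)
        found = search(n->child[!b], bits, depth + 1, budget - 1,
                       out, max_out, found);
    return found;
}
```

With "the", "then", "exit" and "hat" inserted into a zero-initialised root, `search(root, encode("teh"), 0, 0, out, 4, 0)` returns only "the", because a budget of 0 demands an identical vector. Searching for "ext" finds nothing with a budget of 0, but a budget of 1 reaches "exit" by flipping the 'i' flag.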

Given a pool of candidate corrections retrieved from the tree, this program uses Jaro-Winkler distance to select the most likely spelling correction.
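The ranking step can be illustrated with a textbook Jaro-Winkler implementation. This is a generic sketch rather than the code in this repository: it computes a similarity in which higher means closer, uses the conventional prefix scale of 0.1 and prefix cap of 4, and the names `jaro`, `jaro_winkler` and `best_candidate` are made up for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Jaro similarity in [0, 1]; 1 means the strings are identical. */
static double jaro(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la == 0 && lb == 0) return 1.0;
    if (la == 0 || lb == 0) return 0.0;

    /* Characters match only if they are at most `window` positions apart. */
    size_t longest = la > lb ? la : lb;
    size_t window = longest / 2 > 0 ? longest / 2 - 1 : 0;

    unsigned char *ma = calloc(la, 1), *mb = calloc(lb, 1);
    if (ma == NULL || mb == NULL) { free(ma); free(mb); return 0.0; }

    size_t matches = 0;
    for (size_t i = 0; i < la; i++) {
        size_t lo = i > window ? i - window : 0;
        size_t hi = i + window + 1 < lb ? i + window + 1 : lb;
        for (size_t j = lo; j < hi; j++) {
            if (!mb[j] && a[i] == b[j]) {
                ma[i] = mb[j] = 1;
                matches++;
                break;
            }
        }
    }

    double score = 0.0;
    if (matches > 0) {
        /* Count matched characters that appear in a different order. */
        size_t out_of_order = 0, j = 0;
        for (size_t i = 0; i < la; i++) {
            if (!ma[i]) continue;
            while (!mb[j]) j++;
            if (a[i] != b[j]) out_of_order++;
            j++;
        }
        double m = (double)matches;
        score = (m / la + m / lb + (m - out_of_order / 2.0) / m) / 3.0;
    }
    free(ma);
    free(mb);
    return score;
}

/* Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 chars. */
static double jaro_winkler(const char *a, const char *b)
{
    double j = jaro(a, b);
    size_t prefix = 0;
    while (prefix < 4 && a[prefix] != '\0' && a[prefix] == b[prefix])
        prefix++;
    return j + prefix * 0.1 * (1.0 - j);
}

/* Return the candidate most similar to `query`, or NULL if there are none. */
static const char *best_candidate(const char *query,
                                  const char **cands, size_t n)
{
    const char *best = NULL;
    double best_score = -1.0;
    for (size_t i = 0; i < n; i++) {
        double s = jaro_winkler(query, cands[i]);
        if (s > best_score) { best_score = s; best = cands[i]; }
    }
    return best;
}
```

For the query "ext" and the candidates "text", "exit" and "next", `best_candidate` returns "exit": all three share three matching characters with the query, but only "exit" shares the prefix "ex" and so receives the Winkler boost.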

Making and Running

The application may be built on Linux simply by running make from the root directory. The binary takes one or two arguments as input. The first is mandatory and is a file containing the list of words to be added to the dictionary. These are stored in a case sensitive manner.

If a second argument is provided then the contents of that file are used to evaluate the performance of the spell checker. Otherwise the application launches in interactive mode and allows you to query for correct spellings of words.

The format for executing the program is:

./check dictionary.txt [evaluation.txt]

Performance

The standard datasets for evaluating a spell checker seem to be those surrounding Peter Norvig's "How to Write a Spelling Corrector". These are curated on Kaggle, but are not included in this repository. You will need to obtain them yourself.

The dictionary for corrections was built from the Debian Linux /usr/share/dict/american-english file. You will need to provide your own dictionary file if you want to test this application.

The total runtime and accuracy for the application on each of the spell checker datasets are reported below. The run times are extremely approximate as they include the time taken to build the dictionary and the time to check if a query term is "known".

The evaluation was performed using a case insensitive dictionary. Execution time was computed using the Linux time utility.

The evaluation produces two sets of scores:

  • The first is an overall evaluation of the performance of the spell checker. This includes errors where the correct word is not part of the dictionary, i.e. it is impossible for the spell checker to find the correct word.
  • The second excludes mistakes that are the result of words missing from the dictionary, i.e. it counts only cases where the spell checker could have found the right answer.
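The two scores can be read as the same count of correct answers divided by two different totals. The sketch below reflects that reading; it is inferred from the reported figures rather than taken from the source code, and the struct and function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical tally of one evaluation run. */
typedef struct {
    int queries;  /* total misspellings evaluated */
    int unknown;  /* queries whose correct word is missing from the dictionary */
    int correct;  /* queries corrected to the expected word */
} eval_counts;

/* "All Queries": every query counts, including those that cannot be fixed. */
static double all_queries_score(eval_counts c)
{
    return c.queries > 0 ? (double)c.correct / c.queries : 0.0;
}

/* "Known Queries": queries with an out-of-dictionary answer are excluded. */
static double known_queries_score(eval_counts c)
{
    int known = c.queries - c.unknown;
    return known > 0 ? (double)c.correct / known : 0.0;
}

/* Print both scores in the same layout as the program's report. */
static void print_scores(eval_counts c)
{
    printf("All Queries   : %f correct\n", all_queries_score(c));
    printf("Known Queries : %f correct\n", known_queries_score(c));
}
```

As a check against the aspell results reported in this document (531 queries, 24 unknown): assuming 273 correct answers, a figure inferred from the ratio and not printed by the program, the functions give 273 / 531 = 0.514124 and 273 / 507 = 0.538462.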

In some cases a "misspelled" word will not be corrected because it is included in the dictionary. For example, for contextual reasons the word "dongle" should be corrected to "dangle". However, because "dongle" is a valid word, this spell checker will not correct it. A count of such cases is reported by the program.

aspell

Processed 450 terms (24 unknown) : 531 queries (24 unknown)
20 "misspellings" were found in the dictionary
All Queries   : 0.514124 correct
Known Queries : 0.538462 correct

real	0m2.553s
user	0m2.512s
sys	0m0.036s

birkbeck

Processed 6136 terms (328 unknown) : 36133 queries (952 unknown)
3782 "misspellings" were found in the dictionary
All Queries   : 0.337392 correct
Known Queries : 0.346522 correct

real	1m26.304s
user	1m26.168s
sys	0m0.124s

spell-testset1

Processed 141 terms (4 unknown) : 270 queries (4 unknown)
5 "misspellings" were found in the dictionary
All Queries   : 0.659259 correct
Known Queries : 0.669173 correct

real	0m1.257s
user	0m1.208s
sys	0m0.044s

spell-testset2

Processed 363 terms (13 unknown) : 400 queries (15 unknown)
16 "misspellings" were found in the dictionary
All Queries   : 0.682500 correct
Known Queries : 0.709091 correct

real	0m2.044s
user	0m2.004s
sys	0m0.036s

wikipedia

Processed 1922 terms (85 unknown) : 2455 queries (98 unknown)
50 "misspellings" were found in the dictionary
All Queries   : 0.710794 correct
Known Queries : 0.740348 correct

real	0m11.102s
user	0m11.020s
sys	0m0.076s
