rxp90 / jsymspell

Java 8+ zero-dependency port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

Home Page:https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

JSymSpell

JSymSpell is a zero-dependency Java 8+ port of SymSpell

forthebadge

codecov Maven Central License MIT Open Source Love svg1

Overview

The Symmetric Delete spelling correction algorithm speeds up the process up by orders of magnitude.

It achieves this by generating delete-only candidates in advance from a given lexicon.

Setup

Add the latest JSymSpell dependency to your project

Getting Started

To start, we'll load the data sets of unigrams and bigrams:

Map<Bigram, Long> bigrams = Files.lines(Paths.get("src/test/resources/bigrams.txt"))
                                 .map(line -> line.split(" "))
                                 .collect(Collectors.toMap(tokens -> new Bigram(tokens[0], tokens[1]), tokens -> Long.parseLong(tokens[2])));
Map<String, Long> unigrams = Files.lines(Paths.get("src/test/resources/words.txt"))
                                  .map(line -> line.split(","))
                                  .collect(Collectors.toMap(tokens -> tokens[0], tokens -> Long.parseLong(tokens[1])));

Let's now create an instance of SymSpell by using the builder and load these maps. For this example we'll limit the max edit distance to 2:

SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(unigrams)
                                         .setBigramLexicon(bigrams)
                                         .setMaxDictionaryEditDistance(2)
                                         .createSymSpell();

And we are ready!

int maxEditDistance = 2;
boolean includeUnknowns = false;
List<SuggestItem> suggestions = symSpell.lookupCompound("Nostalgiais truly one of th greatests human weakneses", maxEditDistance, includeUnknowns);
System.out.println(suggestions.get(0).getSuggestion());
// Output: nostalgia is truly one of the greatest human weaknesses
// ... only second to the neck!

Custom String Distance Algorithms

By default, JSymSpell calculates Damerau-Levenshtein distance. Depending on your use case, you may want to use a different one.

Other algorithms to calculate String Distance that might result of interest are:

Here's an example using Hamming Distance:

SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(unigrams)
                                         .setStringDistanceAlgorithm((string1, string2, maxDistance) -> {
                                             if (string1.length() != string2.length()){
                                                 return -1;
                                             }
                                             char[] chars1 = string1.toCharArray();
                                             char[] chars2 = string2.toCharArray();
                                             int distance = 0;
                                             for (int i = 0; i < chars1.length; i++) {
                                                 if (chars1[i] != chars2[i]) {
                                                     distance += 1;
                                                 }
                                             }
                                             return distance;
                                         })
                                         .createSymSpell();

Custom character comparison

Let's say you are building a query engine for country names where the input form allows Unicode characters, but the database is all ASCII. You might want searches for Espana to return España entries with distance 0:

CharComparator customCharComparator = new CharComparator() {
    @Override
    public boolean areEqual(char ch1, char ch2) {
        if (ch1 == 'ñ' || ch2 == 'ñ') {
            return ch1 == 'n' || ch2 == 'n';
        }
        return ch1 == ch2;
    }
};
StringDistance damerauLevenshteinOSA = new DamerauLevenshteinOSA(customCharComparator);
SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(Map.of("España", 10L))
                                         .setStringDistanceAlgorithm(damerauLevenshteinOSA)
                                         .createSymSpell();
List<SuggestItem> suggestions = symSpell.lookup("Espana", Verbosity.ALL);
assertEquals(0, suggestions.get(0).getEditDistance());

Frequency dictionaries in other languages

As in the original SymSpell project, this port contains an English frequency dictionary that you can find at src/test/resources/words.txt If you need a different one, you just need to compute a Map<String, Long> where the key is the word and the value is the frequency in the corpus.

Map<String, Long> unigrams = Arrays.stream("A B A B C A B A C A".split(" "))
                                   .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
System.out.println(unigrams);
// Output: {a=5, b=3, c=2}

Built With

  • Maven - Dependency Management

Versioning

We use SemVer for versioning.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

  • Wolf Garbe

About

Java 8+ zero-dependency port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f

License:MIT License


Languages

Language:Java 100.0%