sp1ff / damerau-levenshtein

Comparison of a few algorithms for computing Damerau–Levenshtein distance

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Comparing Damerau-Levenshtein Implementations

Introduction

This repository contains three implementations of Damerau-Levenshtein string distance computation in C++ 17.

The Program

This package builds a program imaginatively named dl. It is installed via the usual autotools incantations. dl will compute Damerau-Levenshtein distance over a corpus of input strings provided on the command line with one of three algorithms (on which more below). For performance benchmarking, you can specify that the program should run over the corpus repeatedly and/or print timing information on completion. Say dl --help for full usage.

Discussion

The Damerau-Levenshtein distance between two strings A & B is the minimal number of insertions, deletions, single-character changes & transpositions needed to transform A into B (e.g. “act” -> “cat” -> “cart”, so the D-L distance between “act” & “cart” is two). In his original paper [1] Damerau claimed that 80% of the errors in the system which gave rise to his work could be accounted for by one of these four errors.

While it’s a popular algorithm for fuzzy searching, there doesn’t seem to be a “go-to” implementations out there in C/C++ (it is a feature request for the boost string library). There are several others on Github, but the ones I examined lacked both documentation on the algorithms used as well as test suites.

This repo contains implementations of what I understand to be the three major algorithms implementing Damerau-Levenshtein distance in C++.

Lowrance & Wagner’s Recursive Relation

This is the algorithm laid out in Wikipedia (at the time of this writing). It is due to Lowrance & Wagner [2] and seems to be the most widely known. In their paper, they prove that under a few (reasonable) conditions, a simple |A| x |B| recurrence relation is sufficient to compute the D-L distance between strings A & B. Their algorithm runs in O(|A|*|B|) time & space.

Ukkonen

Ten years later, Ukkonen substantially improved the performance of this calculation [3]. His paper contained two major advancements. He proved (again under conditions) that in order to compute the D-L distance, one need only compute the recurrence relation in a fairly tight band around its main diagonal (substantially decreasing the number of operations required). Next, he moved from the primary recurrence relation of Lowrance & Wagner to a dual problem of computing f(k,p) which is defined as the maximal index i on diagonal k for which the edit distance between A(1..i) & B(1..j) is p; this doesn’t reduce the time complexity but does reduce that of space. His algorithm runs in O(s*min(|A|,|B|)) (where s is the D-L distance between A & B) and space O(min(s,|A|,|B|))).

Berghel & Roach

Eleven years on, Berghel & Roach improved Ukkonen’s algorithm by deriving even tighter bounds on the region of the region of the recurrence relation that needs to be computed, leading to (in their application) a 42% speedup over Ukkonen. If s is the edit distance from A to B, and m & n their respective string lengths, and WLOG n >= m, define p to be 1/2s - 1/(2*(n-m)). Then the worst-case running time of this algorithm is O(n*p) and the expected running time is O(n+(p*s)).

After reading this paper, I spent some time searching the literature for references to it (looking for further improvements). I didn’t find any directly-related improvements. Later references either cited it as the best known to the author(s) for computing Damerau-Levenshtein distance, or as a point of interest in developing different string metrics.

In my benchmarking, I’ve found it important to avoid the use of std::vector, which surprises me. I’ve instead just used flat arrays & done the two-dimensional indexing manually.

All the other references check for common prefixes before starting the algorithm, which I haven’t tested, yet.

References

  1. Fred J. Damerau, A Technique for Computer Detection and Correction of Spelling Errors, Communications of the ACM, 7 (1964) No. 3, 171-176.
  2. Roy Lowrance and Robert A. Wagner, An Extension of the String-to-String Correction Problem, Journal of the Association for Computing Machinery, 22 (1975), No 2, 177-183.
  3. Esko Ukkonen, Algorithms for Approximate String Matching, Information and Control, 64 (1985) 100–118.
  4. Hal Berghel and David Roach, An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm, ACM Transactions on Information Systems, 14 (1996) No. 1, 94-106.

About

Comparison of a few algorithms for computing Damerau–Levenshtein distance

License:GNU General Public License v3.0


Languages

Language:Shell 60.6%Language:C++ 35.4%Language:Makefile 2.7%Language:M4 1.3%