seatgeek / fuzzywuzzy

Fuzzy String Matching in Python

Home Page:http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Difflib and python-Levenshtein give different ratios in some cases

theodickson opened this issue · comments

This is probably known and accepted, but since it's not mentioned in the docs I'll raise this anyway. In certain edge cases, difflib and python-Levenshtein give different ratios.
E.g.:

>>> from fuzzywuzzy import fuzz, StringMatcher
>>> import difflib

#As long as python-Levenshtein is available, that will be used for the following:
>>> fuzz.ratio("ababab", "aaaaab")
67

#Switch to difflib:
>>> fuzz.SequenceMatcher = difflib.SequenceMatcher
>>> fuzz.ratio("ababab", "aaaaab")
33

I don't know how python-Levenshtein works, but difflib first chooses the left-most longest block in the first sequence that matches any block in the second sequence. Then among the longest matches in the second sequence to this chosen block, it will choose the left-most.

What this means is that it is forced to choose the first "ab" in the first string, and then matches it to the only "ab" in the second, at the end. This means it cannot recurse either to the left or the right to look for more matching blocks, since this match is at the very beginning of the first string and at the very end of the second. So the ratio calculated is 33, since it has matched one third (4/12) of the total characters.

I'm assuming that in python-Levenshtein, it makes a decision about which matches to choose based not just on maximality and left-most-ness, but also on whether the pair of matches chosen allows subsequent recursion. This is because it scores 67, so must have matched eight of 12 characters, i.e. it must have matched "ab" and then two subsequent "a"s in each sequence, meaning it must have chosen the last "ab" in the first string. Either this or it can recurse to opposite sides in the two sequences but that surely isn't happening.

To show this, if we change the second sequence to "abaaaa", difflib will also score 67 (since it matches the first two characters of each sequence then recurses to the right). See as follows:

>>> fuzz.ratio("ababab", "abaaaa")
67

#And switching pack to python-Levenshtein, no change:
>>> fuzz.SequenceMatcher = fuzzywuzzy.StringMatcher.StringMatcher
>>> fuzz.ratio("ababab", "abaaaa")
67

Yeah, I guess they are slightly different sometimes. For our use-case, python-Levenshtein works well, and it's what we suggest people use in production.

The difference is because difflib uses the Ratcliff-Obershelp algorithm, not the Levenshtein distance. Probably the readme should be updated, because not always Levenshtein distance is used.