fuzzyr provides fuzzy string matching in R. Whereas stringdist provides basic comparison methods that are most useful for single-word strings, fuzzyr provides methods for comparing more complex strings, like organization names and recording artists, among other cases. The package will be on CRAN in the near future. For now, install with devtools::install_github('nickbloom/fuzzyr').
fuzzyr currently provides four basic comparison functions. Like the fuzzywuzzy module for Python, all fuzzyr functions return string similarity scores between 0 and 100.
fuzz_ratio is the simplest function. It returns the ratio of the number of characters shared by two strings to the combined length of the two strings.
library('fuzzyr')
fuzz_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 71
fuzz_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 50
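To make the idea of a "shared characters" ratio concrete, here is a minimal base R sketch. It is not fuzzyr's implementation: the helper name shared_char_ratio is hypothetical, and counting shared characters as a multiset overlap (ignoring their order) is an assumption, so its scores will not necessarily match the fuzz_ratio output above.

```r
# Hypothetical helper, NOT part of fuzzyr: scores two strings by the share of
# characters they have in common, on a 0-100 scale.
# Assumption: "shared characters" means the multiset overlap, ignoring order.
shared_char_ratio <- function(a, b) {
  if (nchar(a) == 0 && nchar(b) == 0) return(100)
  if (nchar(a) == 0 || nchar(b) == 0) return(0)

  # Count how often each character occurs in each string.
  counts_a <- table(strsplit(a, "")[[1]])
  counts_b <- table(strsplit(b, "")[[1]])

  # For every character present in both strings, take the smaller count.
  common <- intersect(names(counts_a), names(counts_b))
  shared <- sum(pmin(as.vector(counts_a[common]), as.vector(counts_b[common])))

  # Each shared character is counted once per string, hence the factor of 2.
  round(100 * 2 * shared / (nchar(a) + nchar(b)))
}

shared_char_ratio("ab", "bc")
```

Because this sketch ignores character order, two anagrams score 100. A measure that respects order, such as one based on edit distance, would penalize them.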
substr_ratio scores two strings by the ratio of shared words to the combined length of both strings. It also checks whether unshared words are substrings of each other (e.g. "med" and "medical" in the first example below), and removes them from the denominator in the score calculation.
substr_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 100
substr_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 8
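The substring adjustment can be sketched in base R as follows. This is an illustration under stated assumptions, not fuzzyr's code: the helper names tokenize_words, is_partial, and substring_aware_ratio are hypothetical, length is measured in words rather than characters, and the handling of edge cases is a guess. Its scores will not necessarily match the substr_ratio output above.

```r
# Hypothetical helpers, NOT part of fuzzyr.
tokenize_words <- function(s) {
  unique(strsplit(trimws(tolower(s)), "[[:space:]]+")[[1]])
}

# TRUE if `word` is contained in, or contains, any word in `others`.
is_partial <- function(word, others) {
  if (length(others) == 0) return(FALSE)
  contained <- grepl(word, others, fixed = TRUE)
  contains <- vapply(others, function(o) grepl(o, word, fixed = TRUE), logical(1))
  any(contained | contains)
}

substring_aware_ratio <- function(a, b) {
  tok_a <- tokenize_words(a)
  tok_b <- tokenize_words(b)
  shared <- intersect(tok_a, tok_b)
  only_a <- setdiff(tok_a, tok_b)
  only_b <- setdiff(tok_b, tok_a)

  # Drop unshared words that are substrings of each other (e.g. "med"/"medical").
  keep_a <- only_a[!vapply(only_a, is_partial, logical(1), others = only_b)]
  keep_b <- only_b[!vapply(only_b, is_partial, logical(1), others = only_a)]

  # Assumption: the denominator counts shared words once per string plus the
  # unshared words that survived the substring check.
  denom <- 2 * length(shared) + length(keep_a) + length(keep_b)
  if (denom == 0) {
    # Nothing left to penalize: 100 if there were any words at all, else 0.
    return(if (length(tok_a) + length(tok_b) > 0) 100 else 0)
  }
  round(100 * 2 * length(shared) / denom)
}

substring_aware_ratio("med ctr", "medical ctr")
```

In this sketch "med" and "medical" cancel out, so the two strings in the usage line are treated as a full match.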
Both token_set_ratio and token_sort_ratio treat each word in a string as a "token." These tokens are compared across the two strings.
token_set_ratio compares two strings and returns the ratio of the set of shared tokens (i.e. only the words appearing in both strings) to the length of the first string.
token_set_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 88
token_set_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 18
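A set-based token comparison can be sketched in a few lines of base R. This is a simplified illustration, not fuzzyr's implementation: the names tokenize and shared_token_ratio are hypothetical, and measuring the "length of the first string" in tokens is an assumption, so the scores will not necessarily match the token_set_ratio output above.

```r
# Hypothetical helpers, NOT part of fuzzyr.
# Split a string into its set of lower-cased, whitespace-separated tokens.
tokenize <- function(s) {
  unique(strsplit(trimws(tolower(s)), "[[:space:]]+")[[1]])
}

# Share of the first string's tokens that also appear in the second string.
# Assumption: length is counted in tokens, not characters.
shared_token_ratio <- function(a, b) {
  tok_a <- tokenize(a)
  tok_b <- tokenize(b)
  if (length(tok_a) == 0) return(0)
  round(100 * length(intersect(tok_a, tok_b)) / length(tok_a))
}

shared_token_ratio("suny erie county", "erie county suny medical")
```

Because the denominator comes from the first string only, this kind of score is asymmetric: swapping the arguments can change the result.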
token_sort_ratio does the same thing as token_set_ratio, but sorts the tokens alphabetically first.
token_sort_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 93
token_sort_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 60
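The effect of sorting tokens before comparing can be illustrated with base R. This sketch swaps in an edit-distance similarity from utils::adist for the comparison step, which is an assumption and not necessarily what fuzzyr does. The names sort_tokens and sorted_edit_similarity are hypothetical, and the scores will not necessarily match the token_sort_ratio output above.

```r
# Hypothetical helpers, NOT part of fuzzyr.
# Lower-case a string, split it on whitespace, and rejoin the sorted tokens.
sort_tokens <- function(s) {
  tokens <- strsplit(trimws(tolower(s)), "[[:space:]]+")[[1]]
  paste(sort(tokens), collapse = " ")
}

# Compare the sorted strings with a Levenshtein-based similarity (0-100).
# Assumption: edit distance stands in for the package's comparison step.
sorted_edit_similarity <- function(a, b) {
  sa <- sort_tokens(a)
  sb <- sort_tokens(b)
  longest <- max(nchar(sa), nchar(sb))
  if (longest == 0) return(100)
  d <- utils::adist(sa, sb)[1, 1]
  round(100 * (1 - d / longest))
}

sort_tokens("suny buffalo erie")
sorted_edit_similarity("buffalo suny", "suny buffalo")
```

Sorting removes word order as a source of difference, so strings that contain the same words in a different order compare as identical.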