fuzzyr: Fuzzy String Matching for R

fuzzyr provides fuzzy string matching in R. Whereas stringdist provides basic comparison methods that are most useful for single-word strings, fuzzyr provides methods for comparing more complex strings, such as organization names, recording artists, and many other multi-word cases. The package will be on CRAN in the near future. For now, install with devtools::install_github('nickbloom/fuzzyr').

fuzzyr currently provides four basic comparison functions. Like the fuzzywuzzy module for Python, all fuzzyr functions return string similarity scores between 0 and 100.
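
Because each comparison function takes two strings and returns a single score on the same 0 to 100 scale, the functions can be swapped freely inside a ranking loop. The sketch below is one possible way to do that. rank_matches is a hypothetical helper written for this example and is not part of fuzzyr; it only assumes that the scorer accepts two strings and returns one number, as the examples in this README do.

```r
# Hypothetical helper (not a fuzzyr function): score every candidate against
# a query with the supplied scorer and return the candidates, best first.
rank_matches <- function(query, candidates, scorer) {
  scores <- vapply(
    candidates,
    function(cand) as.numeric(scorer(query, cand)),
    numeric(1),
    USE.NAMES = FALSE
  )
  ord <- order(scores, decreasing = TRUE)
  data.frame(candidate = candidates[ord], score = scores[ord],
             stringsAsFactors = FALSE)
}

# Only runs when fuzzyr is installed; any of the four functions could be used.
if (requireNamespace("fuzzyr", quietly = TRUE)) {
  print(rank_matches(
    "suny buffalo erie county medical ctr",
    c("suny erie county med ctr at buffalo", "lorem ipsum dolor sit suny"),
    fuzzyr::fuzz_ratio
  ))
}
```

Passing the scorer as an argument keeps the helper independent of any one comparison function, so different functions can be tried on the same data.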

fuzz_ratio

fuzz_ratio is the simplest function. It returns the ratio of the number of characters shared by two strings to the length of the two strings, scaled to the 0 to 100 range.

library('fuzzyr')
fuzz_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 71
fuzz_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 50
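
For intuition about character-level scores on a 0 to 100 scale, here is a minimal base R sketch. It uses a different, swapped-in technique: normalized Levenshtein edit distance computed with utils::adist. This is an assumption made for illustration, not fuzzyr's formula, so it will not necessarily reproduce the scores shown above. edit_similarity is a hypothetical name.

```r
# Hypothetical helper (not part of fuzzyr): a 0-100 similarity derived from
# Levenshtein edit distance, normalized by the length of the longer string.
edit_similarity <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(100)  # two empty strings are identical
  distance <- utils::adist(a, b)[1, 1]
  round(100 * (1 - distance / longest))
}

edit_similarity("medical", "medical")
## [1] 100
edit_similarity("med", "medical")  # 4 insertions over 7 characters
## [1] 43
```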

substr_ratio

substr_ratio returns the score of two strings based on the ratio of shared words to the length of both strings combined. This function also checks whether unshared words are substrings of each other (e.g. "med" and "medical" in the first example below), and removes them from the denominator in the score calculation.

substr_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 100
substr_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 8
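
The word-level bookkeeping described above can be sketched in base R. The helpers below are hypothetical and written only for illustration; they are not fuzzyr's internals. They split the strings into words, separate shared from unshared words, and flag unshared words that contain, or are contained in, an unshared word from the other string. No score is computed, since the exact formula is not given here.

```r
# Hypothetical helpers for illustration; not fuzzyr's implementation.
tokenize <- function(x) unique(strsplit(tolower(x), "[[:space:]]+")[[1]])

# TRUE for each token that contains, or is contained in, any token in `others`.
has_substring_partner <- function(tokens, others) {
  vapply(tokens, function(tok) {
    any(vapply(others, function(other) {
      grepl(tok, other, fixed = TRUE) || grepl(other, tok, fixed = TRUE)
    }, logical(1)))
  }, logical(1), USE.NAMES = FALSE)
}

a <- tokenize("suny erie county med ctr at buffalo")
b <- tokenize("suny buffalo erie county medical ctr")

shared <- intersect(a, b)  # words present in both strings
only_a <- setdiff(a, b)    # words found only in the first string
only_b <- setdiff(b, a)    # words found only in the second string

shared
## [1] "suny"    "erie"    "county"  "ctr"     "buffalo"
only_a[has_substring_partner(only_a, only_b)]
## [1] "med"
```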

token_set_ratio

Both token_set_ratio and token_sort_ratio treat each word in a string as a "token." These tokens are compared across the two strings.

token_set_ratio compares two strings and returns the ratio of the set of shared tokens (i.e., only the words appearing in both strings) to the length of the first string.

token_set_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 88
token_set_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 18

token_sort_ratio

token_sort_ratio does the same thing as token_set_ratio, but sorts the tokens alphabetically first.

token_sort_ratio("suny erie county med ctr at buffalo", "suny buffalo erie county medical ctr")
## [1] 93
token_sort_ratio("lorem ipsum dolor sit suny", "suny buffalo erie county medical ctr")
## [1] 60
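
To see what alphabetical token sorting does to the two example strings, here is a small base R sketch. sort_tokens is a hypothetical helper, not a fuzzyr function, and it assumes tokens are separated by whitespace. How fuzzyr scores the sorted tokens afterwards is not shown.

```r
# Hypothetical helper (not a fuzzyr function): rebuild a string with its
# whitespace-separated tokens in alphabetical order.
sort_tokens <- function(x) {
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  # method = "radix" sorts in C-locale order, independent of the session locale
  paste(sort(tokens, method = "radix"), collapse = " ")
}

sort_tokens("suny erie county med ctr at buffalo")
## [1] "at buffalo county ctr erie med suny"
sort_tokens("suny buffalo erie county medical ctr")
## [1] "buffalo county ctr erie medical suny"
```

After sorting, the two strings differ only by the extra word "at" and by "med" versus "medical", which shows why word order stops mattering once tokens are sorted.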
