Simple vectorised Fuzzy Matching function in R based on Levenshtein distance
Base R function agrep()
has several limitation in case vectorize matching:
- Returns all matches:
1.1 Either with number of match (1)
1.2 Either the match case (2)
fruits <- c("apple", "apples", "aple", "melone", "applejuice", "peanapple", "pear")
agrep("apple", fruits) #(1)
#>[1] 1 2 3 5 6
agrep("apple", fruits, value = TRUE) #(2)
#>[1] "apple" "apples" "aple" "applejuice" "peanapple"
- Returns "matches substrings of each element of x" (see documentation
agrep()
).
But someone is likely to like strict matching: "apple", "apples","aple", "applejuice", "peanapple" - Enythere someone also likes vectorize form where in f(x, y): x, y are vectors for matching and f(x, y) returns vector of best matches from y in sequance of x
- f(x, y) should return
NA
if Levenshtein distance of matching is more than some max value
Simple function get_match()
based on adist()
:
library("purrr")
get_match <- function(x, y, max = 3, cost = 1) {
dist_mtrx <- adist(x, y, partial = FALSE, costs = cost)
dimnames(dist_mtrx) <- list(x, y)
get_best <- function(n, dist_mtrx, max){
min_dist <- dist_mtrx[n, ] %>% min()
if(min_dist < max) res <- dist_mtrx[n, ] %>% which.min() %>% names()
else res <- NA
return(res)
}
1:nrow(dist_mtrx) %>%
map_chr(~get_best(., dist_mtrx, max = max))
}
#test result
fruits1 <- stringr::fruit
fruits2 <- c("pears","apples", "melone", "per", "apple", "orange")
get_match(fruits2, fruits1)
#>[1] "pear" "apple" NA "pear" "apple" "orange"
Same result but a little bit faster using only base R
get_match <- function(x, y, max = 3, cost = 1) {
dist_mtrx <- adist(x, y, partial = FALSE, costs = cost)
dimnames(dist_mtrx) <- list(x, y)
get_best <- function(n, dist_mtrx, max){
min_dist <- min(dist_mtrx[n, ])
if(min_dist < max) res <- names(which.min(dist_mtrx[n, ]))
else res <- NA
return(res)
}
sapply(X = 1:nrow(dist_mtrx), FUN = function(x) get_best(x, dist_mtrx, max = max))
}
#test result
fruits1 <- stringr::fruit
fruits1
fruits2 <- c("pears","apples", "melone", "per", "apple", "orange")
get_match(fruits2, fruits1)
#>[1] "pear" "apple" NA "pear" "apple" "orange"
© 2020 GitHub, Inc.