
agrep: only return best match(es)

Developer https://www.devze.com 2023-02-27 07:22 Source: Internet

I'm using the 'agrep' function in R, which returns a vector of matches. I would like a function similar to agrep that returns only the best match, or all of the best matches if there are ties. Currently, I do this by calling the 'sdists()' function from the 'cba' package on each element of the resulting vector, but this seems very redundant.

/edit: here is the function I'm currently using. I'd like to speed it up, as it seems redundant to calculate distance twice.

library(cba)
word <- 'test'
words <- c('Teest','teeeest','New York City','yeast','text','Test')
ClosestMatch <- function(string,StringVector) {
  matches <- agrep(string,StringVector,value=TRUE)
  distance <- sdists(string,matches,method = "ow",weight = c(1, 0, 2))
  matches <- data.frame(matches,as.numeric(distance))
  matches <- subset(matches,distance==min(distance))
  as.character(matches$matches)
}

ClosestMatch(word,words)


The agrep function matches strings using a generalized Levenshtein edit distance. The RecordLinkage package has a C function to calculate the Levenshtein distance, which can be used directly to speed up your computation. Here is a reworked ClosestMatch function that is around 10x faster:

library(RecordLinkage)

ClosestMatch2 = function(string, stringVector){

  # levenshteinSim() returns a similarity, so the best matches have the maximum value
  distance = levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]

}
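
If installing an extra package is not an option, the same idea can be sketched with base R's adist(), which computes a generalized Levenshtein distance over whole strings. The helper name ClosestMatchBase and its ignore.case argument are assumptions made for this illustration, not part of the original answer, and the results may differ from agrep, which also accepts approximate substring matches.

```r
# Hypothetical helper (name not from the original post), base R only.
ClosestMatchBase <- function(string, stringVector, ignore.case = FALSE) {
  # adist() returns a 1 x length(stringVector) matrix of edit distances
  d <- as.vector(adist(string, stringVector, ignore.case = ignore.case))
  # keep every element tied for the smallest distance
  stringVector[d == min(d)]
}

words <- c('Teest','teeeest','New York City','yeast','text','Test')
ClosestMatchBase('test', words)                      # "text" "Test"
ClosestMatchBase('test', words, ignore.case = TRUE)  # "Test"
```

Because the distance is computed only once, this avoids the duplicated work in the original ClosestMatch, at the cost of scoring every candidate rather than only those agrep accepts.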


When this answer was written, the RecordLinkage package had been removed from CRAN. The stringdist package can be used instead:

library(stringdist)

ClosestMatch2 = function(string, stringVector){

  # amatch() returns the index of the closest element; maxDist=Inf accepts any distance
  stringVector[amatch(string, stringVector, maxDist=Inf)]

}
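
Note that amatch() returns a single position, the first one when several candidates share the smallest distance, so ties are dropped. A minimal sketch that keeps all ties, assuming stringdist is installed, could use stringdist() directly. The helper name ClosestMatches and its method argument are hypothetical and not from the original answer.

```r
library(stringdist)

# Hypothetical helper (name not from the original post): return every
# element of stringVector whose distance to `string` equals the minimum.
ClosestMatches <- function(string, stringVector, method = "osa") {
  # one distance per candidate; "osa" is stringdist's default method
  d <- stringdist(string, stringVector, method = method)
  stringVector[d == min(d)]
}

words <- c('Teest','teeeest','New York City','yeast','text','Test')
ClosestMatches('test', words)  # "text" "Test"
```

Both 'text' and 'Test' are one edit away from 'test' here, because the comparison is case-sensitive.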