Comparing string distance based on precomputed hashes_问答_开发者

Comparing string distance based on precomputed hashes

开发者 https://www.devze.com 2023-01-11 03:26 出处：网络

I have a large list (over 200,000) of strings that I\'d like to compare to a given string. The given string is inserted by a u开发者_开发问答ser, so it may be slightly incorrect.

I have a large list (over 200,000) of strings that I'd like to compare to a given string. The given string is inserted by a u开发者_开发问答ser, so it may be slightly incorrect.

What I was hoping to do was create some kind of precomputed hash on each string on adding it to the list. This hash would contain information such as string length, addition of all the characters etc.

My question is, does something like this already exist? Surely there would be something that lets me avoid running Levenshtein distance on every string in the list?

Or maybe there's a third option I haven't thought of yet?

Sounds like you want to use a fuzzy hash of some sort. There are lots of hash functions available that can do things like this. The classic old "SOUNDEX" algorithm might even work.

Another thought - if you estimate that the probability of an incorrect entry is low, then you might actually be fine having a direct hit 99.9% of the time, falling back to SOUNDEX which might catch 90% of the remaining cases and then searching the whole list for the remaining 0.01% of the time.

Also worth checking this discussion: How to find best fuzzy match for a string in a large string database