I'm searching for开发者_开发百科 a "bad" hash function: I'd like to hash strings and put similar strings in one bucket.
Can you give me a hint where to start my research? Some methods or algorithm names...
Your problem is not an easy one. Two ideas:
This solution might be overly complicated but you could try a Fourier transform. Treat your input text as a series of samples of a function and then run a Fourier transform to convert your input to the frequency domain. The low frequency part is the general jist of the text and the high frequency part is the tiny changes.
This is somewhat similar to what jpeg compression does: Throw away the details and just leave the important stuff. If you have two almost-identical images and you jpeg compress them greatly, you usually get the same output.
pHash uses a method similar to this.
Again, this is going to be a pretty complicated way to do it.
Second idea: minHash
The idea for minHash is that you pick some markers that are likely to be the same when the inputs are the same. Then you compute a vector for the outputs of all the markers. If two inputs have similar vectors then the inputs are similar.
For example, count how many times the word "the" appears in the text. If it's even, 0, if it's odd, 1. Now count how many times the word "math" shows up in the text. Again, 0 for even, 1 for odd. Do that for a lot of words.
Now you process all the texts and each one gives you an output like "011100010101" or whatever. If two texts are similar then they will have similar outputs strings, differing by just 1 or two bits. You can use a multi-variate partition trie (MVP) to search the outputs efficiently.
This, too, might be overkill for your problem.
It depends on what you mean by "similar string" ?
But if you look for such a bad one, you have to build it yourself.
Example :
you can create 10 buckets (0 to 9) and group the strings by theirs length mod 10
Use a strcmp() like function and group them by the differences with a defined String
精彩评论