I'm trying to come up with the fastest way to make search suggestions. At first I thought a Levenstein UDF function combined with a mysql table would do the job. But using levenshtein, mysql would have to go over every row in the table (tons of words) which would make the query really slow.
Now I recently installed and started to use Sphinx (http://sphinxsearch.com/) for fulltext searching mainly because of its performance and tight mysql integration with SphinxSE.
So I asked myself if I can implement a "did you mean" algorithm using sphinx to boost performance somehow, and I think I found a simple one. Basically i take all the keywords I want to correct, put a space between each letter, then put it in the sphinx index. If the word is 'keyword' it becomes 'k e y w o r d'. Now when the user enters a word I split it in to letters and search in the sphinx index for a record (I just need one) that matches any of the letters provided. The best part is that sphinx is very good on calculating relevance (weight) of the matched rows, so the best match will always 开发者_高级运维have the biggest weight (I think). It also accounts for word (letters in my case) positions so the best match will be in that order.
With the sphinx query I get the most similar word in my keywords list. Then I check it with php using the extended Levenshtain distance which accounts for rearranged letters https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance . If the string distance is lower than 2 (and != 0) then suggest the word. Otherwise don't suggest anything.
Is there a problem with my idea? Something I didn't think of? Any expected glitches with the sphinx query, and quirks with the sphinx relevance calculation which woudn't give the best match? Please correct me if I'm mistaking somewhere.
I can't see a problem with your idea. Go for it. Just to point out that your method is only relevant if you want to override the builtin behaviour that is very similar to LD.
For example, with sphinx 1.10-beta, you can specify min_infix_len and expand_keywords and use sphinx's builtin weighting methods (BM25 and some proprietary code) for good results. http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/
Don't forget to memcache these queries, and create a warm-up script.
You could just log every search query that's entered, along with a next search query that the user enters.
Lets assume that lots of users search for rhinosorous but actually mean rhinoceros. Because users will correct their query, this will mean there will be a lot of rhinosorous queries with rhinoceros as the next query.
You can select suggestions like this:
SELECT id, query, next_query, COUNT(id) AS count FROM queries GROUP BY query ORDER BY COUNT(id) DESC
If the top result has a count
that's a high % of all queries for that keyword, display a message.
I haven't tested this, its just an idea.
I think it will be interesting to you to read what Andrew Aksyonoff (author of Sphinx) thinks about implementation of this task via Sphinx - http://habrahabr.ru/blogs/sphinx/61807/ (use translator to translate from russian)
精彩评论