This question has been asked in various ways before, but I'm wondering if people who have experience with automatic search term suggestion could offer advice on the most useful and efficient approaches. Here's the scenario:
I'm just starting on a website for a book that is a dictionary of terms (roughly 1,000 entries, with 300-word explanations on average). Many of the terms are fairly obscure, and it is likely that many visitors to the site will not know how to spell them. The publisher wants to make full-text search available for every entry, so I'm hoping to implement a search engine with spelling correction. The main site will probably be done in a PHP framework (or possibly Django) with a MySQL database.
Can anyone with experience in this area give advice on the following:
- With a set corpus of this nature, should I be using something like Lucene or Sphinx for the search engine?
- As far as I can tell, neither of these has a built-in suggestion function. So it seems I will need to integrate one or more of the following. What are the advantages / disadvantages of:
  - Suggestion requests through Google's search API
  - A phonetic comparison algorithm like metaphone() in PHP
  - A spell checking system like Aspell
  - A simpler spelling script such as Peter Norvig's
  - A Levenshtein function
I'm concerned about the specificity of my corpus, and don't want Google to start suggesting things that have nothing to do with this book. I'm also not sure whether I should try to use both a metaphone comparison and a Levenshtein comparison, or some other combination of techniques to capture both typos and attempts at phonetic spelling.
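As a rough illustration of the combination described above, here is a minimal sketch in Python (not PHP, purely for brevity) that checks a query against a fixed list of headwords using both edit distance and a phonetic key. Several things here are assumptions: a simplified Soundex stands in for metaphone because it is short enough to write inline, the function names and the `max_distance` threshold are hypothetical, and the sample vocabulary is made up. Because candidates come only from the book's own headwords, nothing unrelated to the corpus can be suggested.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

# Consonant classes used by Soundex; vowels, h, w and y carry no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word: str) -> str:
    """Simplified Soundex key: first letter plus up to three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != last:
            key += digit
        if c not in "hw":  # h and w do not separate repeated codes
            last = digit
    return (key + "000")[:4]

def suggest(query: str, vocabulary: List[str],
            max_distance: int = 2, limit: int = 5) -> List[str]:
    """Return headwords that are close in spelling OR sound like the query."""
    q = query.lower()
    q_key = soundex(q)
    scored = []
    for term in vocabulary:
        distance = levenshtein(q, term.lower())
        sounds_alike = soundex(term) == q_key
        if distance <= max_distance or sounds_alike:
            # Sort by distance first, then prefer phonetic matches.
            scored.append((distance, not sounds_alike, term))
    scored.sort()
    return [term for _, _, term in scored[:limit]]

if __name__ == "__main__":
    headwords = ["synecdoche", "metonymy", "anaphora", "zeugma"]
    print(suggest("sinekdoky", headwords))  # phonetic attempt
    print(suggest("zuegma", headwords))     # transposition typo
```

With only about 1,000 headwords, a linear scan like this is cheap. The edit-distance test is the one that catches typos, while the phonetic key catches guesses that are many edits away from the correct spelling.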
You might want to consider Apache Solr, which is a web service encapsulation of Lucene, and runs in a J2EE container like Tomcat. You'll get term suggestion, spell check, porting, stemming and much more. It's really very nice.
See here for a full listing of its features relating to queries.
There are Django and PHP libraries for Solr.
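If you would rather talk to Solr directly over HTTP than through a client library, a spell-check request is just a URL with a few query parameters. The sketch below only builds that URL and does not send it. It assumes a core named `dictionary` (a hypothetical name), Solr's default port 8983, and a request handler at `/select` that has the spellcheck component enabled in its configuration. The function name is made up for illustration.

```python
from urllib.parse import urlencode

def build_spellcheck_url(base_url: str, user_query: str, count: int = 5) -> str:
    """Build a Solr query URL that also asks for spelling suggestions.

    base_url is the core's root, e.g. http://localhost:8983/solr/dictionary
    (the core name "dictionary" is an assumption for this example).
    """
    params = {
        "q": user_query,               # the search itself
        "spellcheck": "true",          # turn the spellcheck component on
        "spellcheck.q": user_query,    # text to check for misspellings
        "spellcheck.count": count,     # max suggestions per term
        "spellcheck.collate": "true",  # also return a corrected full query
        "wt": "json",                  # response format
    }
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"

if __name__ == "__main__":
    print(build_spellcheck_url("http://localhost:8983/solr/dictionary", "sinekdoky"))
```

Since Solr builds its suggestions from the terms in your own index, they stay within the book's vocabulary.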
I wouldn't recommend using Google Suggest for such a specialised corpus anyway, and with Solr you won't need it.
Hope this helps.