I need a search engine for a website I am building, and I decided to try writing my own using PHP and MySQL. Currently it looks like the viable option is to create three tables:
one for words, one for pages, and one reference table. When inserting a new article, I would scan the text, put the separate words in the words table, and reference those words in the third table.
Then, when a search is made, the script should return the pages with the most indexed occurrences of the given word.
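The three-table layout described above can be sketched roughly as follows. This is only an illustration: it uses Python with an in-memory SQLite database as a stand-in for PHP/MySQL so that it runs without a server, and the table names (`pages`, `words`, `page_words`), the column names, and the helper functions are all assumptions, not something taken from an existing project.

```python
import re
import sqlite3


def create_schema(conn):
    # Hypothetical schema: one table for pages, one for words, one cross table.
    conn.executescript("""
        CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
        CREATE TABLE page_words (
            page_id     INTEGER NOT NULL REFERENCES pages(id),
            word_id     INTEGER NOT NULL REFERENCES words(id),
            occurrences INTEGER NOT NULL,
            PRIMARY KEY (page_id, word_id)
        );
    """)


def tokenize(text):
    # Very naive tokenizer: lowercase runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def index_page(conn, title, text):
    # Store the page, then one cross-table row per distinct word with its count.
    page_id = conn.execute(
        "INSERT INTO pages (title) VALUES (?)", (title,)
    ).lastrowid
    counts = {}
    for word in tokenize(text):
        counts[word] = counts.get(word, 0) + 1
    for word, n in counts.items():
        # SQLite syntax; the MySQL equivalent would be INSERT IGNORE.
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO page_words (page_id, word_id, occurrences) "
            "VALUES (?, ?, ?)",
            (page_id, word_id, n),
        )
    return page_id


def search(conn, word):
    # Rank pages purely by how often the word occurs on them.
    rows = conn.execute(
        """
        SELECT p.title, pw.occurrences
        FROM page_words AS pw
        JOIN words AS w ON w.id = pw.word_id
        JOIN pages AS p ON p.id = pw.page_id
        WHERE w.word = ?
        ORDER BY pw.occurrences DESC, p.id ASC
        """,
        (word.lower(),),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    index_page(conn, "Short note", "php search")
    index_page(conn, "Long article",
               "search search search and many other unrelated words about mysql")
    print(search(conn, "search"))
```

With these two sample pages the long article ranks first simply because it repeats the word more often, regardless of how focused each page is.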
However, it looks like this approach can only rank results by the number of keyword occurrences. The more often a keyword is used in an article, the higher it will appear on the result page. So an article with fewer occurrences may be more relevant to the search but will be placed lower in the results.
So the question is: is there a better way to create a custom search engine using PHP/MySQL? Also, if you do not have access to the server to install a search engine like Sphinx, what is the best way to tackle this problem?
I've built a search engine in much the same way, with a cross table linking each word to each page in which it occurs. In that table I also stored how often the word appeared on the page relative to the length of the page; in other words, the percentage of the words on the page that were that word. That makes it easier to apply a weight to your search results. Unfortunately, it is hard to determine whether a page is more relevant in other ways. Google uses tricks such as the distance between two keywords on a page: if they are close to each other, they are probably related. If a keyword appears higher on the page, it is probably more important, and so on.
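As a rough illustration of the length-relative weight described above, here is a minimal Python sketch. The function names `term_share` and `rank` and the sample pages are made up for this example, and the tokenizer is deliberately naive; a real index would store this share in the cross table at insert time rather than recompute it per search.

```python
import re


def tokenize(text):
    # Naive tokenizer: lowercase runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def term_share(text, word):
    # Fraction of the words on the page that are the given word.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens)


def rank(pages, word):
    # pages maps a title to its text; pages without the word are dropped.
    scored = [(title, term_share(text, word)) for title, text in pages.items()]
    scored = [(title, share) for title, share in scored if share > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    pages = {
        "Short note": "php search",
        "Long article":
            "search search search and many other unrelated words about mysql",
    }
    for title, share in rank(pages, "search"):
        print(f"{title}: {share:.0%}")
```

Here the short page scores 1 out of 2 words and the long one 3 out of 10, so the short, focused page ranks first even though it contains the word fewer times.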
Google also uses a totally different database structure that is better suited to these kinds of queries. It may be hard to build that in MySQL.
You can check whether MySQL's FULLTEXT indexing is any help to you. It indexes your pages, and you can query them using MATCH, which returns a relevance score for each row. I don't know exactly what formulas are used there, but it seems to be pretty smart.
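A MATCH query of that kind might look like the sketch below. The SQL is MySQL's `MATCH ... AGAINST` syntax, but the `pages` table with `title` and `body` columns is an assumption, as is the `fulltext_search` helper. It is written in Python against any DB-API cursor whose driver uses `%s` placeholders (PyMySQL, for example); the same query string would work from PHP.

```python
# Assumes a hypothetical table created along these lines:
#   ALTER TABLE pages ADD FULLTEXT INDEX ft_pages (title, body);
# The column list in MATCH() must match the columns of the FULLTEXT index.
FULLTEXT_QUERY = """
    SELECT id, title,
           MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score
    FROM pages
    WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE)
    ORDER BY score DESC
    LIMIT %s
"""


def fulltext_search(cursor, terms, limit=10):
    # The search terms are passed as parameters, never concatenated into the SQL.
    cursor.execute(FULLTEXT_QUERY, (terms, terms, limit))
    return cursor.fetchall()
```

Called as `fulltext_search(connection.cursor(), "custom search engine")`, it would return rows of `(id, title, score)` with the highest score first.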
If all of your pages are public you might want to consider using Google Custom Search or something like that. It will save you a lot of time.
As others have suggested, don't roll your own; SQL is no good for searching. We use a system based on Solr, accessed through the Solr PHP Client library. You'll get far better performance, support for much more powerful boolean queries (e.g. this AND that AND (this OR that)), searching within documents (e.g. PDF, Word, and XLS files) through Tika, and so on.
- http://lucene.apache.org/solr/
- http://code.google.com/p/solr-php-client/
If you want to crawl your own website, you can also look into Nutch.
- http://nutch.apache.org/
I second El Yobo: if you are going for a full-blown search engine, you will have better luck with Lucene-based clients, but if you are looking for a quick solution, Google CSE is the best.