I'm going to make my own search engine.
While reading about search engines, crawlers, and so on, I got confused about Nutch.
I don't understand what Nutch is. Is it for internal use like Lucene (correct me if I'm wrong), or is it a framework for creating a search engine (for example Google, Bing, Yahoo)?
Nutch is a full-featured search engine: it can crawl external web sites, and it understands and respects robots.txt.
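To make "respects robots.txt" concrete, here is a minimal sketch of the kind of check a polite crawler runs before fetching a URL. It uses Python's standard `urllib.robotparser`, not Nutch's own code; the robots.txt content, the `MyCrawler` agent name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from http://<host>/robots.txt before crawling that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    # "MyCrawler" is an assumed user-agent name.
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        print(url, allowed(parser, "MyCrawler", url))
```

A crawler that skips every URL for which `allowed` returns False is "respecting" robots.txt in the sense used above.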
http://nutch.apache.org/about.html
Overview: Nutch is open source web-search software. It builds on Lucene and Solr, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.
The system can be enhanced (e.g. other document formats can be parsed) using a plugin mechanism.
For more information about Nutch, please see the Nutch wiki.
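To give a rough idea of what "a crawler, a link-graph database, parsers for HTML" means in practice, here is a toy sketch of those three pieces working together. It is not how Nutch is implemented: the `PAGES` dictionary stands in for real HTTP fetches, the link graph is a plain in-memory dict, and the names `LinkExtractor` and `crawl` are invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}


class LinkExtractor(HTMLParser):
    """HTML parser that collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, pages):
    """Breadth-first crawl from seed; returns {url: [outgoing links]}."""
    link_graph = {}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in link_graph or url not in pages:
            continue  # already visited, or nothing to fetch
        extractor = LinkExtractor()
        extractor.feed(pages[url])
        # Resolve relative links against the page they were found on.
        outlinks = [urljoin(url, href) for href in extractor.links]
        link_graph[url] = outlinks
        frontier.extend(outlinks)
    return link_graph


graph = crawl("http://example.com/", PAGES)

if __name__ == "__main__":
    for page, links in graph.items():
        print(page, "->", links)
```

Lucene on its own only covers indexing and searching text you hand to it; the fetching, link extraction, and link bookkeeping sketched here are the web-specific parts that Nutch adds on top.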
Nutch is a ready-made, configurable web crawler with a Java Servlet for performing searches. If you want to build a search engine as a project of your own, Nutch probably does too much, since all that would be left for you is creating the pages for entering searches and displaying results.