Marc Najork and Allan Heydon have written an excellent paper on Mercator, their scalable and extensible web crawler written in Java.
Here are some resources on the Mercator web crawler:
- Mercator Presentation (pdf)
- Mercator Introduction (pdf)
- Mercator Web Crawler Paper (pdf), the first Google result for the query "Web Crawling Contents Najork pdf"
Has anybody seen any implementations of the crawler (preferably in Java)?
Update:
I've found a couple of Java crawlers that are supposed to be pretty close to Mercator:
- Nutch is multithreaded and distributed.
- Heritrix is multithreaded but not distributed.
Other references are welcome.
- Crawler4j - http://code.google.com/p/crawler4j/
- WebSPHINX - http://www.cs.cmu.edu/~rcm/websphinx/
StormCrawler is an open source SDK for building low-latency, distributed web crawlers with Apache Storm. The project is under Apache license v2 and consists of a collection of reusable resources and components, written mostly in Java.