Are there any open-source implementations of the Mercator Web Crawler [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Marc Najork and Allan Heydon have written an excellent paper on Mercator, their scalable and extensible web crawler written in Java.

Here are some resources on the Mercator web crawler:

Mercator Presentation (pdf)
Mercator Introduction (pdf)
Mercator Web Crawler Paper (pdf)
First result in Google for the query: "Web Crawling Contents Najork pdf"

Has anybody seen any open-source implementations of the crawler (preferably in Java)?

Update:

I was having trouble with the links, but I think I've fixed them now.

I've found a couple of Java crawlers that are supposed to be pretty close to Mercator:

Nutch is multithreaded and distributed.
Heritrix is multithreaded but not distributed.

Other references are welcome.

Crawler4j - http://code.google.com/p/crawler4j/
WebSPHINX - http://www.cs.cmu.edu/~rcm/websphinx/

StormCrawler is an open-source SDK for building low-latency, distributed web crawlers with Apache Storm. The project is released under the Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.
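
If the goal is to understand the design rather than adopt a whole framework, the idea I remember best from the Mercator paper is its URL frontier: URLs are routed to per-worker FIFO subqueues by host name, so a given host is only ever fetched by one thread. The following is a minimal sketch of that idea only, not code from Mercator or from any of the projects above. The class name `PoliteFrontier`, its methods, and the in-memory `HashSet` standing in for the URL-seen test are all my own assumptions.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;

/** Hypothetical sketch of a host-partitioned URL frontier (not Mercator's actual code). */
public class PoliteFrontier {
    // One FIFO subqueue per worker thread.
    private final List<Queue<String>> subqueues = new ArrayList<>();
    // Assumption: a plain in-memory set stands in for a real URL-seen test.
    private final Set<String> seen = new HashSet<>();

    public PoliteFrontier(int workers) {
        for (int i = 0; i < workers; i++) {
            subqueues.add(new ArrayDeque<>());
        }
    }

    /** Index of the subqueue (and therefore the worker) responsible for this URL's host. */
    public int queueFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in " + url);
        }
        // Lower-case the host so "A.example" and "a.example" land in the same subqueue.
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), subqueues.size());
    }

    /** Enqueue a URL once; returns false if the exact URL string was already seen. */
    public synchronized boolean add(String url) {
        if (!seen.add(url)) {
            return false;
        }
        subqueues.get(queueFor(url)).add(url);
        return true;
    }

    /** Next URL for the given worker, or null if that worker's subqueue is empty. */
    public synchronized String next(int worker) {
        return subqueues.get(worker).poll();
    }

    public static void main(String[] args) {
        PoliteFrontier frontier = new PoliteFrontier(4);
        frontier.add("http://a.example/1");
        frontier.add("http://a.example/2");
        frontier.add("http://b.example/1");
        for (int worker = 0; worker < 4; worker++) {
            String url;
            while ((url = frontier.next(worker)) != null) {
                System.out.println("worker " + worker + " fetches " + url);
            }
        }
    }
}
```

Each worker thread would call `next` with its own index and fetch whatever comes back. A real crawler would also need per-host delays, robots.txt handling, URL canonicalization, and a disk-backed seen set, which is where Nutch, Heritrix, or StormCrawler save a lot of work.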