guide on crawling the entire web?

I just had this thought and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps connection).

I've come across a paper where this was done, but I cannot recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.

Anyway, imagine starting with just around 10,000 seed URLs and doing an exhaustive crawl...

Is it possible?

I need to crawl the web but am limited to a dedicated server. How can I do this? Is there an open-source solution out there already?

For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?


Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph. Each page is a node. Each link is a directed edge.

You could start with the assumption that a single well-chosen starting point will eventually lead to every other point. This won't be strictly true, but in practice I think you'll find it's mostly true. Still, chances are you'll need multiple (maybe thousands of) starting points.

You will want to make sure you don't traverse the same page twice (within a single traversal). In practice the traversal will take so long that it's merely a question of how long before you come back to a particular node and also how you detect and deal with changes (meaning the second time you come to a page it may have changed).

The killer will be how much data you need to store and what you want to do with it once you've got it.
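
To make the directed-graph idea concrete, here is a minimal breadth-first sketch in Python. The seed URL, the crude link-extraction regex, and the page limit are illustrative assumptions only, and it ignores robots.txt and politeness entirely:

```python
# Minimal breadth-first crawl sketch: frontier queue + visited set.
# Assumptions: the requests package is available, links are pulled out of
# href attributes with a crude regex, and max_pages keeps the example small.
import re
from collections import deque
from urllib.parse import urljoin

import requests

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # edges still to follow
    visited = set(seeds)      # don't traverse the same page twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue          # unreachable node: skip it
        fetched += 1
        # This is where "what you want to do with it" happens: store resp.text somewhere.
        for link in HREF_RE.findall(resp.text):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)
    return visited

if __name__ == "__main__":
    print(len(crawl(["https://example.com/"], max_pages=10)))
```

A real crawler would replace the in-memory visited set with something disk-backed (or a Bloom filter, as another answer below suggests) long before it ran out of RAM.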


Sorry to revive this thread after so long, but I just wanted to point out that if you are just in need of an extremely large web dataset, there is a much easier way to get it than to attempt crawling the entire web yourself with a single server: just download the free crawl database provided by the Common Crawl project. In their words:

We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.

As of today their database is petabytes in size, and contains billions of pages (trillions of links). Just download it, and perform whatever analysis you're interested in there.
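
For example, here is a hedged sketch of reading one downloaded Common Crawl segment with the warcio library ("segment.warc.gz" is a placeholder file name; the real segment paths are listed on the Common Crawl site):

```python
# Sketch: iterate over pages in a downloaded Common Crawl WARC segment.
# Assumes the warcio package is installed and 'segment.warc.gz' stands in
# for a real file from the Common Crawl listings.
from warcio.archiveiterator import ArchiveIterator

with open("segment.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":          # actual HTTP responses (pages)
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()  # raw page bytes
            # ...run whatever analysis you're interested in on `html`...
            print(url, len(html))
```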


I believe the paper you're referring to is "IRLbot: Scaling to 6 Billion Pages and Beyond". This was a single server web crawler written by students at Texas A&M.

Leaving aside issues of bandwidth, disk space, crawling strategies, robots.txt/politeness - the main question I've got is "why?" Crawling the entire web means you're using shared resources from many millions of web servers. Currently most webmasters allow bots to crawl them, provided they play nice and obey implicit and explicit rules for polite crawling.

But each high-volume bot that hammers a site without obvious benefit results in a few more sites shutting the door to everything besides the big boys (Google, Yahoo, Bing, etc). So you really want to ask the why question before spending too much time on the how.

Assuming you really do need to crawl a large portion of the web on a single server, then you'd need to get a fatter pipe, lots more storage space (e.g. assume 2K compressed text per page, so 2TB for 1B pages), lots more RAM, at least 4 real cores, etc. The IRLbot paper would be your best guide. You might also want to look at the crawler-commons project for reusable chunks of Java code.
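
As a quick back-of-the-envelope check on those numbers (the storage figure is the 2K-per-page assumption quoted above; the 10 KB average uncompressed page size and the 100 Mbps link are added assumptions from the question):

```python
# Back-of-the-envelope crawl budget for 1B pages on a single server.
pages = 1_000_000_000          # 1B pages
stored_per_page = 2 * 1024     # ~2 KB compressed text kept per page
fetched_per_page = 10 * 1024   # ~10 KB average uncompressed page (assumption)
link_bytes_per_s = 100e6 / 8   # 100 Mbps pipe = 12.5 MB/s

storage_tb = pages * stored_per_page / 1024**4
download_days = pages * fetched_per_page / link_bytes_per_s / 86400

print(f"storage needed: ~{storage_tb:.1f} TB")       # ~1.9 TB, roughly the 2 TB above
print(f"raw download:   ~{download_days:.0f} days")  # ~9 days, ignoring politeness delays
```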

And a final word of caution. It's easy for an innocent mistake to trigger problems for a web site, at which time you'll be on the receiving end of an angry webmaster flame. So make sure you've got thick skin :)


See this for an alternative solution, depending on what you'd be looking to do with that much data (even if it were possible): http://searchenginewatch.com/2156241

EDIT: Also, don't forget the web is changing all the time, so even relatively small crawling operations (like classifieds sites that aggregate listings from lots of sources) refresh their crawls on a cycle, say every 24 hours. That's when website owners may start being inconvenienced by the load your crawler puts on their servers. And then, depending on how you use the crawled content, you've got de-duping to think about, because you need to teach your systems to recognise whether yesterday's crawl results differ from today's. It gets very "fuzzy", not to mention the computing power needed.
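
A hedged sketch of the simplest form of that change detection, comparing a content hash from yesterday's crawl against today's (the whitespace normalisation is an illustrative assumption; real de-duping uses fuzzier similarity measures such as shingling or simhash):

```python
# Sketch: detect whether a re-crawled page actually changed since the last crawl.
# An exact hash is the crudest possible version; it only catches byte-identical
# (after normalisation) pages, but shows where de-duping plugs into the pipeline.
import hashlib

def fingerprint(html: str) -> str:
    # Crude normalisation (assumption): collapse whitespace before hashing so
    # trivial formatting changes don't count as "new content".
    normalised = " ".join(html.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

previous = {}  # url -> fingerprint from yesterday's crawl

def has_changed(url: str, html: str) -> bool:
    fp = fingerprint(html)
    changed = previous.get(url) != fp
    previous[url] = fp
    return changed
```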


Bloom filter for detecting where you have been.

There will be false positives, but you can work around this by implementing multiple Bloom filters, rotating which one new URLs get added to, and making each filter impressively long.

http://en.wikipedia.org/wiki/Bloom_filter
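
A minimal Bloom-filter sketch for "have I been here before?" checks (the bit-array size and number of hash functions are illustrative; tune them to your expected URL count and acceptable false-positive rate):

```python
# Minimal Bloom filter for URL de-duplication.
# m_bits and k are illustrative defaults; size them for your crawl.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=8 * 1024 * 1024, k=5):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return True for items never added (false positive), but never
        # returns False for an item that was added.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("http://example.com/")
print("http://example.com/" in seen)   # True
print("http://example.org/" in seen)   # almost certainly False
```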


I bet it is possible. You only need to have a quantum CPU and quantum RAM.

Seriously, a single server wouldn't be able to keep up with the growth of the entire web. Google uses a huge farm of servers (tens, if not hundreds, of thousands), and even it can't provide immediate indexing.

I guess if you're limited to a single server and need to crawl the entire web, what you really need is the results of that crawl. Instead of focusing on "how to crawl the web", focus on "how to extract the data you need using Google". A good starting point for that would be the Google AJAX Search API.


Sounds possible, but the two real problems will be the network connection and hard-drive space. Speaking as someone who knows almost nothing about web crawling, I'd start with several terabytes of storage and a good broadband internet connection, and work my way up as I amass more information. Deep pockets are a must for this!


I suspect the whole Internet is rather larger than 750 GB. Moreover, the data structures needed to index the web also take a lot of storage.
